
Support for read-only availability when in a minority partition #6150

Open archiecobbs opened 8 years ago

archiecobbs commented 8 years ago

One application that I work with has somewhat unique requirements.

I'm documenting those requirements here, really just "for the record", because this is an oddball usage scenario that you may not have thought about before. Whether you deem it important or relevant enough to support is another question... but FWIW, there is at least one actual example of this being done in a production environment. So consider this a long-shot feature request.

The basic idea is that we want nodes to be able to continue to read (but not write) the database even if they are partitioned in a minority.

The requirements for doing this are as follows:

  • You have a small set of nodes (e.g., 5) working together to maintain a consistent, distributed database.
  • As long as a node is part of a majority, everything works normally.
  • If something occurs that makes a node unable to communicate with the majority (e.g., it gets partitioned off by itself), or any other failure external to the node makes the clustered database unavailable, the node has a way to reliably and relatively quickly detect this scenario, and it goes into "standalone" mode.
    • For example, we might decree that after three consecutive attempts to commit a fully consistent read transaction end in retry errors, the node enters standalone mode; if a subsequent attempt succeeds, it exits standalone mode (sketched in code below).
  • While in standalone mode, the node is still able to access, via a read-only transaction, the entire database, in a state at least as current as what it saw in the most recently committed (local) transaction prior to the partition.
    • That is, it must be possible to access the database in a read-only transaction that provides reduced consistency (i.e., the last known committed state rather than the current committed state) but requires no network communication. This means there must be an SQL command to configure this reduced consistency level for the current transaction.
    • From a Raft point of view, think "read from the most recently committed Raft state known locally".
    • Because the entire database must be available, the cluster must be configurable for full redundancy.

Obviously this is a situation where we are optimizing for application continuity (albeit hobbled by being restricted to read-only snapshots from the point in time of the partition) rather than scalability.
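
To make the detection rule concrete, here is a minimal Go sketch of that heuristic. It assumes the application probes the cluster with any cheap, fully consistent read; the three-failure threshold is the one suggested above, and all names are illustrative:

```go
package example

import (
	"database/sql"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

// probe attempts one fully consistent read-only transaction and
// reports whether it committed successfully.
func probe(db *sql.DB) bool {
	tx, err := db.Begin()
	if err != nil {
		return false
	}
	// Any cheap consistent read will do as a health check.
	if _, err := tx.Exec(`SELECT 1`); err != nil {
		tx.Rollback()
		return false
	}
	return tx.Commit() == nil
}

// tracker applies the rule above: three consecutive failed probes
// enter standalone mode; a single success exits it.
type tracker struct {
	failures   int
	standalone bool
}

func (t *tracker) observe(ok bool) {
	if ok {
		t.failures = 0
		t.standalone = false
		return
	}
	t.failures++
	if t.failures >= 3 {
		t.standalone = true
	}
}
```

The application would call probe on a timer and feed the result to observe, switching its own behavior whenever standalone flips.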

Jira issue: CRDB-6189

andreimatei commented 8 years ago

Like you say, this only works if "the cluster must be configurable for full redundancy" holds; otherwise its utility drops quickly. We don't have a good way to enforce this "must be configurable" part - you'd be responsible for setting your replication factor to be the size of the cluster, with no warning if that stops being true because you've added more nodes.

There's been talk recently about supporting reads at older timestamps, which presumably wouldn't require going through the Raft leader. I think it's very likely we'll have something like that at some point, although I'm not sure we've reached a conclusion about exactly when and how to implement it. In any case, it probably wouldn't be based on detecting some "standalone" state for a node, but rather on the user explicitly indicating an older timestamp.
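
For illustration, here is roughly what a read at an explicitly older timestamp looks like from a Go client. The AS OF SYSTEM TIME syntax is the one CockroachDB later shipped for this; treat the snippet as a sketch, with accounts as a stand-in table:

```go
package example

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

// readAsOf runs a read at an explicit, user-supplied older timestamp
// rather than at the current time.
func readAsOf(db *sql.DB, interval string) error {
	// e.g. interval = "-10s": read the state as of ten seconds ago.
	// The AS OF clause takes a constant expression, so it is
	// interpolated into the statement rather than bound as a placeholder.
	rows, err := db.Query(fmt.Sprintf(
		`SELECT id, balance FROM accounts AS OF SYSTEM TIME '%s'`, interval))
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var id int64
		var balance float64
		if err := rows.Scan(&id, &balance); err != nil {
			return err
		}
		fmt.Println(id, balance)
	}
	return rows.Err()
}
```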


andreimatei commented 8 years ago

Looks like people are discussing historical reads as we speak at https://github.com/cockroachdb/cockroach/issues/5963


archiecobbs commented 8 years ago

@andreimatei Regarding full redundancy, I agree it's not the database's job to ensure you have configured it the way you want. It simply needs to support full redundancy, which is easy.

Regarding #5963: that's talking about a much more complicated and general requirement than what's needed here. This issue's requirement doesn't need the ability to specify the timestamp; it only needs to be able to say "read me the most up-to-date, committed data you have"... and do so in a manner that doesn't depend on communication with other nodes.

Really this is not "time travel", but rather a weaker form of consistency that has the benefit of requiring no network communication. Such transactions would therefore be extremely fast, and very likely useful in other contexts where this speed vs. consistency trade-off is worthwhile.
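
Nothing like this mode exists, so the snippet below is purely hypothetical: the invented CONSISTENCY clause stands in for whatever SQL surface such a reduced consistency level would get, and accounts is a stand-in table.

```go
package example

import (
	"database/sql"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

// readLocalLatest sketches the proposed mode: a read-only transaction
// over the most recent locally committed state, requiring no network
// round trips. The CONSISTENCY clause is invented for illustration;
// CockroachDB has no such consistency level.
func readLocalLatest(db *sql.DB) (count int64, err error) {
	tx, err := db.Begin()
	if err != nil {
		return 0, err
	}
	defer tx.Rollback() // no-op if the commit below succeeds

	// Hypothetical knob: "give me the newest committed data this node
	// has, without talking to any other node."
	if _, err := tx.Exec(`SET TRANSACTION READ ONLY, CONSISTENCY 'local-latest'`); err != nil {
		return 0, err
	}
	if err := tx.QueryRow(`SELECT count(*) FROM accounts`).Scan(&count); err != nil {
		return 0, err
	}
	return count, tx.Commit()
}
```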

bdarnell commented 8 years ago

> you'd be responsible for setting your replication factor to be the size of the cluster

The replication factor doesn't have to be the size of the cluster: certain kinds of failures could be handled with other configurations. For example, if the DB has three replicas in each of three datacenters, then any one datacenter has a complete copy of the DB.

> it only needs to be able to say "read me the most up-to-date, committed data you have"... and do so in a manner that doesn't depend on communication with other nodes.

We have an inconsistent read mode as well (#4558), which gives you the latest data that the node you're talking to knows about. But I don't think you really want to use it: it won't give you a consistent snapshot across nodes, so you'll see problems when indexes are out of date with respect to their data, etc. There is also no guarantee that it will give you anything close to the latest data that was available at the time of the partition - I don't think it will be unusual for inconsistent reads to return data that is hours out of date (and wildly inconsistent across ranges), especially if the network was having problems that eventually grew into a full partition.

I think what you want is to read at a specific timestamp; the problem is that the database can't tell you what timestamp to read at - the partition didn't occur at a known instant. You could poll all the nodes to see the last time their data was changed, but does a value in the past indicate when the problems occurred, or was the data on that node just not changing much at the time?

I think the best solution for your application would be to select a timestamp some interval (a few minutes? it depends on how quickly you detect the need to go into read-only mode) before you went into read-only mode and do all your queries at that timestamp until the cluster recovers. Some queries may still fail if you don't have a local copy that is up to date at the selected timestamp, and then you'll need to make your own decision about whether to switch to an older timestamp or just accept the failure.
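
A Go sketch of that strategy, assuming the application records roughly when it entered read-only mode and that local replicas can serve AS OF SYSTEM TIME reads at the chosen timestamp; the five-minute margin, retry count, and table are all illustrative:

```go
package example

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

// queryDuringPartition pins reads to a timestamp safely before the
// partition was detected, stepping further back if no sufficiently
// fresh local copy exists at the chosen timestamp. The caller must
// close the returned rows.
func queryDuringPartition(db *sql.DB, detectedAt time.Time) (*sql.Rows, error) {
	// Start a few minutes before read-only mode was entered; the
	// margin depends on how quickly the partition is detected.
	ts := detectedAt.Add(-5 * time.Minute)
	for i := 0; i < 3; i++ {
		rows, err := db.Query(fmt.Sprintf(
			`SELECT id, balance FROM accounts AS OF SYSTEM TIME '%s'`,
			ts.UTC().Format("2006-01-02 15:04:05.000000")))
		if err == nil {
			return rows, nil
		}
		// The query failed at ts: accept staler data and retry.
		ts = ts.Add(-5 * time.Minute)
	}
	return nil, fmt.Errorf("no readable snapshot found during partition")
}
```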

archiecobbs commented 8 years ago

@bdarnell Thanks for the thoughts & perspective.

Regarding this idea:

> select a timestamp some interval (a few minutes? depends on how quickly you detect the need to go into read-only mode) before you went into read-only mode and do all your queries at that timestamp until the cluster recovers

That's a reasonable approach, but I think we'd need to ensure that it doesn't make time appear to go backwards (it's OK if it makes time appear to stop).

That means somehow choosing a timestamp not prior to the timestamps associated with any recently committed transactions on the local node (is such a notion well defined?).

And even if you could know that timestamp, I'm wondering whether you are guaranteed that the entire database is up to date with respect to that timestamp, i.e., you may only be guaranteed to be up to date with the part of the database that was accessed by those recently committed transaction(s)?

bdarnell commented 8 years ago

You can't ensure that you're not moving backwards in time if you're choosing to prioritize availability during partitions. Transactions don't commit locally and propagate to the remote nodes; they commit at the leader and then propagate to the followers. So there's a good chance that the last write you made was to a remote leader and your local replica doesn't have it yet.

Each range replicates independently, so if you wanted to pick the latest timestamp at which you could be sure you could successfully complete a query, you'd have to find the latest write timestamp at each range, and then take the oldest of all those timestamps. This is going to be fairly expensive to discover, and because it depends on the weakest link it is likely to be behind the last write you've made.
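
The "weakest link" selection itself is just a minimum; the sketch below assumes you have somehow collected the latest locally replicated write timestamp for every range (there is no public API for that discovery step, which is the expensive part):

```go
package example

import "time"

// latestSafeTimestamp returns the newest timestamp at which every
// local replica could serve a read: the oldest of the per-range
// latest-write timestamps, i.e. the weakest link.
func latestSafeTimestamp(lastWritePerRange []time.Time) time.Time {
	if len(lastWritePerRange) == 0 {
		return time.Time{} // no ranges known; no safe timestamp
	}
	safe := lastWritePerRange[0]
	for _, ts := range lastWritePerRange[1:] {
		if ts.Before(safe) {
			safe = ts
		}
	}
	return safe
}
```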

archiecobbs commented 8 years ago

Got it... so with the current design it doesn't seem feasible to offer, during a partition, read-only transactions that guarantee both (a) that time does not go backwards, and (b) that you see a consistent image of the entire database.

I suppose one way you could get this is with the addition of a "partition safe" commit variant. This would be an alternative commit operation that performs a normal commit, but instead of returning immediately it continues blocking until all ranges have replicated to the local node, at least up to the timestamp of the transaction. Obviously this would make the individual transaction take a lot longer to complete... but it shouldn't reduce overall transaction throughput.
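
From the client side, such a variant could only be approximated; the sketch below captures the proposed semantics. cluster_logical_timestamp() is real CockroachDB SQL (it returns the transaction's HLC commit timestamp), but local_replication_watermark() is entirely invented to stand for "all ranges replicated locally up to this timestamp":

```go
package example

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

// commitPartitionSafe performs a normal commit, then blocks until the
// local node has replicated every range up to the transaction's commit
// timestamp. The watermark query is hypothetical; no such built-in exists.
func commitPartitionSafe(db *sql.DB, tx *sql.Tx) error {
	// Capture the transaction's HLC timestamp before committing.
	var commitTS string
	if err := tx.QueryRow(`SELECT cluster_logical_timestamp()`).Scan(&commitTS); err != nil {
		return err
	}
	if err := tx.Commit(); err != nil {
		return err
	}
	// Hypothetical: poll until the local replication watermark across
	// all ranges has reached the commit timestamp.
	for {
		var caughtUp bool
		if err := db.QueryRow(
			`SELECT local_replication_watermark() >= $1`, commitTS).Scan(&caughtUp); err != nil {
			return err
		}
		if caughtUp {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```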

tbg commented 7 years ago

https://github.com/cockroachdb/cockroach/issues/16593 has a related moonshot feature.

andreimatei commented 3 years ago

We're considering some options here for the 21.2 release and onwards.

a-entin commented 3 years ago

Additional requirements to include in the design considerations: https://github.com/cockroachdb/cockroach/issues/61824

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!