fhackett-ms commented 2 years ago

I've been spending a lot of time trying to build a precise user-level understanding of consistency levels for Cosmos DB, and I'm having some fundamental trouble with the logic behind consistent prefix.

This trouble is best explained as two separate points, the first more incidental, and the second more fundamental.

First, I'm confused by the description of allowed and forbidden sequences of writes. To quote:

If writes were performed in the order A, B, C, then a client sees either A, A,B, or A,B,C, but never out-of-order permutations like A,C or B,A,C.

I'm confused why we're mentioning "A,C" here as a forbidden out of order permutation. Clearly, "B,A,C" is out of order, but I can't see why "A,C" is out of order. Logically, if I have a client sending intermittent reads to a Cosmos DB, it would be quite reasonable to see write A then write C, with the write B occurring then being overwritten in between reads.
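To make the two possible readings concrete, here is a minimal Python sketch. The helper names `is_prefix` and `is_in_order_subsequence` are hypothetical and come from neither the documentation nor any Cosmos DB API. The first treats what a client sees as the whole sequence of writes visible so far; the second treats it as the values sampled by intermittent reads.

```python
def is_prefix(observed, history):
    """True if `observed` is exactly the first len(observed) writes of `history`."""
    return list(observed) == list(history[:len(observed)])


def is_in_order_subsequence(observed, history):
    """True if `observed` keeps the relative order of `history`, gaps allowed."""
    remaining = iter(history)
    # `write in remaining` advances the iterator, so order is enforced.
    return all(write in remaining for write in observed)


history = ["A", "B", "C"]
for observed in (["A"], ["A", "B"], ["A", "C"], ["B", "A", "C"]):
    print(observed,
          "prefix:", is_prefix(observed, history),
          "in order:", is_in_order_subsequence(observed, history))
```

Under the first reading "A,C" is rejected because it skips B; under the second it is accepted and only "B,A,C" is rejected. The quoted sentence does not say which reading is intended.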

My second, more fundamental point is the relationship between quorum reads and quorum writes under the consistent prefix consistency level. Consistent prefix is documented to write to a local majority and read from a single replica (see: https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels#consistency-levels-and-throughput). What I don't understand is how that could possibly allow clients to observe the claimed ordering properties.

If writes only need to be acknowledged by a local majority, this means that, for each write, a local minority may hold stale data. Given that read operations may be load balanced to any replica at any time (even if it's unlikely), I can run a thought experiment.

Consider a simple situation with a single write region containing 5 replicas, named 1-5:

Write A is acknowledged by replicas 1, 2, 3. This is a local majority.
Write B is acknowledged by replicas 3, 4, 5. This is also a local majority, they don't have to be the same majority.

Consider that now replicas 1 and 2 store write A and replicas 3, 4, 5 store write B. This is all technically fine, and a stronger consistency read operation could reasonably find the majority value, B.

Now, a client makes two reads with consistent prefix consistency. These reads can reasonably be load balanced to any replica. No other events or replication occur (imagine the client is very fast or the system is having a slow moment). It is entirely possible that:

the first read goes to replica 5 and "sees" write B
then the second read goes to replica 2 and "sees" write A

By this scenario, our client will see "B, A", which directly contradicts the consistency guarantee. Adding another write "C" could recreate the exact sequence the documentation claims is impossible.
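The thought experiment can be replayed as a small Python sketch. This is a toy model under the assumptions stated above (five replicas, majority-acknowledged writes, single-replica reads, no background replication); the names `write`, `read_single`, and `read_majority` are hypothetical and do not correspond to any Cosmos DB API.

```python
from collections import Counter

REPLICAS = (1, 2, 3, 4, 5)
MAJORITY = len(REPLICAS) // 2 + 1  # 3 of 5


def write(store, value, acked_by):
    """Apply `value` on the acknowledging replicas; require a local majority."""
    if len(set(acked_by)) < MAJORITY:
        raise ValueError("write was not acknowledged by a majority")
    for replica in acked_by:
        store[replica] = value


def read_single(store, replica):
    """Single-replica read: return whatever that replica currently holds."""
    return store[replica]


def read_majority(store):
    """Return the value held by a majority of replicas, or None if there is none."""
    value, count = Counter(store.values()).most_common(1)[0]
    return value if count >= MAJORITY else None


store = {replica: None for replica in REPLICAS}
write(store, "A", acked_by=(1, 2, 3))
write(store, "B", acked_by=(3, 4, 5))

# Two single-replica reads, load balanced to replica 5 and then replica 2.
observed = [read_single(store, 5), read_single(store, 2)]
print(observed)              # ['B', 'A']
print(read_majority(store))  # B
```

In this model nothing stops the second read from returning the older write, which is the "B, A" sequence described above.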

Therefore, by this logic, I'm not sure how a client can be relied upon to witness writes according to the stated consistent prefix guarantee. In fact, I can't pin down any practical distinction from eventual consistency.

One last point, for context: in the TLA+ specifications linked from the documentation, the two consistency levels are also defined identically. The two identical definitions start at this anchor link https://github.com/Azure/azure-cosmos-tla/blob/master/scenario1/swscop.tla#L32 (scenario1/swscop.tla lines 32-37, repeating the same two-line definition for each consistency level).

This whole story is quite confusing from a reader's perspective, so I would appreciate your thoughts.

Document Details

⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

ID: 2b3fb7da-5483-160d-1781-d4abff9d798d
Version Independent ID: d5e51b7e-7f34-c7df-ce4f-ee4bf5c8059a
Content: Consistency levels in Azure Cosmos DB
Content Source: articles/cosmos-db/consistency-levels.md
Service: cosmos-db
GitHub Login: @seesharprun
Microsoft Alias: sidandrews

GeethaThatipatri-MSFT commented 2 years ago

Hi @fhackett-ms, thank you for reaching out. We are currently reviewing the ask and will provide an update as appropriate.

seesharprun commented 2 years ago

@markjbrown, do you have any guidance here or a colleague that would like to chime in?

markjbrown commented 2 years ago

Yes, for sure. Adding @abinav2307 who can address these questions. -Thx

seesharprun commented 2 years ago

Hello @abinav2307, can you help answer the questions in this issue?

fhackett-ms commented 2 years ago

Hello @abinav2307 - just checking in: is there any progress on answering, or at least commenting on, this issue to some degree? My project that led to these questions is based on studying Cosmos DB's publicly available semantics, and having at least some feedback from the documentation authors would be valuable during wrap-up.

saidatta commented 2 years ago

From my understanding of the docs, the consistent prefix guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.

This prevents the scenario where a user reading from the database sees some parts of it in an older state and others in a newer state.

fhackett-ms commented 2 years ago

@saidatta We agree on what that section says. To look at my question "backwards", if that statement is true, what did I specifically get wrong in my counter-example?

Something I said must be wrong, for that statement about consistent prefix to be true in general. I spent a lot of time on modeling and working with these semantics, so can you please specifically point out any mistake I might have made?

A specific mistake in my reasoning would be very important to know.

saidatta commented 2 years ago

This is my two cents. One way of implementing consistent prefix is to have related records delivered internally to one partition, or to a small local majority of partitions, while grouping causally related items together and assigning them logical timestamps.

For example, say data is partitioned on the userID of a user interacting with an e-commerce website. If the user executed actions A, B, C, D, then this set of actions is stored together internally, similar to how Kafka gives its consumers in-order message delivery.

fhackett-ms commented 2 years ago

Thank you for your response. I'm now trying to figure out how your comment relates to my original example.

Since you mention partitions, I went and re-studied them a little. I found this section+diagram to be helpful: https://docs.microsoft.com/en-us/azure/cosmos-db/partitioning-overview#replica-sets. By this explanation, partitions are just a way to "stripe" data across regions, a little similar to RAID. If actions A, B, C, D apply to the same record (I assume they do), then all requests will go to the same partition (the same vertical column in that image). This partition would be composed of some number of servers (in my example, 5), and I'm assuming we have just one region for simplicity.

I guess I don't see how partitions are important here, based on what I've read above.

I'm also not sure how this "stored internally together" would work, so let's make a variation of my original scenario, assuming Cosmos DB makes an effort to group causally related events together. I am confident that your idea will work under ideal conditions, so I will add some switch failures Cosmos DB should be able to tolerate, and see if anything "goes wrong".

Imagine the same 5 nodes as my original example. Node 3 is the master; all writes go to it.

Writes A, B, C, D go to node 3, which replicates them together to nodes 1, 2.
Temporary failure, a switch goes down in the datacenter and replicas 1, 2 become inaccessible for a few seconds while things fail over.
During these few seconds, the same client sends write E. There's still a majority of nodes available (3, 4, 5). Does Cosmos DB refuse the request even though it can still perform the majority write it's supposed to do, just because it can't batch it properly? Assuming it services the write, write E ends up on replicas 3, 4, 5.
The client also re-reads what it just wrote, and sees E.
The few seconds are over... but it's a bad day in the data center and 3, 4, 5 become inaccessible now for another few seconds. Only 1, 2 are available, and these replicas could not see any messages from 3, 4, 5 since write D.
The client tries to read again, makes a couple of retries until it reaches replica 1 or 2, and reads D. This is a single replica read, which is OK based on the docs. It could detect that D is out of order with logical timestamps and refuse to service this read until 3, 4, or 5 are back up, but then it would be the same as session consistency.
Second fail-over ends, everything back to normal.
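The failover scenario above can also be replayed as a toy Python model. It is only a sketch: `replicate` and `read_latest` are hypothetical helpers, each replica is modeled as an append-only log, and the sketch assumes replicas 4 and 5 hold only write E (the scenario does not say whether they received A-D, and it does not affect the result).

```python
# Each replica keeps an append-only log of the writes it has received.
logs = {replica: [] for replica in range(1, 6)}


def replicate(writes, targets):
    """Append `writes` to the log of every replica in `targets`."""
    for replica in targets:
        logs[replica].extend(writes)


def read_latest(replica, reachable):
    """Single-replica read: latest write on `replica`, if it is reachable."""
    if replica not in reachable:
        raise ConnectionError(f"replica {replica} is unreachable")
    return logs[replica][-1]


observed = []

# Writes A-D reach the master (node 3) and are replicated together to 1 and 2.
replicate(["A", "B", "C", "D"], targets=[3, 1, 2])

# First switch failure: only 3, 4, 5 are reachable. E is still a majority write.
reachable = {3, 4, 5}
replicate(["E"], targets=[3, 4, 5])
observed.append(read_latest(4, reachable))

# Second failure: only 1 and 2 are reachable, and neither has seen E.
reachable = {1, 2}
observed.append(read_latest(1, reachable))

print(observed)  # ['E', 'D']
```

In this model the client reads E and then D unless something outside the replicas rejects the second read.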

Based on this scenario, I see a few conclusions possible:

1. Consistent prefix will refuse writes if a minority of replicas become inaccessible (in order to preserve causal request batching), meaning it will drop requests even when strong consistency could still service requests.
2. Consistent prefix uses a hidden session token on the client side to discard out-of-order reads.
3. Consistent prefix does neither of these things, and does not offer monotonic reads.

I doubt (1) is the case, as that would be unreasonable. (2) might work, but then isn't it just session consistency with a hidden token, offering the same guarantees? (3) would be my original hypothesis.

It seems (2) or (3) are the plausible conclusions, but (2) is just as weird as (3) in a way, because that would mean consistent prefix is just a special case of session consistency.
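For reference, conclusion (2) would look roughly like the following Python sketch. Everything here is hypothetical: `TokenFilteringClient`, `StaleRead`, and the integer logical timestamps are illustrative assumptions, not a description of how Cosmos DB or its SDKs actually behave.

```python
class StaleRead(Exception):
    """Raised when a replica returns data older than the client has already seen."""


class TokenFilteringClient:
    """Hypothetical client that remembers the newest logical timestamp it has seen."""

    def __init__(self):
        self.high_water_mark = 0  # the hidden "token"

    def accept(self, value, logical_ts):
        """Return `value` if it is at least as new as anything seen before."""
        if logical_ts < self.high_water_mark:
            raise StaleRead(
                f"read at ts={logical_ts} is older than ts={self.high_water_mark}"
            )
        self.high_water_mark = logical_ts
        return value


client = TokenFilteringClient()
print(client.accept("E", logical_ts=5))  # E

try:
    client.accept("D", logical_ts=4)  # an older write served by a lagging replica
except StaleRead as err:
    print("rejected:", err)
```

The state such a client carries between requests plays the same role as a session token, which is why (2) looks like session consistency under another name.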

P.S. I can provide diagrams of my little scenario above if it helps.

seesharprun commented 2 years ago

We have a new update to the article that should be live later today to address this issue.

please-close

imnaseer commented 2 years ago

I find the explanation in the edits to be confusing. They may have made the text more accurate, but they do not help the reader understand how the system behaves.

lemmy commented 2 years ago

Related comments elsewhere:

https://github.com/MicrosoftDocs/azure-docs/commit/b16160b2a7d25266c8242d24fd31bd420938c614#r85187515

https://github.com/MicrosoftDocs/azure-docs/commit/b16160b2a7d25266c8242d24fd31bd420938c614#r85280090

MicrosoftDocs / azure-docs

Having trouble distinguishing consistent prefix guarantees from eventual consistency #95928
