cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kv: coordinator should update HLC from received responses #3525

Closed: changrui82 closed this issue 8 years ago

changrui82 commented 8 years ago

Currently, `func (c *Clock) Update(rt roachpb.Timestamp)` is only invoked when a range node receives a request from a gateway node; it is not invoked when the gateway node receives the response back. That causes a problem:

1) Suppose the HLC on nodeA currently reads '1' and the HLC on nodeB reads '100'.
2) nodeB (as gateway node) starts a txn to read keyA from nodeB (as range node), so the tsCache on nodeB records read timestamp '100' for keyA. The txn commits immediately.
3) nodeA (as gateway node) starts a txn to write keyA on nodeB (as range node) with timestamp '1' (the txn fetched its timestamp from its gateway node, nodeA). The tsCache on nodeB holds read timestamp '100' for keyA, so the txn's timestamp gets pushed to '101'. The txn commits immediately.
4) nodeA does not invoke `func (c *Clock) Update(rt roachpb.Timestamp)` when it receives the response from nodeB, so nodeA then starts a new txn to read keyA from nodeB with timestamp '2'. This new txn cannot see the newest value (written at timestamp '101'), even though nodeA wrote it itself. That violates the "program order" rule of "Sequential Consistency".

So I think we need to invoke `func (c *Clock) Update(rt roachpb.Timestamp)` when the gateway node receives a response from the range node. Then nodeA can read keyA at timestamp '102'.
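For concreteness, here is a minimal toy HLC in Go illustrating the proposal. The types and the injected physical clock are simplified stand-ins, not CockroachDB's actual `hlc` package; only the ratcheting behaviour of `Update` mirrors the real thing.

```go
package main

import (
	"fmt"
	"sync"
)

// Timestamp is a hybrid logical clock reading: physical wall time plus a
// logical counter that orders events sharing the same wall tick.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(u Timestamp) bool {
	return t.WallTime < u.WallTime ||
		(t.WallTime == u.WallTime && t.Logical < u.Logical)
}

// Clock is a toy HLC; physicalNow is injected so the example can use the
// small numbers from the scenario above.
type Clock struct {
	mu          sync.Mutex
	state       Timestamp
	physicalNow func() int64
}

// Update ratchets the clock forward to at least rt. Invoking this with
// timestamps from responses (not only requests) is the proposal: after
// nodeA hears '101' back from nodeB, its next timestamp is ahead of '101'.
func (c *Clock) Update(rt Timestamp) Timestamp {
	c.mu.Lock()
	defer c.mu.Unlock()
	if now := (Timestamp{WallTime: c.physicalNow()}); c.state.Less(now) {
		c.state = now
	}
	if c.state.Less(rt) {
		c.state = Timestamp{WallTime: rt.WallTime, Logical: rt.Logical + 1}
	}
	return c.state
}

func main() {
	c := Clock{physicalNow: func() int64 { return 2 }} // nodeA's slow clock
	fmt.Println(c.Update(Timestamp{WallTime: 101}))    // {101 1}: ahead of the write
}
```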

tbg commented 8 years ago

Your analysis is correct, but there's one moving part you have forgotten: the maximum clock offset. What would happen in step four of your analysis is that the transaction comes along with timestamp '2', but there is a write at '101'. Assuming you're not running with a MaxOffset setting below the actual offset (~100 in your example), this read will return a `ReadWithinUncertaintyIntervalError` (which includes a higher timestamp at which the coordinator will retry).
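To illustrate, here is a sketch of that uncertainty check with scalar timestamps. The constant and function name are made up for the example; `ReadWithinUncertaintyIntervalError` itself is the real error type, but this check is a simplification of the real logic.

```go
package main

import "fmt"

const maxOffsetNanos = int64(250) // stand-in for the cluster's MaxOffset

// checkUncertainty reports whether a value written at valueTS is uncertain
// for a read at readTS: newer than the read, but close enough (within
// MaxOffset) that it may actually have happened before the read in real time.
func checkUncertainty(readTS, valueTS int64) error {
	if valueTS > readTS && valueTS <= readTS+maxOffsetNanos {
		// Corresponds to ReadWithinUncertaintyIntervalError: the coordinator
		// retries the read at (at least) valueTS.
		return fmt.Errorf("read at %d uncertain: value at %d; retry at >= %d",
			readTS, valueTS, valueTS)
	}
	return nil
}

func main() {
	fmt.Println(checkUncertainty(2, 101)) // error: the retry sees the write at 101
}
```

Note that if `maxOffsetNanos` were 50, below the actual ~100 offset, the check would pass and the read at '2' would silently miss the write at '101'.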

I agree with your analysis though that there's room for improvement there. We could go along with your suggestion of updating the timestamp when the gateway receives a response, but it will not fix the problem entirely.

Assume we update the clock as you suggest, but the client's next request goes to a different coordinator that had nothing to do with the previous steps and has the same slow clock. It will then pick a problematic timestamp, so the underlying behaviour doesn't change. It does reduce the likelihood of restarts, though, so I think we should do it.

In the long run, we'll probably want to use HLC timestamps as "causality tokens" - all operations which need to be linearized pass the highest seen timestamp along. That means we don't try to blindly serialize everything, but only operations where it actually matters. We're not doing that now, and it would require client-side support, so it's definitely further down the road.
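A hypothetical sketch of what causality tokens could look like (none of these names exist in CockroachDB today, and, as noted, this would need client-side support):

```go
package main

import "fmt"

// CausalityToken carries the highest timestamp the client has observed.
type CausalityToken struct{ MaxSeen int64 }

// Coordinator stands in for a gateway node with its own (possibly slow) clock.
type Coordinator struct{ clock int64 }

// Serve picks an operation timestamp no older than the client's token and
// hands back an updated token, so "program order" holds across coordinators
// without blindly serializing unrelated operations.
func (c *Coordinator) Serve(tok CausalityToken) CausalityToken {
	if tok.MaxSeen > c.clock {
		c.clock = tok.MaxSeen // ratchet, like Clock.Update
	}
	c.clock++ // the operation happens at a fresh timestamp
	fmt.Println("served at", c.clock)
	return CausalityToken{MaxSeen: c.clock}
}

func main() {
	slow, fast := &Coordinator{clock: 1}, &Coordinator{clock: 100}
	tok := fast.Serve(CausalityToken{}) // the write lands at 101
	slow.Serve(tok)                     // the next read runs at 102 despite the slow clock
}
```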

Another half-baked thought is letting Node pick the timestamp (as opposed to TxnCoordSender). But even then, your second transaction may have its first contact with a node which doesn't have anything to do with the conflicting key, but picks a bad timestamp. Since the idea gets in the way of parallelizing DistSender, I don't think it's a great one.

changrui82 commented 8 years ago

About MaxOffset: I think I can set it to 0 if I don't care about "External Consistency" (strong consistency). In my opinion, the client side is outside the CockroachDB server system, and requests from different coordinators don't need to conform to the "program order" rule of "Sequential Consistency" (internal consistency). So I think we can achieve "Sequential Consistency" completely if the coordinator updates the HLC when it receives a response from a range node. "External Consistency" is a stronger requirement; MaxOffset works for that.

tbg commented 8 years ago

As far as I'm concerned, the "processor" in sequential consistency is the client (that may not be true for your use case, but in general that's what I would want), and the client could be talking to random nodes through a load balancer. Within a transaction there's no problem because we're pinned to one coordinator, but two sequential transactions could run into the above issue.

But, as said above, I think we should go with your suggestion. I just wanted to point out that it doesn't fully satisfy me yet (I think that would need causality tokens).

bdarnell commented 8 years ago

MaxOffset is not just for "external consistency"; we also rely on it for read leases (which is what makes it possible to serve reads from a single node instead of talking to a quorum). Nodes currently panic if they determine that MaxOffset has been exceeded, so setting MaxOffset to zero won't work unless your clocks are very precisely synchronized (or for the special case of a "cluster" running on a single machine).
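As a rough illustration of the read-lease dependency (a simplified model, not the actual lease code): a leaseholder can only serve a read locally if its lease is still valid even on a clock that is ahead of its own by up to MaxOffset.

```go
package main

import (
	"fmt"
	"time"
)

const maxOffset = 500 * time.Millisecond // stand-in for the cluster's MaxOffset

// Lease is a toy expiration-based read lease.
type Lease struct{ Expiration time.Time }

// canServeLocalRead is conservative by maxOffset: another node's clock may be
// ahead of ours by up to that much, and if the lease could already be expired
// on such a clock, a new leaseholder might exist and serve newer writes.
// With maxOffset = 0 that margin only exists if clocks are perfectly in sync.
func canServeLocalRead(l Lease, now time.Time) bool {
	return now.Add(maxOffset).Before(l.Expiration)
}

func main() {
	lease := Lease{Expiration: time.Now().Add(time.Second)}
	fmt.Println(canServeLocalRead(lease, time.Now())) // true, with ~500ms of margin
}
```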

+1 to the original proposal to call Clock.Update on all responses.

tbg commented 8 years ago

In the same vein: We should update the coordinator's clock from potential timestamps in returned transactions. For example, after a Push, the clock should have received the contained timestamp as an Update.
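A toy sketch of that (stand-in types; the real transaction proto obviously carries more):

```go
package main

import "fmt"

// Transaction stands in for a returned transaction record whose timestamp
// may have been pushed, e.g. by a PushTxn request.
type Transaction struct{ Timestamp int64 }

// updateFromTxn folds a returned transaction's timestamp into the
// coordinator's clock, just as a response timestamp would be.
func updateFromTxn(clock *int64, txn *Transaction) {
	if txn != nil && txn.Timestamp > *clock {
		*clock = txn.Timestamp
	}
}

func main() {
	coordClock := int64(2)
	updateFromTxn(&coordClock, &Transaction{Timestamp: 101}) // after a push
	fmt.Println(coordClock)                                  // 101
}
```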

Positive side-effect for SQL: in #4393 we can also pick a current node timestamp on retry (as opposed to retrying with what was returned by the retry error as the minimum timestamp to use).

cc @andreimatei

tbg commented 8 years ago

Timestamp chaos is gone, so the only thing left to do here is to add a Now timestamp field to BatchResponse_Header and Error, set it to the store's current HLC reading in (*Store).Send before sending anything back to the client, and use it at the gateway to update the hybrid logical clock. Probably the most work (but not a lot) is writing unit tests that verify a) that the store correctly populates the field and b) that DistSender updates the HLC based on received responses' Now field. Both can be nicely focused, so not much work.
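Here is a sketch of both halves of that plan with simplified types; the `Now` field name follows the comment above, but the surrounding code is illustrative rather than the real `Store`/`DistSender`:

```go
package main

import "fmt"

type Timestamp struct{ WallTime int64 }

type Clock struct{ now Timestamp }

func (c *Clock) Now() Timestamp { c.now.WallTime++; return c.now }
func (c *Clock) Update(rt Timestamp) {
	if rt.WallTime > c.now.WallTime {
		c.now = rt
	}
}

type BatchResponseHeader struct{ Now Timestamp }
type BatchResponse struct{ Header BatchResponseHeader }
type Error struct{ Now Timestamp }

// Store side: stamp Now in a defer so both the success and the error path
// carry the store's current HLC reading back to the client.
func storeSend(clock *Clock) (br *BatchResponse, pErr *Error) {
	defer func() {
		now := clock.Now()
		if pErr != nil {
			pErr.Now = now
		} else if br != nil {
			br.Header.Now = now
		}
	}()
	// ... actual request processing elided ...
	return &BatchResponse{}, nil
}

// Gateway side: DistSender-style handling updates the HLC from either field.
func gatewayReceive(clock *Clock, br *BatchResponse, pErr *Error) {
	if pErr != nil {
		clock.Update(pErr.Now)
	} else if br != nil {
		clock.Update(br.Header.Now)
	}
}

func main() {
	storeClock, gatewayClock := &Clock{Timestamp{100}}, &Clock{Timestamp{1}}
	br, pErr := storeSend(storeClock)
	gatewayReceive(gatewayClock, br, pErr)
	fmt.Println(gatewayClock.now) // {101}: the gateway's clock caught up
}
```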

@RaduBerinde, already pushed a few things your way but here's one more - feel free to bounce it off to someone else if your queue is full. With the instructions above, I don't think a lot of familiarity with the code is required, so most of us with capacity could do it. see below

tbg commented 8 years ago

cc @knz regarding my second-to-last comment - you'll need this issue addressed before your timestamp change is 101% correct (see the issue discussed at the beginning). Come to think of it, you're probably the right person for this job if you have a little bit of time in your queue. If not, see the strikethrough above.

knz commented 8 years ago

@tschottdorf now looking at this. There are many places where a BatchResponse and an Error are allocated and initialized. What is the recommended place to set Now? I am thinking of the topmost common caller, that is, func send(...) in kv/send.go. Does this seem right?

Then a related question: where are Error objects received and processed?

tbg commented 8 years ago

No, set it in (*Store).Send in the defer that already updates pErr and br. kv is the wrong place because that's on the coordinator side - that's where you want to incorporate the update into the clock (I'm thinking close to kv.send is the right place for that update).
