Your analysis is correct but there's one moving part that you have forgotten: the maximum clock offset. Basically what would happen in step four of your analysis is that the transaction will come along with timestamp 2, but there is a write at 101. Assuming that you're not running with a MaxOffset setting which is below the actual offset (~100 in your example), this read will return with a *ReadWithinUncertaintyIntervalError (which includes a higher timestamp at which the coordinator will retry).
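To make that concrete, here is a rough sketch of the check that fires in your step four (plain integers and a made-up main, not the actual storage code):

```go
package main

import "fmt"

func main() {
	const (
		readTS    = 2   // txn timestamp from the slow gateway clock
		writeTS   = 101 // existing write on the key
		maxOffset = 250 // assumed maximum clock offset between nodes
	)

	// A value is uncertain if it was written after the read timestamp but
	// within the read's uncertainty window (readTS, readTS+maxOffset].
	if writeTS > readTS && writeTS <= readTS+maxOffset {
		// The coordinator restarts the transaction at (at least) writeTS.
		fmt.Printf("ReadWithinUncertaintyIntervalError: retry at %d\n", writeTS)
	}
}
```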
I agree with your analysis though that there's room for improvement there. We could go along with your suggestion of updating the timestamp when the gateway receives a response, but it will not fix the problem entirely.
Assume we update the clock as you suggest but the client's next request goes to a different coordinator which didn't have anything to do with the previous steps but has the same slow clock. Then it will pick a problematic timestamp - so the underlying behaviour doesn't change, though of course it reduces the likelihood of restarts, so I think we should do it.
In the long run, we'll probably want to use HLC timestamps as "causality tokens" - all operations which need to be linearized pass the highest seen timestamp along. That means we don't try to blindly serialize everything, but only operations where it actually matters. We're not doing that now, and it would require client-side support, so it's definitely further down the road.
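Very roughly (an entirely hypothetical client-side API, nothing like it exists today):

```go
package main

import "fmt"

// CausalityToken is a hypothetical client-side token: the client keeps the
// highest HLC timestamp it has observed and passes it along with any
// operation that must be linearized after earlier ones.
type CausalityToken struct {
	WallTime int64
	Logical  int32
}

// Forward ratchets the token up to a newer observed timestamp.
func (t *CausalityToken) Forward(wall int64, logical int32) {
	if wall > t.WallTime || (wall == t.WallTime && logical > t.Logical) {
		t.WallTime, t.Logical = wall, logical
	}
}

func main() {
	var tok CausalityToken
	tok.Forward(101, 0) // highest timestamp seen in the first txn's response
	// A follow-up operation would send tok along so that its timestamp is
	// chosen above tok, making the earlier write visible.
	fmt.Printf("next operation must run above %d.%d\n", tok.WallTime, tok.Logical)
}
```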
Another half-baked thought is letting Node pick the timestamp (as opposed to TxnCoordSender). But even then, your second transaction may have its first contact with a node which doesn't have anything to do with the conflicting key, but picks a bad timestamp. Since the idea gets in the way of parallelizing DistSender, I don't think it's a great one.
About MaxOffset, I think I can set it to 0 if I don't care about "External Consistency" (strong consistency).
In my opinion, the client side is outside of the cockroach server system, and requests from different coordinators don't need to conform to the "program order" rule of "Sequential Consistency" (internal consistency).
So, I think we can achieve "Sequential Consistency" completely if the coordinator updates its HLC when it receives a response from a range node.
As for "External Consistency", that's a higher requirement; MaxOffset will take care of it.
As far as I'm concerned, the "processor" in sequential consistency is the client (that may not be true for your use case, but in general that's what I would want) and the client could be talking to random nodes through a load balancer. Within a transaction there's no problem because we're pinned to one coordinator, but two sequential transactions could run into the above issue.
But, as said above, I think we should go with your suggestion. I just wanted to point out that it doesn't fully satisfy me yet (I think that would need causality tokens).
MaxOffset is not just for "external consistency"; we also rely on it for read leases (which is what makes it possible to serve reads from a single node instead of talking to a quorum). Nodes currently panic if they determine that MaxOffset has been exceeded, so setting MaxOffset to zero won't work unless your clocks are very precisely synchronized (or for the special case of a "cluster" running on a single machine).
+1 to the original proposal to call Clock.Update on all responses.
In the same vein: We should update the coordinator's clock from potential timestamps in returned transactions. For example, after a Push, the clock should have received the contained timestamp as an Update.
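Sketched out, something like the following on the coordinator (the Clock.Update signature is the one quoted later in this issue; the hook itself is hypothetical and import paths/field names are from memory, so treat them as approximate):

```go
package kv // sketch only

import (
	"github.com/cockroachdb/cockroach/roachpb"
	"github.com/cockroachdb/cockroach/util/hlc"
)

// updateClockFromResponse is a hypothetical coordinator-side hook: feed any
// transaction timestamp carried in a response (e.g. one raised by a Push)
// back into the local hybrid logical clock.
func updateClockFromResponse(clock *hlc.Clock, br *roachpb.BatchResponse) {
	if br != nil && br.Txn != nil {
		clock.Update(br.Txn.Timestamp)
	}
}
```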
Positive side-effect for SQL: in #4393 we can also pick a current node timestamp on retry (as opposed to retrying with what was returned by the retry error as the minimum timestamp to use).
cc @andreimatei
Timestamp chaos is gone, so the only thing left to do here is add a Now timestamp field to BatchResponse_Header and Error which is set to now in (*Store).Send before sending anything back to the client, and using that at the gateway to update the hybrid logical clock. Probably the most work (but not a lot) is writing unit tests that a) verify that the store correctly populates the field and b) that DistSender updates the HLC based on received responses' Now field. Both can be nicely focused, so not much work.
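Roughly what I have in mind, with the real types stubbed out so the end-to-end flow is visible (the Now field and the two helpers are only the proposal, not existing code):

```go
package main

import "fmt"

// Simplified stand-ins for roachpb.Timestamp / hlc.Clock so the example is
// self-contained; the real types live in roachpb and util/hlc.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

type Clock struct{ ts Timestamp }

// Update ratchets the clock forward, mirroring hlc's Clock.Update.
func (c *Clock) Update(rt Timestamp) {
	if rt.WallTime > c.ts.WallTime ||
		(rt.WallTime == c.ts.WallTime && rt.Logical > c.ts.Logical) {
		c.ts = rt
	}
}

func (c *Clock) Now() Timestamp { return c.ts }

// BatchResponse stands in for BatchResponse_Header with the proposed field.
type BatchResponse struct {
	Now Timestamp // set by (*Store).Send before returning to the client
}

// Store side: stamp the outgoing response with the store's clock reading.
func stampResponse(storeClock *Clock, br *BatchResponse) {
	br.Now = storeClock.Now()
}

// Gateway side: DistSender folds the returned Now into its own clock, so the
// next transaction started here cannot pick a timestamp below it.
func observeResponse(gatewayClock *Clock, br *BatchResponse) {
	gatewayClock.Update(br.Now)
}

func main() {
	store := &Clock{ts: Timestamp{WallTime: 101}} // the fast clock
	gateway := &Clock{ts: Timestamp{WallTime: 2}} // the slow clock

	var br BatchResponse
	stampResponse(store, &br)
	observeResponse(gateway, &br)

	// The gateway's clock has caught up; subsequent reads see the write at 101.
	fmt.Println(gateway.Now().WallTime) // 101
}
```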
@RaduBerinde, already pushed a few things your way but here's one more - feel free to bounce it off to someone else if your queue is full. With the instructions above, I don't think a lot of familiarity with the code is required, so most of us with capacity could do it. See below.
cc @knz regarding my second-to-last comment - you'll need this issue addressed before your timestamp change is 101% correct (see the issue discussed at the beginning). Come to think of it, you're probably the right person for this job if you have a little bit of time in your queue. If not, see the strikethrough above.
@tschottdorf now looking at this. There are many places where a BatchResponse and Error are allocated and initialized. What is the recommended place to set Now? I am thinking about the topmost common caller, that is kv/send.go func send(...). Does this seem right?
Then another related question: where are Error objects received and processed?
No, set it in (*Store).Send in the defer that already updates pErr and br. kv is the wrong place because that's on the coordinator side - that's where you want to incorporate the update into the clock (I'm thinking close to kv.send is the right place for that update).
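In other words, something like this (stubbed types; the Now fields are the proposed additions, not existing code):

```go
package main

import "fmt"

// Minimal stand-ins; Error and BatchResponse carry the proposed Now field.
type Timestamp struct{ WallTime int64 }

type Error struct{ Now Timestamp }
type BatchResponse struct{ Now Timestamp }

type Store struct{ clock Timestamp }

// Send sketches the shape of (*Store).Send: the named return values are
// stamped in the same defer that already post-processes br and pErr.
func (s *Store) Send() (br *BatchResponse, pErr *Error) {
	defer func() {
		now := s.clock // stand-in for s.Clock().Now()
		if pErr != nil {
			pErr.Now = now
		} else if br != nil {
			br.Now = now
		}
	}()
	return &BatchResponse{}, nil
}

func main() {
	s := &Store{clock: Timestamp{WallTime: 101}}
	br, _ := s.Send()
	fmt.Println(br.Now.WallTime) // 101
}
```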
Currently, func (c *Clock) Update(rt roachpb.Timestamp) is only invoked when a range node receives a request from a gateway node. When a gateway node receives a response from a range node, this function is not invoked. So, there will be a problem:
1) Suppose the HLC for nodeA is currently '1' and for nodeB it is '100'.
2) nodeB (as gateway node) starts a txn to read keyA from nodeB (as range node), so we have read timestamp '100' for keyA in the tsCache on nodeB. The txn then commits immediately.
3) nodeA (as gateway node) starts a txn to write keyA on nodeB (as range node) with timestamp '1' (the txn fetches the timestamp from the gateway node, nodeA). We check the read timestamp for keyA from the tsCache on nodeB and decide to push the txn timestamp to '101'. The txn then commits immediately.
4) nodeA doesn't invoke func (c *Clock) Update(rt roachpb.Timestamp) when it receives the response from nodeB. Then nodeA starts a new txn to read keyA from nodeB with timestamp '2'. So, this new txn can't see the newest value (with timestamp '101') which was updated by nodeA itself. This does not conform to the "program order" rule of "Sequential Consistency".
So, I think we need to invoke func (c *Clock) Update(rt roachpb.Timestamp) when the gateway node receives a response from the range node. Then nodeA can read keyA with timestamp '102'.