Open erikgrinaker opened 3 years ago
This is related to #24798.
@andreimatei So it turns out we already have an (unused) method TxnCoordSender.ManualRefresh()
that does this:
Since the streamer library will be calling TxnCoordSender.UpdateRootWithLeafFinalState
to update the txn state, which would update Txn.WriteTimestamp
if it's set, this may already be sufficient (with a few small tweaks to handle leaf txns, i.e. canAutoRetry
). Or we could modify it to take in the refresh timestamp rather than reading it from Txn.WriteTimestamp
. Am I missing anything?
What you say makes sense to me.
Or we could modify it to take in the refresh timestamp rather than reading it from
Txn.WriteTimestamp
.
Taking the timestamp to refresh to as an argument sounds good; if we don't need it, even better I'd say.
I've looked this over, and from what I can tell we have the main pieces we need already in place. I think the streamer would do something like this:
br, pErr := leafTxn.Send(ctx, ba)
if pErr != nil {
canRefreshTxn, refreshTxn := roachpb.CanTransactionRefresh(ctx, pErr)
if !canRefreshTxn {
return pErr.GoError()
}
rootTxn.Sender().UpdateRootWithLeafFinalState(ctx, &roachpb.LeafTxnFinalState{
Txn: *refreshTxn,
RefreshSpans: nil, // set appropriate spans
})
if err := rootTxn.ManualRefresh(); err != nil {
return err
}
}
As such, I'm going to close this for now, but please reopen (or open a new issue) if you find that this isn't sufficient @yuzefovich.
Hey @erikgrinaker, I just rebased my prototype for this on top of latest master, and roachpb.CanTransactionRefresh
is gone, and I can't seem to find its replacement 😅 Could you please point me to it?
Never mind, I found that it was refactored in #73557.
@erikgrinaker I still need your help after all 😄
My current prototype is something like
br, pErr := leafTxn.Send(ctx, ba)
if pErr != nil {
canRefreshTxn, refreshTS := roachpb.TransactionRefreshTimestamp(pErr)
if !canRefreshTxn {
return pErr.GoError()
}
leafTxnFinalState, err := leafTxn.GetLeafTxnFinalState(ctx)
if err != nil {
return err
}
refreshTxn := pErr.GetTxn()
refreshTxn.Refresh(refreshTS)
leafTxnFinalState.Txn = *refreshTxn
rootTxn.Sender().UpdateRootWithLeafFinalState(ctx, leafTxnFinalState)
if err := rootTxn.ManualRefresh(ctx); err != nil {
return err
}
}
Does this look right? What happens to the old leafTxn
is the refresh succeeds? Do I have to create a new leafTxn
?
Hm, I'm not sure if that's going to work. Because refreshTxn.Refresh()
forwards the read and write timestamps:
But then ManualRefresh()
is a noop if these timestamps are equal:
@nvanbenschoten Any thoughts on how to accomplish the above following #73557? Do we need to update ManualRefresh()
or wire up a new method for this?
This is interesting. I agree that refreshTxn.Refresh()
is not the right thing to do for the reason you mentioned. Is it needed though? What would happen if we just did the following? Won't the leafTxn
already be updated with an increased WriteTimestamp
that reflects the pErr
by the time we hear about it?
br, pErr := c.Send(ctx, ba)
if pErr != nil {
canRefreshTxn, refreshTS := roachpb.TransactionRefreshTimestamp(pErr)
if !canRefreshTxn {
return pErr.GoError()
}
leafTxnFinalState, err := leafTxn.GetLeafTxnFinalState(ctx)
if err != nil {
return err
}
rootTxn.Sender().UpdateRootWithLeafFinalState(ctx, leafTxnFinalState)
if err := rootTxn.ManualRefresh(ctx); err != nil {
return err
}
}
Is it on me to try the latest proposal from Nathan and see if it works? Or are you still thinking about how to achieve the manual refresh?
Sorry, this fell through the cracks. I would go with Nathan's suggestion for now, and I'll look this over real quick when I have a chance.
Yeah, I had a quick look, and I think that should work -- the leaf txn should have its WriteTimestamp
bumped by the error, which gets propagated to the root txn here and then refreshed. But it's definitely worth writing an integration test for this.
@erikgrinaker @nvanbenschoten Apologies for having to page the context back in, but I finally got to working on this, and maybe the solution proposed by Nathan doesn't quite work (or at least I'm misusing it).
In https://github.com/yuzefovich/cockroach/tree/streamer-refresh in the last two commits I'm introducing a test for this behavior as well as a fix (which is currently incomplete, but I don't think it's important). With the snippet similar to https://github.com/cockroachdb/cockroach/issues/68051#issuecomment-1013324707, when trying to re-execute the BatchRequest, I'm getting the same injected RWUI error coming from here. It seems as if the TxnCoordSender
of the leafTxn
is poisoned and ManualRefresh
didn't fix it.
Is it the problem with the test where I inject RWUI error and it is not reflected in the refresh spans of LeafTxnFinalState? Or should I be creating a new LeafTxn once the root txn is refreshed?
when trying to re-execute the BatchRequest, I'm getting the same injected RWUI error coming from here
Are you creating a new leaf transaction after the manual refresh? Once a leaf transaction has hit this error, I think it needs to be ingested into the root txn (through UpdateRootWithLeafFinalState
) and then discarded.
Also, could you try changing this line from roachpb.NewError(...)
to roachpb.NewErrorWithTxn(..., ba.Txn)
?
Are you creating a new leaf transaction after the manual refresh?
No, I wasn't creating a new leaf and was using the old one. After creating a fresh LeafTxn the retried BatchRequest goes through, but it still has the old timestamp which goes against my intuition. It looks like we probably should not be just ignoring refreshTS
value from the snippet - how should we incorporate it? My expectation is that after ManualRefresh
and creating a new LeafTxn, the new LeafTxn should be using refreshTS
value as its ReadTimestamp, but it is currently not the case.
Also, could you try changing this line from
roachpb.NewError(...)
toroachpb.NewErrorWithTxn(..., ba.Txn)
?
This doesn't seem to change anything (neither w/ nor w/o re-creating a leaf txn).
but it still has the old timestamp which goes against my intuition
This goes against my intuition as well. I think it has to do with the NewErrorWithTxn
. In cases where a transaction's timestamp is bumped due to a tscache conflict (write-read conflict), a closed timestamp conflict (write-read conflict), a write-too-old conflict (write-write conflict), and most others, the returned Txn already has its WriteTimestamp bumped. That will cause TxnCoordSender.updateStateLocked
to bump the coordinator's timestamp.
We're not doing this in the test, so I think that explains things. However, I'm not sure that we do bump the write timestamp on a ReadWithinUncertaintyInterval
error before performing the rest of the refresh, because a txn can't proceed with evaluation until the error is refreshed away. So the error does not carry a provisional updated Txn.
With that in mind, it does feel like we should be using the refreshTS
straight from TransactionRefreshTimestamp
and feeding that into a new ManualRefreshTo(context.Context, hlc.Timestamp)
method which first forwards the txn's WriteTimestamp to the provided timestamp before performing the refresh.
I see, I think this makes sense, thanks Nathan. It almost worked right away - I started getting
TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh - unknown error: <nil>): "unnamed" meta={id=9e4d6b4c key=/Min pri=0.00941665 epo=0 ts=1653076575.760907001,1 min=1653076575.760907000,0 seq=0} lock=false stat=PENDING rts=1653076575.760907000,0 wto=false gul=1653076576.260907000,0
during ManualRefreshTo
. Apparently, txnSpanRefresher.refreshedTimestamp
was empty causing this error during a refresh, so for now I added a bump to that timestamp in pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go
. Could you please take a quick look at this snippet?
I don't think that snippet is quite right. The handling of txnSpanRefresher.refreshedTimestamp
is pretty funky so I see why you wrote the code that way. However, as written, I think you're bumping the refreshTxn
's ReadTimestamp
to the write timestamp and then using this to set refreshedTimestamp
. So the refresh would be a no-op in this case.
Let me see whether I can clean up this handling of txnSpanRefresher.refreshedTimestamp
so that you don't need to deal with this.
https://github.com/cockroachdb/cockroach/pull/82649 should avoid the need to hack around the empty txnSpanRefresher.refreshedTimestamp
.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
To implement the Index Lookups Memory Limits and Parallelism RFC, the streamer will have to use leaf transactions to run parallel requests (as a root
TxnCoordSender
does not support this).This will prevent the use of automatic span refreshing, instead propagating these errors to the client who will have to retry the transaction. We can avoid this by adding a method to the
TxnCoordSender
that, given a refresh timestamp, explicitly refreshes the read spans and handles errors as appropriate. The streamer will then coordinate in-flight requests, update the root txn with leaf state viaUpdateRootWithLeafFinalState
, and explicitly submit a refresh. The method should ensure we return appropriate info in any errors propagated to clients (i.e. the original keys that caused the refresh, as well as which keys caused the refresh to fail).For more details, see the Hiding ReadWithinUncertaintyInterval errors section in the RFC.
/cc @cockroachdb/kv