andreimatei commented 3 years ago

Sentry is getting crash-loop reports from a handful of cluster about closed timestamp regressions Example

store_raft.go:524: log.Fatal: ???: raft closed timestamp regression in cmd: "_3\"*\x83\xe6\xd2s" (term: 23, index: 415262); batch state: 1626099041.832244381,0, command: 1626092389.971770216,0, lease: repl=(n2,s2):3 seq=17 start=1626091869.496126607,0 exp=1626099053.832244381,4775 pro=1626099044.832244381,4775, req: ConditionalPut [×,×), EndTxn(commit:true modified-span (node-liveness)) [×], [txn: fa86e74a], [can-forward-ts], applying at LAI: 183761.

The assertion includes the tail of the Raft log, which should be very valuable in understanding what is going wrong, but that's redacted in Sentry.

Until version 21.1.4, it was possible to hit this assertion on lease acquisitions. But that was fixed (and backported) through https://github.com/cockroachdb/cockroach/issues/66512#issuecomment-862804843 Indeed, some crashes from older version appear to be on lease acquisitions. But we also have crashes on newer version coming from commands proposed by other types of requests.

I've looked at reports from a couple of clusters, and the only commonality I see, which is also weird in itself, is that in multi-node cluster, only 2 nodes report the crash, when you'd expect all the (3) replicas to crash just the same. But, to make it extra confusing, there's also single-node clusters where this happens, so it both looks and doesn't look like a replication issue.

For analyzing the Sentry reports, one thing that helps is going to the very first set of crashes from a cluster (i.e. first crash from every node). Out of those, one crash will correspond to the leaseholder, and that one will have extra information in it: the request that corresponds to the command with the regression. Getting that first crash in Sentry is tricky: you have to get to a view that has a paginated view of all the events for one cluster (example) and then you have to go to the last page by guessing the "cursur" URL argument. The argument is encoded as 0%3A followed by the actual index.

Looking at a few of these requests causing the regression, I don't see a pattern. I've seen EndTxn and I've seen a batch of Merges.

But one thing I do see and find curious is that I've looked at a few cases, and in every one, the regression was for hours. I don't know what to make of that. I'm thinking that a clock juimp perhaps has something to do with it, although I don't see how.

Jira issue: CRDB-10274

sentry-io[bot] commented 3 years ago

Sentry issue: COCKROACHDB-3HX

tbg commented 2 years ago

I just managed to reproduce this, just running this file:

#!/bin/bash

set -euo pipefail

make buildshort
roachprod destroy local || true
roachprod create -n 3 local

roachprod put local cockroach
roachprod start local:1-3
roachprod run local:1 -- ./cockroach workload init kv
roachprod run local:1 -- ./cockroach workload run kv --max-rate 100 --read-percent 0 --duration=10s {pgurl:1-3}

roachprod stop local
env COCKROACH_CLOCK_JUMP=8h roachprod start local

roachprod run local:1 -- ./cockroach workload run kv --max-rate 100 --read-percent 0 --duration=10s {pgurl:1-3}

roachprod stop local
roachprod start local

roachprod run local:1 -- ./cockroach workload run kv --max-rate 100 --read-percent 0 --duration=10s {pgurl:1-3}

and this diff

diff --git a/pkg/util/hlc/hlc.go b/pkg/util/hlc/hlc.go
index 075b461ecb..34abfd342d 100644
--- a/pkg/util/hlc/hlc.go
+++ b/pkg/util/hlc/hlc.go
@@ -15,6 +15,7 @@ import (
    "sync/atomic"
    "time"

+   "github.com/cockroachdb/cockroach/pkg/util/envutil"
    "github.com/cockroachdb/cockroach/pkg/util/log"
    "github.com/cockroachdb/cockroach/pkg/util/syncutil"
    "github.com/cockroachdb/cockroach/pkg/util/timeutil"
@@ -186,11 +187,13 @@ func (m *HybridManualClock) Resume() {
    m.mu.Unlock()
 }

+var hackClockJump = envutil.EnvOrDefaultDuration("COCKROACH_CLOCK_JUMP", 0)
+
 // UnixNano returns the local machine's physical nanosecond
 // unix epoch timestamp as a convenience to create a HLC via
 // c := hlc.NewClock(hlc.UnixNano, ...).
 func UnixNano() int64 {
-   return timeutil.Now().UnixNano()
+   return timeutil.Now().UnixNano() + hackClockJump.Nanoseconds()
 }

 // NewClock creates a new hybrid logical clock associated with the given

on top of e84001d8611aedc45a413019f5e6317c2b757ca0.

‹/System/{NodeLive…-tsd}›,raft] 43 ???: raft closed timestamp regression in cmd: ">\xa3\x9a\x17#pN\xf9" (term: 8, index: 122); batch state: 164>

tbg commented 2 years ago

Here's n1's log: https://gist.github.com/tbg/1dfc6ee1d324a2971307e3165448275d. The other two nodes survived.

tbg commented 2 years ago

Got this again so I think this repro is pretty good. In fact I'm going to have to disable the assertion so that I can go about what I was actually looking into (https://github.com/cockroachdb/cockroach/issues/74909) :-)

tbg commented 2 years ago

Note, when I disable the assertions it still crashes:

F220117 14:07:42.887954 249 kv/kvserver/replica.go:1269 ⋮ [n1,s1,r3/1:‹/System/{NodeLive…-tsd}›] 142 on-disk and in-memory state diverged: [Raft ClosedTimestamp.WallTime: 1642428459884897696 != 1642457252063853439]

In other words, something pretty bad is going on.

yuzefovich commented 2 years ago

Pinging this issue since we had a few more sentry reports of this error.

andreimatei commented 2 years ago

I think https://github.com/cockroachdb/cockroach/pull/75298 has something to say about this. Tobi can I pass it to you? :P

pav-kv commented 2 years ago

Tried repro https://github.com/cockroachdb/cockroach/issues/70894#issuecomment-1014569388 a few times, no luck catching it so far. Also tried with a negative time shift.

The description of #75298 mentions that only expiration-based leases are affected, so I tried also making (almost) all leases expiration-based to increase chances for the repro to catch this. No luck either.

Looking closer at the failure report #90682, the lease is epoch-based, so maybe both types are affected.

sentry-io[bot] commented 1 year ago

Sentry issue: COCKROACHDB-6BJ

tbg commented 1 year ago

I can reproduce this using https://github.com/cockroachdb/cockroach/pull/97173, looking into it.

tbg commented 1 year ago

Annoyingly, this happens quite readily using #97173 on my M1 Mac, but not on my gceworker. I've disabled time sync[^1] just to be sure but it didn't have an effect. Will continue investigating.

[^1]:

tbg commented 1 year ago

Digging into one of the repros. The command that triggered the assertion^1 did so at index 1302 (i.e. very quickly!) on r69.

F230216 13:46:35.143437 267 kv/kvserver/store_raft.go:644 ⋮ [T1,n3,s3,r69/2:‹/Table/106/1/-{46219…-1}›,raft] 8975 raft closed timestamp regression; batch state: 1676555190.314745000,0 > command: 1676555190.313797000,0, lease: repl=(n3,s3):2 seq=3 start=1676555141.986754000,1 epo=1 pro=1676555166.197260000,0, req: Put [/Table/106/1/‹-3061462225986106197›/‹0›,/Min), EndTxn(commit) [/Table/106/1/‹-3061462225986106197›/‹0›], [txn: 0236eeec], [can-forward-ts].

The command got applied successfully (no forced error), at LAI 1236. It is preceded in the log by a large swath of commands at LAI 1234, all but the first of which must then have been rejected since nothing changes about the lease all through the log:

    raft_log_debug_test.go:51: idx 1241: 65d4c611366565f6: LAI 1230, CT 1676555190.313797000,0, Lease 3
    raft_log_debug_test.go:51: idx 1242: 53198d5b44c55bae: LAI 1231, CT 1676555190.313797000,0, Lease 3
    raft_log_debug_test.go:51: idx 1243: 405fe9c6ccea1c1b: LAI 1232, CT 1676555190.313797000,0, Lease 3
    raft_log_debug_test.go:51: idx 1244: 3d303291ff774d03: LAI 1233, CT 1676555190.313797000,0, Lease 3
    raft_log_debug_test.go:51: idx 1245: 531dfce5714c543b: LAI 1234, CT 1676555190.314745000,0, Lease 3
[more LAI=1234 skipped]
    raft_log_debug_test.go:51: idx 1301: 7d869fdadedf5b33: LAI 1234, CT 1676555190.314745000,0, Lease 3
    raft_log_debug_test.go:51: idx 1302: 5d2b64492a4f1ab4: LAI 1236, CT 1676555190.313797000,0, Lease 3 <-- regression

Looking at the CT, there actually does seem to be a regression. And the regression is relative to the entry at idx 1235, which supplied the CT of 1676555190.314745000, and appears to have applied successfully then.

I'm puzzling over the role of LAI 1234 here. This is exactly when my prototype starts introducing artificially deflated LAIs. But 1234 got assigned naturally here. Why is it special? I've seen this in multiple repros so I don't think this is circumstance. And the command at LAI 1234 isn't doing anything wrong. But the one right after that. And why is there a giant block of LAI 1234 proposals? There isn't a single reproposal in this log - all CmdIDs are unique.

tbg commented 1 year ago

Ok I think I understand at least something. I do now think there is a reproposal. Concretely, here's what I think happens:

range gets created
there isn't immediately a leader, so the first couple thousand writes to it go to the proposal buffer, and on flush raft drops them.
eventually refreshProposalsLocked is called, and it will reinsert all of those proposals into the prop buffer again. Crucially, it sorts them^1 by LAI because proposing out of order is guaranteed to fail most of them.
this time there is a leader and so all commands get into the log, but because we assigned LAI 1234 for 10% of them there is now a large block of commands at that index, and more importantly the order within these is essentially random (the sort is not stable and besides we're sorting out of a random map iteration over r.mu.proposals).

So this explains the log we're seeing.

I think it then also explains the assertion failure.

The smallest closed timestamp we see with LAI 1234 is smaller than that of the offending command at idx 1302. So we had the following real time order:

the first command with LAI 1234 gets proposed at closed timestamp (say) 100.
"our" command gets proposed at closed timestamp (say) 101.
a bunch of other commands get proposed at closed timestamps > 101. Some of them get LAI 1234 because of the injection.
but none of these commands actually go into the log
reproposal sorts them and actually inserts them into the log
but now one of the "injected" LAI 1234 proposals gets to actually apply at LAI 1234. Our command gets to apply at LAI 1236 but trips the CT assertion.

So, did I do something illegal by intentionally assigning incorrect LAIs? The invariant (which I didn't know we relied on, and which I'm not sure is spelled out anywhere) is a monotonicity condition, namely that the assigned (LAI, CT) pairs are increasing, i.e. lai1 < lai2 must imply ct1 <= ct2. By artificially lowering the LAI for some fields we break that.

To introduce the chaos I want without tripping the assertion, I need to assign "fake LAIs" that I know 100% will be rejected. (Right now, I'm assigning a fake LAI that I know was previously assigned, but that doesn't mean my command can't win, as we see here). Their CT will not matter (since rejected commands apply without a CT) and so things should work out. Unfortunately, the LAI is initially 0, which means there's no way to pick an illegal static number. We need to wait for the state machine to apply something first.

So we miss the very interesting first seconds of log where there is lots of reordering.

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

tbg commented 1 year ago

I ended up instead sorting the reproposal slice by (LAI, ClosedTimestamp, Lease) which I thought would avoid this issue. I never saw it again during my testing after that until last night where I randomly got it while repro'ing for #97287 (with the sorting). So, I don't seem to have solved it completely and there's something to be understood yet. Possibly this is then also going to answer how we're seeing this in the wild.

tbg commented 1 year ago

I ended up instead sorting the reproposal slice by (LAI, ClosedTimestamp, Lease) which I thought would avoid this issue. I never saw it again during my testing after that until last night where I randomly got it while repro'ing for #97287 (with the sorting). So, I don't seem to have solved it completely and there's something to be understood yet. Possibly this is then also going to answer how we're seeing this in the wild.

Looking at what I wrote there again, I don't know why that would avoid the issue. We can still have the same problem:

assign LAI 12345, CT=1000 to log position 100
assign LAI 99999, CT=500 to log position 101
boom

So I don't know what shifted to hide this problem in my reproductions, but it should still exist (at least when invalid LAIs are introduced).

tbg commented 1 year ago

Ah - I know. By sorting by closed timestamp, the LAI=1234 block would now pick the smallest possible closed timestamp for slot LAI=1234. One of them had to be smaller than whatever we assigned to LAI=1235 (or higher). So we avoided the assertion, but not for any sound reason. We just reduced the chances of it a ton. Which explains why I haven't been seeing it any more except that one time. I should've looked into that one repro more but didn't: it might've been a different mode, perhaps more relevant to the issue at hand.

pav-kv commented 1 year ago

@tbg How about trying to catch the (MaxLeaseIndex, ClosedTimestamp) inversion early (by adding more assertions 😬 of the invariant mli1 <= mli2 ==> ct1 <= ct2), e.g. right after Sort has reordered the proposals, and/or at Flush time? Maybe your repro will then catch it earlier in the pipeline (rather than only at application time, after it got through Raft).

Is closed timestamp assigned once, during the first proposal flush time? Or is it reassigned on: a) reproposal, b) reproposal with a higher MaxLeaseIndex?

tbg commented 1 year ago

Earlier assertions sound good.

Is closed timestamp assigned once, during the first proposal flush time? Or is it reassigned on: a) reproposal, b) reproposal with a higher MaxLeaseIndex?

It's currently not reassigned, but just because that ended up being slightly more convenient in the current code, not because there's some reason not to. (But there also isn't a reason to, the less change across proposal reuse the better).

I'm going to unassign for now, we'll pick this back up in https://cockroachlabs.atlassian.net/browse/CRDB-25287.

tbg commented 1 year ago

We didn't get around to getting to the bottom of this. The comment^1 explains the understanding I have: there is a monotonicity requirement between MLAI and closed timestamp assignment - the assigned (MLAI,CT) pairs need to be totally ordered by the component-wise order (i.e. if MLAI > MLAI', then CT > CT'). This wasn't true in the experiment I ran (I was artificially introducing backward jumps in MLAI) and I suspect it may not always be true in practice either, though I'm hard-pressed to find an example. One thing to look into could be special commands like lease requests, etc, which don't fully use the LAI mechanism. Could these cause this reversion? I don't think so, but still worth double checking. Similarly, I don't think rejected commands (illegal lease index, invalid proposer lease) can trigger closed timestamp updates.

Pavel's suggestion above has merit, we can verify the total order before commands enter raft, and this could give us an idea of where the problem arises.

It's also totally possible that the problem occurs in practice only due to some corruption in the system ("Ex falso sequitur quodlibet") or that we've since "accidentally" fixed it in the course of simplifying tryReproposeWithNewLeaseIndex^2, though I don't recall reproposals being involved at all (refreshes were involved in my experiment).

One immediate thing we might do is to adopt must^3 once it's available and to report the error to sentry without triggering the crash in production builds^4. We don't even have to wait for must. It is pretty clear that users don't benefit from the assertion; we'll usually just tell them to disable it and go on with their lives.

pav-kv commented 11 months ago

From a recent repro:

$ grep -o "\(Term.*Index.*\|max_lease_index.*closed_timestamp.*replic\)" closed-timestamp-log.txt | sed 's/ replic//g'

Term:6 Index:15 Type:EntryNormal  (ID 0ad4ef7915f2d9f7) by lease #6
max_lease_index:2 closed_timestamp:<wall_time:1699467454251881189 >
Term:6 Index:14 Type:EntryNormal  (ID 4a37f23fa745a2e1) by lease #6
max_lease_index:1 closed_timestamp:<wall_time:1699467454363124760 >
Term:6 Index:13 Type:EntryNormal  (ID 174586ad919d0c61) by lease #6
max_lease_index:1 closed_timestamp:<wall_time:1699467454220166867 >
Term:6 Index:12 Type:EntryNormal  (ID 4a37f23fa745a2e1) by lease #6
max_lease_index:1 closed_timestamp:<wall_time:1699467454363124760 >
Term:6 Index:11 Type:EntryNormal : EMPTY [recovered]

All proposals seem to be made under the same leadership and lease. All proposals are made right in the beginning of a range lifecycle (log indices start at 10 and higher). The EMPTY log entry (index 11) is committed when the raft leader has just been elected.

Observations on the proposals:

Proposals at index 12 and 14 have the same ID, so likely are full-copy reproposals.
Proposal at index 13 was done with the same MLI = 1, so is applied as no-op.
For some reason, proposal at index 15 was made with a lower closed timestamp than proposals 12 and 14.

Probably self-inflicted by the test-only LAI override that returns 1 in some cases when it wanted to return a LAI > 1.

Probably the original ordering had entry 12 and 14 submitted at LAI 3 (since its closed timestamp is the biggest of all), but the test interceptor randomly assigned LAI 1 to it. This is a bug in the test harness.

The failures reported via Sentry fail at higher indices and LAIs though, so the issue is real. The repro is irrelevant.

pav-kv commented 10 months ago

Latest occurrence of this panic in kvnemesis failure here: https://github.com/cockroachdb/cockroach/issues/116541#issuecomment-1880087996.

$ grep -o "\(Term.*Index.*\|max_lease_index.*closed_timestamp.*replic\)" ct.txt | sed 's/ replic//g'                                                        backport23.1-115595-114191!?
Term:6 Index:15 Type:EntryNormal  (ID 450747ef87ce7014) by lease #9
max_lease_index:2 closed_timestamp:<wall_time:1704640779372779513 >
Term:6 Index:14 Type:EntryNormal  (ID 6f1758389525834e) by lease #9
max_lease_index:1 closed_timestamp:<wall_time:1704640779402664522 >
Term:6 Index:13 Type:EntryNormal  (ID 5c2068c6aee7e0e0) by lease #9
max_lease_index:1 closed_timestamp:<wall_time:1704640779361734754 >
Term:6 Index:12 Type:EntryNormal  (ID 6f1758389525834e) by lease #9
max_lease_index:1 closed_timestamp:<wall_time:1704640779402664522 >
Term:6 Index:11 Type:EntryNormal : EMPTY

Possibly same test harness bug as https://github.com/cockroachdb/cockroach/issues/70894#issuecomment-1854893627.

Entries 12 and 14 were probably intended at MLI 3, but the test assigned MLI=1 to them, and one accidentally got submitted before entry 12 (which originally had MLI=1).

The fix would be to disallow the MLI=1 injection until the LAI has definitely passed 1. Then we'll avoid the unintended command reordering. Once we have this fix, nothing stops us from generalizing it: the MLI can be overridden to any value <= submitted LAI, and the command should not be applied in this case.

nicktrav commented 4 months ago

Likely closed as part of #123442. Sentry will tell us if / when we see this again.

pav-kv commented 4 months ago

123442 saw a regression in closed timestamp sidetransport, while this issue is about a regression in the log. Not sure these are the same.

nicktrav commented 4 months ago

cc: @arulajmani - didn't we decide these were similar enough to warrant closing?

nicktrav commented 4 months ago

Re-opening just in case.

cockroachdb / cockroach

closedts: raft closed timestamp regression #70894

123442 saw a regression in closed timestamp sidetransport, while this issue is about a regression in the log. Not sure these are the same.