cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

storage: consistency check failure during import #36861

Closed nvanbenschoten closed 5 years ago

nvanbenschoten commented 5 years ago

To do / understand


This looks very similar to https://github.com/cockroachdb/cockroach/issues/35424, so it's possible that that issue wasn't fully resolved. I was most of the way through a TPC-C 4k import when a node died due to a consistency check failure.

F190416 01:58:51.634989 172922 storage/replica_consistency.go:220  [n5,consistencyChecker,s5,r590/1:/Table/68/1/{29/4/2…-31/4/1…}] consistency check failed with 1 inconsistent replicas
goroutine 172922 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000056301, 0xc000056300, 0x5449800, 0x1e)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1020 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5bdd700, 0xc000000004, 0x5449860, 0x1e, 0xdc, 0xc008bfa5a0, 0x79)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:878 +0x93d
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3aa1620, 0xc0071899e0, 0x4, 0x2, 0x33b2862, 0x36, 0xc01f06cce0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d8
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3aa1620, 0xc0071899e0, 0x1, 0xc000000004, 0x33b2862, 0x36, 0xc01f06cce0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3aa1620, 0xc0071899e0, 0x33b2862, 0x36, 0xc01f06cce0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).CheckConsistency(0xc023eeb400, 0x3aa1620, 0xc0071899e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go:220 +0x6ce
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).CheckConsistency(0xc023eeb400, 0x3aa1620, 0xc0071899e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go:229 +0x81b
github.com/cockroachdb/cockroach/pkg/storage.(*consistencyQueue).process(0xc0003de2a0, 0x3aa1620, 0xc0071899e0, 0xc023eeb400, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/consistency_queue.go:125 +0x210

Cockroach SHA: 3ebed10216aa83821b68eab023d3ed650bae8913

Notes:

Cluster: nathan-tpcc-geo (stopped, extended for 48h)
Cockroach nodes: 1, 2, 4, 5, 7, 8, 10, 11
Inconsistent range: r590
Replicas: nathan-tpcc-geo:2/n2/r3, nathan-tpcc-geo:5/n4/r4, and nathan-tpcc-geo:7/n5/r1
Inconsistent replica: nathan-tpcc-geo:7/n5/r1
Replicas in zones: europe-west2-b, europe-west4-b, and asia-northeast1-b, respectively

Initial Investigation

Unlike in the later reproductions of #35424, replica 1's Raft log is an exact prefix of replicas 3 and 4's, so this doesn't look like the failure mode we eventually saw in that issue.

I haven't looked at much else yet.

r590 Range Debug Cockroach Console.pdf

tbg commented 5 years ago

We don't have the diff because I messed something up during recent refactors. Running

roachprod ssh nathan-tpcc-geo -- sudo cp -R /mnt/data1/cockroach /mnt/data1/backup

now and then I'll bring the cluster up to recover the diff.

tbg commented 5 years ago

Currently the range seems stats-consistent and also passes when enqueued through the full consistency checker. The replicas are still n2, n4, n5 (replica IDs 1, 3, 4), so nothing has shifted. I will dig some more, but this could mean that we were looking at a false positive of sorts, or at a temporary inconsistency that resolved itself.

I'm also no longer sure that I messed something up regarding the consistency checker diff. What's really weird is that the queue typically first computes a checksum without a diff. When that has finished and found mismatching SHAs, it computes another round, this time with a diff, prints that, and fatals. In the logs, we're supposed to see the log message from line 226.

https://github.com/cockroachdb/cockroach/blob/a078fd2b2b9fd3d0cd0c99b926bc4d9234ff7600/pkg/storage/replica_consistency.go#L224-L231

I don't see that in the (combined) logs from the cluster. But the stack trace shows that we are in the recursive call from line 229.
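
For orientation, here's a heavily condensed sketch of that two-phase flow in Go. The names and the signature are hypothetical; the real logic lives in Replica.CheckConsistency in replica_consistency.go.

package main

import "log"

// checkConsistency is a toy condensation of the flow described above.
// computeChecksums stands in for collecting checksums (and, optionally,
// diffs) from all replicas.
func checkConsistency(computeChecksums func(withDiff bool) (match bool, diff string)) {
	if match, _ := computeChecksums(false); match {
		return // common case: all replicas agree, no diff was collected
	}
	// Mismatch: this is where the "line 226" message would be logged, after
	// which the check reruns with diff collection enabled and then fatals.
	log.Print("consistency check failed; rerunning with diff")
	_, diff := computeChecksums(true)
	log.Fatalf("consistency check failed with 1 inconsistent replicas; diff:\n%s", diff)
}

func main() {
	checkConsistency(func(withDiff bool) (bool, string) {
		return false, "<per-key replica diff would go here>"
	})
}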

A clue is that the fatal error is in its own, new, log file. There aren't any past log files. It looks like the logging messed something up; we effectively know nothing about n5 (nathan-tpcc-geo:7) before the fatal; the fatal message is the very first thing we have.

tschottdorf@nathan-tpcc-geo-0007:/home/nathan/logs$ ls -lah
total 1.1M
drwxrwxr-x 4 nathan nathan 4.0K Apr 16 05:29 .
drwxr-xr-x 5 nathan nathan 4.0K Apr 16 05:29 ..
lrwxrwxrwx 1 nathan nathan   69 Apr 16 01:58 cockroach.log -> cockroach.nathan-tpcc-geo-0007.nathan.2019-04-16T01_58_52Z.003778.log
-rw-rw-r-- 1 nathan nathan 495K Apr 16 01:58 cockroach.nathan-tpcc-geo-0007.nathan.2019-04-16T01_58_52Z.003778.log
-rw-rw-r-- 1 nathan nathan 495K Apr 16 01:58 cockroach.stderr.log
-rw-rw-r-- 1 nathan nathan    0 Apr 16 00:41 cockroach.stdout.log
drwxr-xr-x 2 nathan nathan 4.0K Apr 16 00:41 goroutine_dump
drwxr-xr-x 2 nathan nathan 4.0K Apr 16 00:41 heap_profiler
-rw-rw-r-- 1 nathan nathan  37K Apr 16 02:02 roachprod.log

I was going to suggest that the printed diff was so large that it forced a log file rotation and triggered a bug through that, but r590 is actually basically empty -- in Nathan's RangeStatus it has 12 system keys and that's it (now it has 16), so the diff couldn't have been large. And even if it were, I don't know of any bugs in log rotation, especially not ones that would completely wipe out all prior log files.

https://github.com/cockroachdb/cockroach/blob/d0f758a345c7a22b03509d35eb19781cbd287061/pkg/util/log/clog.go#L1102-L1110

I even checked Nathan's .bash_history to see if he had moved things around, but of course he didn't. And there also aren't any other users' log dirs that the logs could've gone to (now there's one under my user since I restarted the cluster, but that's not it). All other crdb nodes have two log files, corresponding to the two times the cluster was started (around T00_40_58Z and T00_48_20Z). Only n5 has just this one log file, dated at the crash.

tbg commented 5 years ago

I ran the following for n=2,5,7 (i.e. the node indexes that have the replica)

roachprod ssh nathan-tpcc-geo:$n -- /home/nathan/cockroach debug range-data /mnt/data1/backup 590 --replicated > 590.$n.repl

and it gets, maybe, more interesting. The dumps for n=2,5 are identical. The one for n=7 is very different, which makes some sense since it has only applied up to log position 23, which is presumably where it crashed (22 is the consistency computation at GMT Tuesday, April 16, 2019 1:58:45 AM; 23 is an empty entry). (The other two dumps are at 28.) Interestingly, at log position 23 (on that node) there's tons of data. On the other nodes, a range deletion at index 27 (reproposed at 28) at time 1555379948806436077 (= GMT Tuesday, April 16, 2019 1:59:08 AM) nuked the range empty (save for some sys keys).

raft.2.txt raft.5.txt raft.7.txt

What I've done so far seems to show that when :7 is allowed to catch up to :2 and :5, it'll "look the same", i.e. the range is actually consistent. Whatever the inconsistency was, the range deletions have nuked it.

I don't know exactly what caused the range deletions, but my assumption is that the node crash failed the import, and that the import then tried to clean up by clearing out the keyspace. This happened within a minute of the crash, so we must expect inconsistencies like this to get wiped out regularly (though usually we'll at least get the diff). This in turn indicates that we'll want to disable that cleanup behavior in our nightly tests, or even better, take a cluster-wide RocksDB checkpoint when an inconsistency is detected. I know @andreimatei was looking at this; it seems high priority now.

On the other hand, since the resulting stats are consistent, this indicates that whatever the diff would have been likely didn't include the MVCCStats key.

I'm adding a little TODO list in the top post to summarize the action items/things to understand. Unfortunately, however, it doesn't seem like we're ever going to be able to find out what the reported inconsistency actually was in this case.

tbg commented 5 years ago

@ajkr this one's shaping up to be another puzzler -- it's similar to that private issue regarding the time series in that the stats don't seem to have diverged.

tbg commented 5 years ago

Oh no.. I think I might understand why the log file is missing:

https://github.com/cockroachdb/cockroach/blob/d0f758a345c7a22b03509d35eb19781cbd287061/pkg/util/log/clog.go#L1318-L1335

LogFilesCombinedMaxSize defaults to 100mb. If we wrote a huge diff, we'd rotate to a new file (we do this after 10mb) and then ask the gc daemon to run. The daemon preserves the latest file, but it will delete the previous one (the file holding the diff) because it bumps us over the 100mb threshold (assuming that file is itself way over 100mb).

I'd say it's plausible then to assume that we wrote a large diff and then shot ourselves in the foot by GC'ing it right away.
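
To make that concrete, here's a toy model of the GC pass as described above. Names and shapes are hypothetical; the real logic is in pkg/util/log/clog.go.

package main

import "fmt"

// logFile is a stand-in for the metadata the GC pass would look at.
type logFile struct {
	name string
	size int64 // bytes
}

// gcLogFiles walks files from newest to oldest, always keeps the newest one,
// and deletes older files once the running total exceeds maxCombined. This
// models the behavior described above; it is not the actual implementation.
func gcLogFiles(newestFirst []logFile, maxCombined int64) (deleted []string) {
	var total int64
	for i, f := range newestFirst {
		total += f.size
		if i == 0 {
			continue // the most recent file is always preserved
		}
		if total > maxCombined {
			deleted = append(deleted, f.name)
		}
	}
	return deleted
}

func main() {
	files := []logFile{
		{"cockroach.T01_58_52Z.log", 500 << 10}, // newest: tiny, holds only the fatal
		{"cockroach.T00_48_20Z.log", 130 << 20}, // previous: rotated out, holds the huge diff
	}
	// With a 100mb budget, the huge previous file blows the limit and gets
	// deleted, taking the diff (and all pre-crash context) with it.
	fmt.Println(gcLogFiles(files, 100<<20)) // [cockroach.T00_48_20Z.log]
}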

andreimatei commented 5 years ago

A few runs, nothing so far. More tomorrow. I've rebased my roachtest branch and added a geo-distributed import test here https://github.com/andreimatei/cockroach/tree/repro.geo-import-failure

tbg commented 5 years ago

Please pick up #36867 as well, or we may get a repro that isn't actionable, like we did before.

tbg commented 5 years ago

@nvanbenschoten or @andreimatei could either of you copy the /mnt/data1/backup directories on the roach nodes to S3? I don't know if we'll need them but there's no good reason for letting this cluster run forever. (I'd do it, but I'm out for the night and didn't think of it earlier)

nvanbenschoten commented 5 years ago

I already killed the cluster 😬

tbg commented 5 years ago

You sure?

$ roachprod ssh nathan-tpcc-geo:7
Warning: Permanently added '34.85.67.36' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 16.04.5 LTS (GNU/Linux 4.15.0-1026-gcp x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

1 package can be updated.
1 update is a security update.

New release '18.04.2 LTS' available.
Run 'do-release-upgrade' to upgrade to it.

Last login: Wed Apr 17 20:05:59 2019 from 188.97.85.189
tbg commented 5 years ago

Oh, I guess it got recreated! Nvm then. Probably wasn't ever going to look at it again anyway.

nvanbenschoten commented 5 years ago

Luckily I just hit this again on a new version of nathan-tpcc-geo. Like last time the raft logs look identical and like last time the logs rolled over and deleted the diff. I'm looking to see if there's anything else I can pull out of the cluster.

I wasn't trying to reproduce this, which is why I didn't have any debug info enabled. All future attempts will be run with #36867 and an increased LogFileMaxSize.

nvanbenschoten commented 5 years ago

The only similarity I see is that like in the first case, this consistency check failure was hit almost immediately after a split. In this case, we see the following entries:

Term:6 Index:11 : EMPTY
Term:6 Index:12  Put: /Local/Range/Table/59/1/114/43738/QueueLastProcessed/"consistencyChecker"
Term:6 Index:13 compute_checksum:<checksum_id:941c5b0f-7846-4546-8cf2-b1a132cdbf71 >
Term:6 Index:14 compute_checksum:<checksum_id:4cc632b2-4210-4eb0-965c-b92735e47fc1 save_snapshot:true >

Notice that term 6, index 11 is the first raft log term/position combo allowed. That indicates that this range had just split off from its parent range.

andreimatei commented 5 years ago

I got a bunch of repros (6 before I stopped the charade). I was running 8 node clusters across 4 GCP zones, 4-cpu machines. I didn't tweak the logging limits, but I believe we have the whole logs for at least 2 of them. We also have RocksDB checkpoints from when the diffs were detected. Briefly looking at two of them:

  1. cluster andrei-1555535390-11-n8cpu4-geo (GCE_PROJECT=andrei-jepsen): Here the diff seems to only contain the stats (part of the RangeAppliedState). Notice that the "expected" counts are negative:

    E190417 22:25:14.161556 86032 storage/replica_consistency.go:134  [n4,consistencyChecker,s4,r871/3:/Table/57/1/76{2/9/1…-4/9/61}] replica (n1,s1):2 is inconsistent: expected checksum 6f698c554509b49a1ad50666b095f4bcc1e710d0808c1fe20f66e0e09bf9a4b7c9dc94376d5cdfd7d26a3ef954731b6f4e61d7ea776a5fc6549e03b90a65226d, got 3603fbbddd95bf94b1d6646735ccdb3392a60e388d0455f24f9666b0522f81ea810251cb1d5d866521832bd3a5dabd554e03da4d81da4a17693006fbfd39d446
    persisted stats: exp {ContainsEstimates:true LastUpdateNanos:1555539780955426343 IntentAge:0 GCBytesAge:0 LiveBytes:-364 LiveCount:-23 KeyBytes:-285 KeyCount:-23 ValBytes:-79 ValCount:-23 IntentBytes:0 IntentCount:0 SysBytes:565 SysCount:5 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}, got {ContainsEstimates:true LastUpdateNanos:1555539780955426343 IntentAge:0 GCBytesAge:0 LiveBytes:34316122 LiveCount:58552 KeyBytes:1339660 KeyCount:58552 ValBytes:32976462 ValCount:58552 IntentBytes:0 IntentCount:0 SysBytes:565 SysCount:5 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
    --- leaseholder
    +++ follower
    -0.000000000,0 /Local/RangeID/871/r/RangeAppliedState
    -    ts:<zero>
    -    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x002]\xdbl\xfeY\x03\b8\x10!\x1aR\t'N\xd6m0c\x96\x15 \x94\xfd\xff\xff\xff\xff\xff\xff\xff\x01(\xe9\xff\xff\xff\xff\xff\xff\xff\xff\x010\xe3\xfd\xff\xff\xff\xff\xff\xff\xff\x018\xe9\xff\xff\xff\xff\xff\xff\xff\xff\x01@\xb1\xff\xff\xff\xff\xff\xff\xff\xff\x01H\xe9\xff\xff\xff\xff\xff\xff\xff\xff\x01`\xb5\x04h\x05p\x01"
    -    raw mvcc_key/value: 0169f70367727261736b00 120408001000180020002800325ddb6cfe5903083810211a5209274ed66d306396152094fdffffffffffffff0128e9ffffffffffffffff0130e3fdffffffffffffff0138e9ffffffffffffffff0140b1ffffffffffffffff0148e9ffffffffffffffff0160b50468057001
    +0.000000000,0 /Local/RangeID/871/r/RangeAppliedState
    +    ts:<zero>
    +    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x0025\x96\x1d\x85\xb5\x03\b8\x10!\x1a*\t'N\xd6m0c\x96\x15 ھ\xae\x10(\xb8\xc9\x030\x8c\xe2Q8\xb8\xc9\x03@\xce\xdc\xdc\x0fH\xb8\xc9\x03`\xb5\x04h\x05p\x01"
    +    raw mvcc_key/value: 0169f70367727261736b00 1204080010001800200028003235961d85b503083810211a2a09274ed66d3063961520dabeae1028b8c903308ce25138b8c90340cedcdc0f48b8c90360b50468057001
  2. cluster andrei-1555535390-13-n8cpu4-geo (GCE_PROJECT=andrei-jepsen): Here we have a follower that seems to be completely empty. The leaseholder has a ton of data, present in the diff.

    E190417 22:27:35.539789 77285 storage/replica_consistency.go:134  [n6,consistencyChecker,s6,r552/2:/Table/54/1/65{1/6/3…-3/5/1…}] replica (n1,s1):4 is inconsistent: expected checksum 5fe2eaef32849d0569eb74db44cd78a7d48ee7c85128bef37507f47a448b1facc67eb11e03b8d822bf9738732a10fdfd288e200219aca7534cdce5896ba87c83, got 39e9bc6e80dd52f7ce5350dfe2f2ce6bc9a79fd54bf956d5bdf376686a54cc57d33bfdbf42473b3f7d6f8865a7437eae30690d83e907e4427a40f5676b267574
    persisted stats: exp {ContainsEstimates:true LastUpdateNanos:1555539836245201052 IntentAge:0 GCBytesAge:0 LiveBytes:34315972 LiveCount:58548 KeyBytes:1339688 KeyCount:58548 ValBytes:32976284 ValCount:58548 IntentBytes:0 IntentCount:0 SysBytes:878 SysCount:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}, got {ContainsEstimates:false LastUpdateNanos:1555539836245201052 IntentAge:0 GCBytesAge:0 LiveBytes:0 LiveCount:0 KeyBytes:0 KeyCount:0 ValBytes:0 ValCount:0 IntentBytes:0 IntentCount:0 SysBytes:878 SysCount:7 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
    --- leaseholder
    +++ follower
    -0.000000000,0 /Local/RangeID/552/r/RangeAppliedState
    -    ts:<zero>
    -    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x0025\xffI4{\x03\b7\x10\x1a\x1a*\t\x9cP]M=c\x96\x15 Ľ\xae\x10(\xb4\xc9\x030\xa8\xe2Q8\xb4\xc9\x03@\x9c\xdb\xdc\x0fH\xb4\xc9\x03`\xee\x06h\ap\x01"
    -    raw mvcc_key/value: 0169f70228727261736b00 1204080010001800200028003235ff49347b030837101a1a2a099c505d4d3d63961520c4bdae1028b4c90330a8e25138b4c903409cdbdc0f48b4c90360ee0668077001
    +0.000000000,0 /Local/RangeID/552/r/RangeAppliedState
    +    ts:<zero>
    +    value:"\x12\x04\b\x00\x10\x00\x18\x00 \x00(\x002\x19m\xb2\xfa\xee\x03\b7\x10\x1a\x1a\x0e\t\x9cP]M=c\x96\x15`\xee\x06h\a"
    +    raw mvcc_key/value: 0169f70228727261736b00 12040800100018002000280032196db2faee030837101a1a0e099c505d4d3d63961560ee066807
    -1555535595.048202377,0 /Table/54/1/651/6/378/0
    -    ts:2019-04-17 21:13:15.048202377 +0000 UTC
    -    value:"մ\xcd\x1a\nF\x10Y7pwAnR0TQoU5nH8\x16\x02OE\x16\rPRICALLYCALLY\x16\x11K4CqqLGzrRCg74JuA\x16\x11coaW1dT9RiedgnRGN\x16\x11XSpIXhOAuYmfJebIz\x16\x02EB\x16\t835211111\x16\x100013839382606905\x18ʃʻ\b\x00\x16\x02GC\x15\x054\x8dLK@\x15\x03)\tw\x15\x04\x1a\x8a\x03\xe8\x15\x044\x8a\x03\xe8\x13\x02\x13\x00\x16\x83\aKo3yn9fyLwx0AL4bEuemxBaljZZtx8sWMIEe2HXikqs9Tf9CH6zL7q3FIrk6O9fnif3TtXL8OjaLiEWMAtzFcoAGRqc0GiBIEa1WrbG03f0GiODMNmX9Do7HP3hH0gcB3Mn6QL0gbFk9AFQDXTA2oourvZJ8wzF5ymvwroLaaXWRIo6so4bZuKPBNenKqJKabQkEQQHm9vR8LzpkZjkgEdsthSTfUHkrQnXGxEPK7Xg7w4fhqzgs0bV7hTo5XHdhpdIxMxt9aCNkMkhAqv0M3k7GLIhcpCT6wI2PRuHeCViWnBFQdSOF4hNFnTQL98qtTQBXd28IIpZPbYyuPPzkrdvTBkLx1otV0XtuxW4Eqg4yFBj8dJIQALmjb9sDVA7bD2TDqDr"
    ...

    More looking tomorrow.

tbg commented 5 years ago

andrei-1555535390-11-n8cpu4-geo

(n1,s1):2:

raft_applied_index:56 lease_applied_index:33 range_stats:<contains_estimates:true last_update_nanos:1555539780955426343 live_bytes:34316122 live_count:58552 key_bytes:1339660 key_count:58552 val_bytes:32976462 val_count:58552 sys_bytes:565 sys_count:5 >

(n4,s4):3:

raft_applied_index:56 lease_applied_index:33 range_stats:<contains_estimates:true last_update_nanos:1555539780955426343 live_bytes:-364 live_count:-23 key_bytes:-285 key_count:-23 val_bytes:-79 val_count:-23 sys_bytes:565 sys_count:5 >

ContainsEstimates=true is due to https://github.com/cockroachdb/cockroach/pull/36525, i.e. it doesn't tell us anything except that SSTs were ingested by this range at some point. Looking at the follower's stats this makes sense anyway, so it doesn't tell us much.

The checkpoints worked like a charm. On n4 (the leaseholder with the weird stats) the Raft log reveals... nothing. It's recently been truncated, so all there is is the queue activity.

So let's look at what's actually in r871 on n4:

$ ./cockroach debug check-store /mnt/data1/cockroach/auxiliary/checkpoints/2019-04-17T22\:25\:10Z/
...
(ignored) r871:/Table/57/1/76{2/9/1509-4/9/61} [(n6,s6):4, (n1,s1):2, (n4,s4):3, next=5, gen=110]: diff(actual, claimed): LiveBytes: 34316122 != -364
LiveCount: 58552 != -23
KeyBytes: 1339660 != -285
KeyCount: 58552 != -23
ValBytes: 32976462 != -79
ValCount: 58552 != -23

See something? The stats recomputed from the data actually in the replica match the follower's. It's the leaseholder whose persisted stats are messed up, so this looks like a true divergence.

Here are the logs for r871 (before the inconsistency):

https://gist.github.com/tbg/779bd8813ea8dec2d45579ca9d3e8c2d

There's nothing related to n1 and n4, but a little while before the crash we remove the replica on n5 which results in this message:

removed -58570 (-58575+5) keys in 1ms [clear=0ms commit=1ms]

This suggests that n5's stats were messed up as well.

tbg commented 5 years ago

andrei-1555535390-13-n8cpu4-geo

On n6 (the leaseholder) the stats agree with recomputation, so there's tons of data. The Raft log is again basically only log truncations and consistency checks. On n1 (a follower), the range stats key is

0.000000000,0 /Local/RangeID/552/r/RangeAppliedState (0x0169f70228727261736b00): raft_applied_index:55 lease_applied_index:26 range_stats:<last_update_nanos:1555539836245201052 sys_bytes:878 sys_count:7 >

i.e. very very empty. And guess what? It also agrees with recomputation (here's the dump of the data in it).

tbg commented 5 years ago

This looks like there are two different failure modes already.

In the first case, the leaseholder (and probably another follower, since removed) had the same data, but diverging stats. Tentatively speaking, it seems as if the leaseholder had at least forgotten to add the stats for the data it ingested at some point, though the fact that its stats are negative indicates that more might be going on.

In the second case, both replicas are internally consistent (i.e. stats match data) but one replica is missing tons of stuff. This looks more classically like what you'd expect if a replica "skipped" a raft command erroneously, similar to #35424.

It's frustrating not to have the Raft logs; I think we'd really need them to tell us what's up. @andreimatei can you disable the raft log queue and see if you can get a repro that way?

tbg commented 5 years ago

BTW, in Nathan's original issue, it seems like yet another failure mode: the stats were the same, but one node was missing data (and this was then swept under the rug by the range deletions).

tbg commented 5 years ago

for another addendum, there's no clear argument that the inconsistencies immediately followed splits.

In Andrei's first repro, the split happens a minute before the inconsistency is detected. Shortly after the split we see the (then) leaseholder n5 send a snapshot to n6. That snapshot is basically empty. n5 is later removed, which results in the inconsistent log line claiming to have removed a large negative number of keys. This number is just the (in-mem) stats of the range, so those got really messed up. This points at a split, as nothing else would've generated a negative delta.

@andreimatei let's try flipping this assertion on

https://github.com/cockroachdb/cockroach/blob/1cd61f9fa27ebcd7c16d79790ea4083b8c8f3666/pkg/storage/replica_raft.go#L2021

In the second repro, I can't find anything suspicious-looking; there are empty-looking snapshots right after the split so if those are to be trusted, the range starts out empty. It's worth pointing out that since ContainsEstimates=true in all the repros I've seen so far, SSTs have been ingested by all of the involved ranges directly (the RHS of a split starts out with no estimates).

Need more repros! This problem is way worse than I anticipated, but since it repros so easily I doubt that we'll have trouble tracking it down given enough time. BTW @andreimatei can you always share the SHAs that you're running? Right now I don't even know if it's master or release-19.1.

tbg commented 5 years ago

I found Andrei's branch, so I'm gonna start a repro cycle of my own.

tbg commented 5 years ago

Up and running, my branches are

crdb binary: https://github.com/tbg/cockroach/tree/repro/import-geo-36861
roachtest: https://github.com/tbg/cockroach/tree/repro.geo-import-failure (I made the test use a monitor and stop nodes when it fails)

and the command

./bin/roachtestandrei run import-geo --debug --cpu-quota=1000 --roachprod=bin/roachprod --count=100 --parallelism=10 --port=8080
tbg commented 5 years ago

The first failure I'm seeing is an OOM. I mention this because look where the memory is (inuse_space)

[heap profile screenshot: inuse_space]

we're reproposing lots of large commands (presumably ssts). Potentially this is just fallout of the assertions I added. In #35424, we were also reproposing continuously, but because of the log divergence. Here I probably managed to get the node so slow that the tick loop overwhelmed it.

Got another oom:

[heap profile screenshot]

Here it looks like we are inlining too many sideloaded files in Entries() (or perhaps holding on to them for too long).

tbg commented 5 years ago

Boom:

tobias-1555577996-03-n8cpu4-geo:7

F190418 11:14:44.497619 214 storage/replica_raft.go:2043  [n7,s7,r4083/3:/Table/58/1/55{6/927…-7/966…}] LHS split stats divergence: diff(claimed, computed) = [LiveBytes: -34906199 != 0 LiveCount: -103967 != 0 KeyBytes: -2321375 != 0 KeyCount: -103967 != 0 ValBytes: -32584824 != 0 ValCount: -103967 != 0 SysBytes: 643 != 841]
goroutine 214 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc0000b1b01, 0xc0000b1b00, 0x54e3200, 0x17)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1020 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5c781a0, 0xc000000004, 0x54e329f, 0x17, 0x7fb, 0xc0008f0480, 0x10f)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:878 +0x93d
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3b14800, 0xc00a4e3290, 0x4, 0x2, 0x341cb39, 0x38, 0xc006666fb0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d8
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3b14800, 0xc00a4e3290, 0x1, 0xc000000004, 0x341cb39, 0x38, 0xc006666fb0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3b14800, 0xc00a4e3290, 0x341cb39, 0x38, 0xc006666fb0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).processRaftCommand(0xc013cca000, 0x3b14800, 0xc00a4e3290, 0xc01b04cab0, 0x8, 0x8, 0x6f, 0x300000003, 0x2, 0xa, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:2043 +0x133a
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc013cca000, 0x3b14800, 0xc00a4e3290, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:816 +0x13df
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x3b14800, 0xc00a4e3290, 0xc013cca000, 0x3b14800)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3682 +0x120
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc000ac0c00, 0x3b14800, 0xc00a4e3290, 0xc006506fb0, 0xc01c8ffed0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3333 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc000ac0c00, 0x3b14800, 0xc000587f20, 0xff3)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3670 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc000998b00, 0x3b14800, 0xc000587f20)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:225 +0x21a
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x3b14800, 0xc000587f20)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc00021b240, 0xc000908000, 0xc00021b1e0)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:200 +0xe1
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xa8

tobias-1555577996-05-n8cpu4-geo:1

F190418 12:13:14.955871 124 storage/replica_raft.go:2043  [n1,s1,r6911/3:/Table/61/1/23{76/56…-81/48…}] LHS split stats divergence: diff(claimed, computed) = [LiveBytes: -34905985 != 0 LiveCount: -103963 != 0 KeyBytes: -2321287 != 0 KeyCount: -103963 != 0 ValBytes: -32584698 != 0 ValCount: -103963 != 0 SysBytes: 809 != 677 SysCount: 6 != 5]
goroutine 124 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000057b01, 0xc000057b60, 0x54e3200, 0x17)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1020 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5c781a0, 0xc000000004, 0x54e329f, 0x17, 0x7fb, 0xc00aeb2240, 0x120)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:878 +0x93d
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3b14800, 0xc0139ce450, 0x4, 0x2, 0x341cb39, 0x38, 0xc00a00efb0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d8
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3b14800, 0xc0139ce450, 0x1, 0xc000000004, 0x341cb39, 0x38, 0xc00a00efb0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3b14800, 0xc0139ce450, 0x341cb39, 0x38, 0xc00a00efb0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).processRaftCommand(0xc00c5d2a00, 0x3b14800, 0xc0139ce450, 0xc0070caa30, 0x8, 0x7, 0x29, 0x300000003, 0x1, 0x11, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:2043 +0x133a
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc00c5d2a00, 0x3b14800, 0xc0139ce450, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:816 +0x13df
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x3b14800, 0xc0139ce450, 0xc00c5d2a00, 0x3b14800)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3682 +0x120
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc000944c00, 0x3b14800, 0xc0139ce450, 0xc096f42e00, 0xc021953ed0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3333 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc000944c00, 0x3b14800, 0xc0009dc0f0, 0x1aff)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3670 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc0009cdb80, 0x3b14800, 0xc0009dc0f0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:225 +0x21a
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x3b14800, 0xc0009dc0f0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc000204590, 0xc0004b6090, 0xc000204580)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:200 +0xe1
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xa8
tbg commented 5 years ago

In both cases, what we're seeing is that after applying the split trigger (downstream of Raft), including its side effects, recomputing the stats of the LHS gives a result that disagrees with r.GetMVCCStats() (and presumably also with the persisted stats, though that remains to be seen).

The claimed stats here are "very negative". The recomputed stats are basically zero. In both examples we see a mismatch in SysBytes.

We have the full Raft logs here, nothing was truncated. Time to hopefully figure something out!

tbg commented 5 years ago

From the first repro listed above (-03-):

$ diff repro03/log.n{3,7}.4083
317c317
< proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:10 max_lease_index:23 replicated_eval_result:<timestamp:<wall_time:1555585912446414057 > delta:<contains_estimates:true last_update_nanos:1555585912446414057 live_bytes:34906315 live_count:103974 key_bytes:2321529 key_count:103974 val_bytes:32584786 val_count:103974 > add_sstable:<crc32:2391755456 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" >
---
> proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:10 max_lease_index:12 replicated_eval_result:<timestamp:<wall_time:1555585912446414057 > delta:<contains_estimates:true last_update_nanos:1555585912446414057 live_bytes:34906315 live_count:103974 key_bytes:2321529 key_count:103974 val_bytes:32584786 val_count:103974 > add_sstable:<crc32:2391755456 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" >
321c321
< proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:10 max_lease_index:23 replicated_eval_result:<timestamp:<wall_time:1555585912446414057 > delta:<contains_estimates:true last_update_nanos:1555585912446414057 live_bytes:34906315 live_count:103974 key_bytes:2321529 key_count:103974 val_bytes:32584786 val_count:103974 > add_sstable:<crc32:2391755456 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" >
---
> proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:10 max_lease_index:12 replicated_eval_result:<timestamp:<wall_time:1555585912446414057 > delta:<contains_estimates:true last_update_nanos:1555585912446414057 live_bytes:34906315 live_count:103974 key_bytes:2321529 key_count:103974 val_bytes:32584786 val_count:103974 > add_sstable:<crc32:2391755456 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" >
405c405
< proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:10 max_lease_index:32 replicated_eval_result:<timestamp:<wall_time:1555586014659531402 > delta:<contains_estimates:true last_update_nanos:1555586014659531402 live_bytes:34906199 live_count:103967 key_bytes:2321375 key_count:103967 val_bytes:32584824 val_count:103967 > add_sstable:<crc32:3893794671 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" >
---
> proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:10 max_lease_index:27 replicated_eval_result:<timestamp:<wall_time:1555586014659531402 > delta:<contains_estimates:true last_update_nanos:1555586014659531402 live_bytes:34906199 live_count:103967 key_bytes:2321375 key_count:103967 val_bytes:32584824 val_count:103967 > add_sstable:<crc32:3893794671 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" >

More clearly, here's a block of reproposals of some AddSSTable command from n3's raft log:

0.000000000,0 /Local/RangeID/4083/u/RaftLog/logIndex:59 (0x0169f70ff3757266746c000000000000003b00): Term:6 Index:59  by lease #10
proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:10 max_lease_index:12 replicated_eval_result:<timestamp:<wall_time:1555585912446414057 > delta:<contains_estimates:true last_update_nanos:1555585912446414057 live_bytes:34906315 live_count:103974 key_bytes:2321529 key_count:103974 val_bytes:32584786 val_count:103974 > add_sstable:<crc32:2391755456 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 
write batch:

0.000000000,0 /Local/RangeID/4083/u/RaftLog/logIndex:60 (0x0169f70ff3757266746c000000000000003c00): Term:6 Index:60  by lease #10
proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:10 max_lease_index:23 replicated_eval_result:<timestamp:<wall_time:1555585912446414057 > delta:<contains_estimates:true last_update_nanos:1555585912446414057 live_bytes:34906315 live_count:103974 key_bytes:2321529 key_count:103974 val_bytes:32584786 val_count:103974 > add_sstable:<crc32:2391755456 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 
write batch:

0.000000000,0 /Local/RangeID/4083/u/RaftLog/logIndex:61 (0x0169f70ff3757266746c000000000000003d00): Term:6 Index:61  by lease #10
proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:10 max_lease_index:23 replicated_eval_result:<timestamp:<wall_time:1555585912446414057 > delta:<contains_estimates:true last_update_nanos:1555585912446414057 live_bytes:34906315 live_count:103974 key_bytes:2321529 key_count:103974 val_bytes:32584786 val_count:103974 > add_sstable:<crc32:2391755456 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 
write batch:

0.000000000,0 /Local/RangeID/4083/u/RaftLog/logIndex:62 (0x0169f70ff3757266746c000000000000003e00): Term:6 Index:62  by lease #10
proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:10 max_lease_index:23 replicated_eval_result:<timestamp:<wall_time:1555585912446414057 > delta:<contains_estimates:true last_update_nanos:1555585912446414057 live_bytes:34906315 live_count:103974 key_bytes:2321529 key_count:103974 val_bytes:32584786 val_count:103974 > add_sstable:<crc32:2391755456 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 
write batch:

0.000000000,0 /Local/RangeID/4083/u/RaftLog/logIndex:63 (0x0169f70ff3757266746c000000000000003f00): Term:6 Index:63  by lease #10
proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:10 max_lease_index:23 replicated_eval_result:<timestamp:<wall_time:1555585912446414057 > delta:<contains_estimates:true last_update_nanos:1555585912446414057 live_bytes:34906315 live_count:103974 key_bytes:2321529 key_count:103974 val_bytes:32584786 val_count:103974 > add_sstable:<crc32:2391755456 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 
write batch:

0.000000000,0 /Local/RangeID/4083/u/RaftLog/logIndex:64 (0x0169f70ff3757266746c000000000000004000): Term:6 Index:64  by lease #10
proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:10 max_lease_index:23 replicated_eval_result:<timestamp:<wall_time:1555585912446414057 > delta:<contains_estimates:true last_update_nanos:1555585912446414057 live_bytes:34906315 live_count:103974 key_bytes:2321529 key_count:103974 val_bytes:32584786 val_count:103974 > add_sstable:<crc32:2391755456 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 
write batch:

The corresponding block on n7 looks identical, except the second and third entries have max_lease_index 12 (not 23).

log.n3.4083.txt log.n5.4083.txt log.n7.4083.txt

tbg commented 5 years ago

Oh, and I missed one. This is from n3; note the max_lease_index: 32. The entries before and after are identical but have max_lease_index: 27. On n7, entry 75 has max_lease_index: 27, like its neighbors.


0.000000000,0 /Local/RangeID/4083/u/RaftLog/logIndex:75 (0x0169f70ff3757266746c000000000000004b00): Term:6 Index:75  by lease #10
proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:10 max_lease_index:32 replicated_eval_result:<timestamp:<wall_time:1555586014659531402 > delta:<contains_estimates:true last_update_nanos:1555586014659531402 live_bytes:34906199 live_count:103967 key_bytes:2321375 key_count:103967 val_bytes:32584824 val_count:103967 > add_sstable:<crc32:3893794671 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 
write batch:
tbg commented 5 years ago

PS if we wanted to isolate a change that is likely to have caused this, my bet would be https://github.com/cockroachdb/cockroach/pull/35261.

I'll keep looking more for patterns for now.

tbg commented 5 years ago

Moving on to the -05- cluster:

log.n3.6911.txt log.n6.6911.txt log.n1.6911.txt

The assertion tripped at index 41, so I'll clip the logs there.

Same game:

0.000000000,0 /Local/RangeID/6911/u/RaftLog/logIndex:25 (0x0169f71aff757266746c000000000000001900): Term:6 Index:25  by lease #17
proposer_replica:<node_id:3 store_id:3 replica_id:1 > proposer_lease_sequence:17 max_lease_index:8 replicated_eval_result:<timestamp:<wall_time:1555589532901757864 > delta:<contains_estimates:true last_update_nanos:1555589532901757864 live_bytes:34905985 live_count:103963 key_bytes:2321287 key_count:103963 val_bytes:32584698 val_count:103963 > add_sstable:<crc32:4248373557 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 
write batch:

0.000000000,0 /Local/RangeID/6911/u/RaftLog/logIndex:26 (0x0169f71aff757266746c000000000000001a00): Term:6 Index:26  by lease #17
proposer_replica:<node_id:3 store_id:3 replica_id:1 > proposer_lease_sequence:17 max_lease_index:12 replicated_eval_result:<timestamp:<wall_time:1555589532901757864 > delta:<contains_estimates:true last_update_nanos:1555589532901757864 live_bytes:34905985 live_count:103963 key_bytes:2321287 key_count:103963 val_bytes:32584698 val_count:103963 > add_sstable:<crc32:4248373557 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 
write batch:

This is on n3. On n1 and n6, the entry at index 26 has max_lease_index: 8, making it identical to the preceding command in the log (which all nodes agree on).

tbg commented 5 years ago

Note that n3 has the longest log (it has an entry at 48; n1 has 41 which is the crashing split; n6 only has index 27 and below), so it's basically guaranteed to have been the raft leader. On the -03- repro (the earlier one) n7 crashed on its latest entry, the split at 111, n3 has 112, n5 is behind at 98. So here n3 is the raft leader (and, btw, the followers agree with each other). Additionally we see from the proposer_lease that n3 was also the leaseholder (as it ought to be in the common case).

tbg commented 5 years ago

Summing this up: we seem to be seeing a problem with how the max_lease_index is assigned when reproposals happen. It seems (pending confirmation via further repros which have probably happened by now) that

  1. the problem always occurs on AddSSTable (but that's likely only chance)
  2. the problem seems to be with the leader's log: the max_lease_index for some entries diverges from the copy sent to the followers in ways that can be unintuitive (in the first repro, going from 27 to 32 back to 27 while the followers stay at 27) but usually the phenomenon is that an outlier higher max_lease_index shows up in the leader's log while the followers repeat the existing one.
nvanbenschoten commented 5 years ago

Here's a crazy idea: what if the new logic in #35261 is allowing a proposal from before a split, one that would no longer be valid after the split, to be reproposed and applied after the split? For instance, what would happen if we proposed an AddSSTable request for keys [a,z) at max_lease_index=10, a split occurred at LeaseAppliedIndex=11 to split the range into [a,m) and [m,z), and we then reproposed the AddSSTable request with max_lease_index=12 after hitting a proposalIllegalLeaseIndex?

tbg commented 5 years ago

How would the split be proposed? The latches held by the AddSSTable would hold it back, wouldn't they? Also, we're seeing Raft log divergence, which I think is ultimately what we have to explain (diverging max lease indexes definitely imply divergence given the right circumstances).

The problem basically has to be some kind of memory sharing between https://github.com/cockroachdb/cockroach/blob/3b281b44e522c62ac2be2988d0f05f952f8adfda/pkg/storage/replica_raft.go#L2077-L2086

and tick-based reproposals. max_lease_index is assigned in https://github.com/cockroachdb/cockroach/blob/3b281b44e522c62ac2be2988d0f05f952f8adfda/pkg/storage/replica_raft.go#L61-L64

which (for a command already in the log) is only ever called from tryReproposeWithNewLeaseIndex. I don't know how else these different max lease indexes could turn up. At the same time, even if that were all completely messed up, the commands get serialized before being handed to Raft, so how do we end up with a divergence? Neither should the raft entry cache be able to screw anything up because it only sees the marshaled proposal.

I'm very confused, but we seem to be in a good spot for cracking this. I got at least one more repro in the meantime, so I'm going to look at that now.

tbg commented 5 years ago

oh PS - the offending AddSSTable commands were always adjacent in the logs and before the failing split. The splits don't seem to play a role; they just happen to contain the assertion that catches the divergence, which is presumably caused by the divergent max_lease_index across followers.

tbg commented 5 years ago

repro -14-: nodes 2, 4, 5
log.n4.4045.txt log.n2.4045.txt log.n5.4045.txt

Raft logs for 4 and 5 agree. Both crashed with the assertion, probably at index 59 or 60. n2 only has entry 59 (also a split) but didn't crash; the situation here is a bit weird because n2 proposed the splits. Presumably the leaseholder was n2 but the Raft leadership was at 4 or 5. Anyway, here's a triplicated AddSSTable from n4:

0.000000000,0 /Local/RangeID/4045/u/RaftLog/logIndex:48 (0x0169f70fcd757266746c000000000000003000): Term:11 Index:48  by lease #11
proposer_replica:<node_id:2 store_id:2 replica_id:3 > proposer_lease_sequence:11 max_lease_index:12 replicated_eval_result:<timestamp:<wall_time:1555596342963051881 > delta:<contains_estimates:true last_update_nanos:1555596342963051881 live_bytes:34906017 live_count:103957 key_bytes:2321155 key_count:103957 val_bytes:32584862 val_count:103957 > add_sstable:<crc32:1664229362 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 
write batch:

0.000000000,0 /Local/RangeID/4045/u/RaftLog/logIndex:49 (0x0169f70fcd757266746c000000000000003100): Term:11 Index:49  by lease #11
proposer_replica:<node_id:2 store_id:2 replica_id:3 > proposer_lease_sequence:11 max_lease_index:12 replicated_eval_result:<timestamp:<wall_time:1555596342963051881 > delta:<contains_estimates:true last_update_nanos:1555596342963051881 live_bytes:34906017 live_count:103957 key_bytes:2321155 key_count:103957 val_bytes:32584862 val_count:103957 > add_sstable:<crc32:1664229362 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 
write batch:

0.000000000,0 /Local/RangeID/4045/u/RaftLog/logIndex:50 (0x0169f70fcd757266746c000000000000003200): Term:11 Index:50  by lease #11
proposer_replica:<node_id:2 store_id:2 replica_id:3 > proposer_lease_sequence:11 max_lease_index:12 replicated_eval_result:<timestamp:<wall_time:1555596342963051881 > delta:<contains_estimates:true last_update_nanos:1555596342963051881 live_bytes:34906017 live_count:103957 key_bytes:2321155 key_count:103957 val_bytes:32584862 val_count:103957 > add_sstable:<crc32:1664229362 > > write_batch:<data:"\000\000\000\000\000\000\000\000\000\000\000\000" > 
write batch:

On n2, the middle one has max_lease_index: 17.

It makes sense that n2 didn't crash: on n2, with that higher lease index, the SST probably applied. On the other nodes it didn't, so they were off when the next split (evaluated on n2) came by.

Simple question: how does tryReproposeWithNewLeaseIndex result in a proposal being added to the leader's log with the new lease index, while the followers still receive a duplicate of the unmodified proposal?

nvanbenschoten commented 5 years ago

The latches held by the AddSSTable would hold it back, wouldn't they?

I was thinking that there could have been a lease transfer in the meantime, but I forgot that NotLeaseHolderErrors take priority over proposalIllegalLeaseIndex retries in checkForcedErrLocked.

The problem basically has to be some kind of memory sharing

Yes, that seems like the most likely cause. Here's some logic that looks very sketchy to me: https://github.com/cockroachdb/cockroach/blob/b47269e0691d93e0b0c6df22270fee6afb0b6925/pkg/storage/replica_sideload.go#L78

If it was called from here when trying to append to the leader's log, I bet we could see a proposal with a future MaxLeaseIndex: https://github.com/cockroachdb/cockroach/blob/b47269e0691d93e0b0c6df22270fee6afb0b6925/pkg/storage/replica_raft.go#L682

tbg commented 5 years ago

Oh, that... looks sketchy. I hadn't remembered that; agreed that it looks like a hot trail. I'll be out for a few hours (probably out for good for today, actually), so I wouldn't mind a bulletproof writeup of that theory when I come back 😄

tbg commented 5 years ago

repro clusters are here:

tobias-1555577996-03-n8cpu4-geo:   [gce]  8   (6h1m48s)
tobias-1555577996-04-n8cpu4-geo:   [gce]  8   (6h1m48s)
tobias-1555577996-05-n8cpu4-geo:   [gce]  8   (6h1m48s)
tobias-1555577996-14-n8cpu4-geo:   [gce]  8   (9h1m48s)

Planning to just let them expire (6h and 9h) unless anyone thinks they're going to be useful.

nvanbenschoten commented 5 years ago

In the following case, we favor our in-memory representation of a RaftCommand from r.mu.proposals over the serialized command we got from etcd/raft: https://github.com/cockroachdb/cockroach/blob/b47269e0691d93e0b0c6df22270fee6afb0b6925/pkg/storage/replica_sideload.go#L115-L119

So the theory is that when appending to a Replica's persistent Raft log, we check whether we have the RaftCommand locally in memory. On the leaseholder we do; on the followers we don't. So on the followers everything works out as normal and we unmarshal the byte slice given to us by etcd/raft. However, on the leaseholder we ignore the byte slice given to us by etcd/raft and favor our in-memory RaftCommand from r.mu.proposals. We assume that this is exactly the same as the command provided to us by Raft, just conveniently already unmarshalled. But this isn't always true! Specifically, if the command has since been reproposed again by tryReproposeWithNewLeaseIndex, it will have a larger MaxLeaseIndex. We then proceed to modify this version of the RaftCommand, marshal it, and write it to our Raft log. Fast-forward to the end of the consensus round and we just applied the wrong entry on the leaseholder.

There are two things that make AddSSTable requests special here:

  1. they're the only proposals that hit this logic in maybeSideloadEntriesImpl
  2. they won't be stored in the raftentry.Cache, so when they are applied, they are guaranteed to come from the persistent Raft log (which we just saw was corrupted)
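
To make the aliasing concrete, here's a self-contained toy model of this theory. The names are hypothetical stand-ins for ProposalData/RaftCommand and Replica.mu.proposals; this is not the actual storage package code.

package main

import "fmt"

// raftCommand stands in for the real RaftCommand proto.
type raftCommand struct {
	MaxLeaseIndex uint64
	// payload elided
}

func main() {
	// The proposer tracks its in-flight proposals by command ID.
	proposals := map[string]*raftCommand{}

	// Propose at MaxLeaseIndex 12; the serialized copy is what the followers
	// will later decode from the raft entry bytes.
	cmd := &raftCommand{MaxLeaseIndex: 12}
	proposals["cmd1"] = cmd
	followerCopy := *cmd

	// A tryReproposeWithNewLeaseIndex-style retry later bumps the lease
	// index on the *shared* in-memory struct.
	cmd.MaxLeaseIndex = 23

	// When the leaseholder appends the original entry to its own log, the
	// sideloading "happy case" trusts the in-memory copy instead of
	// re-decoding the entry bytes, so its log diverges from the followers'.
	leaderCopy := *proposals["cmd1"]

	fmt.Println("follower applies MaxLeaseIndex =", followerCopy.MaxLeaseIndex) // 12
	fmt.Println("leader applies   MaxLeaseIndex =", leaderCopy.MaxLeaseIndex)   // 23
}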
nvanbenschoten commented 5 years ago

So the fix would be either to not have a "happy case" in maybeSideloadEntriesImpl, or to add a flag to ProposalData that prevents it from being used like this once it has been reproposed at least once.

nvanbenschoten commented 5 years ago

An option for the 19.1 release would be to revert #35261 entirely. Fixing this issue will be straightforward (if the theory is correct), but this is revealing a new class of bugs that may now be lurking. What do you think @bdarnell?

nvanbenschoten commented 5 years ago

Specifically, before #35261 Replica.mu.proposals[cmdID].RaftCommand was immutable after the command was proposed to Raft. Anyone could take a command ID and find its corresponding RaftCommand exactly as it looked when proposed. This is no longer the case and we now need to be careful about never assuming that a command in Replica.mu.proposals is an accurate in-memory representation of an arbitrary encoded Raft command with the same ID.

nvanbenschoten commented 5 years ago

I'm running another iteration with the following diff. I expect it to have no issues:

diff --git a/pkg/storage/replica_sideload.go b/pkg/storage/replica_sideload.go
index 63b3734..92a709a 100644
--- a/pkg/storage/replica_sideload.go
+++ b/pkg/storage/replica_sideload.go
@@ -113,7 +113,7 @@ func maybeSideloadEntriesImpl(
                        ent := &entriesToAppend[i]
                        cmdID, data := DecodeRaftCommand(ent.Data) // cheap
                        strippedCmd, ok := maybeRaftCommand(cmdID)
-                       if ok {
+                       if ok && false {
                                // Happy case: we have this proposal locally (i.e. we proposed
                                // it). In this case, we can save unmarshalling the fat proposal
                                // because it's already in-memory.
bdarnell commented 5 years ago

An option for the 19.1 release would be to revert #35261 entirely. Fixing this issue will be straightforward (if the theory is correct), but this is revealing a new class of bugs that may now be lurking. What do you think @bdarnell?

My initial reaction is that #35261 significantly reduced a common source of retry errors and it would be unfortunate to lose it. I'd rather lose the "happy case", which at best applies to only one of the replicas. This looks like the only problematic use of r.mu.proposals (outside the proposal pipeline itself, which I vetted as part of #35261).

nvanbenschoten commented 5 years ago

I'm running another iteration with the following diff. I expect it to have no issues

No consistency check failure in sight.

Specifically, before #35261 Replica.mu.proposals[cmdID].RaftCommand was immutable after the command was proposed to Raft. Anyone could take a command ID and find its corresponding RaftCommand exactly as it looked when proposed.

Heh, @tbg actually picked up on the need for this invariant back when we introduced this inlining code: https://github.com/cockroachdb/cockroach/commit/917039941244698ecbb64f4c8b9e6b123624715d#diff-0f4e61b240de001f1b97a87fc31af043R341.

I'd rather lose the "happy case" that at best applies to only one of the replicas.

Agreed. The "happy case" doesn't seem like it's buying us all that much since it only applies on the leaseholder.

tbg commented 5 years ago

I'd also rather fix the bug.

tbg commented 5 years ago

Reopening until we've added the geo import test to our suite. What cadence should we be running it at? It seems to tickle new behavior. From what I saw during the repros, the geo distribution causes costly reproposals. I assume we'll get value out of observing this test.