
roachtest: 22.1 restore 2TB/nodes=10/with-pause failed -- unavailable range #93718

Closed msbutler closed 1 year ago

msbutler commented 1 year ago

The health monitor that accompanies the 22.1 restore2TB/nodes=10/with-pause test failed the test due to an unavailable range. See the original github comment here for artifacts. While we know a resumed restore will stress the system because it causes crdb to reingest a significant amount of data, we don't expect it to cause range unavailability.

It could be worth finding the root cause of the range unavailability. I'm hoping a kv engineer can take a look, as I've gleaned as much as I can from the logs.

Some context on resumed restores:

Here's what I've gleaned from the logs: the restore2TB/nodes=10/with-pause test conducts a restore while up to 3 pauses are applied to the job. After the first pause, the logs indicate that SST ingestion is delayed by backpressure (around 08:08). These warnings continued until the roachtest failed.
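For context, the pause half of the test is roughly the following loop, expressed here as a minimal Go sketch using CockroachDB's PAUSE JOB / RESUME JOB statements; the function name, the sleep intervals, and the job-ID plumbing are illustrative assumptions, not the actual roachtest code:

```go
package sketch

import (
	"context"
	gosql "database/sql"
	"fmt"
	"time"
)

// pauseResumeRestore sketches the with-pause behavior: pause the running
// RESTORE job, wait for it to actually reach the paused state, resume it, and
// repeat up to three times. Error handling and timings are simplified.
func pauseResumeRestore(ctx context.Context, db *gosql.DB, jobID int64) error {
	for pauses := 0; pauses < 3; pauses++ {
		time.Sleep(time.Minute) // let the restore make some progress first

		if _, err := db.ExecContext(ctx, fmt.Sprintf("PAUSE JOB %d", jobID)); err != nil {
			return err
		}
		// PAUSE JOB only requests a pause; poll until the job reports 'paused'.
		for {
			var status string
			if err := db.QueryRowContext(ctx,
				`SELECT status FROM [SHOW JOBS] WHERE job_id = $1`, jobID,
			).Scan(&status); err != nil {
				return err
			}
			if status == "paused" {
				break
			}
			time.Sleep(5 * time.Second)
		}
		// Resuming forces the restore to re-ingest a significant amount of
		// data, which is the load this test is designed to exercise.
		if _, err := db.ExecContext(ctx, fmt.Sprintf("RESUME JOB %d", jobID)); err != nil {
			return err
		}
	}
	return nil
}
```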

After all three pauses were applied, n4 failed a node liveness heartbeat:

W221214 08:43:37.750459 237 kv/kvserver/liveness/liveness.go:883 ⋮ [n4,liveness-hb] 6340  slow heartbeat took 4.50848s; err=result is ambiguous: context done during DistSender.Send: ‹rpc error: code = DeadlineExceeded desc = context deadline exceeded›

Apparently because of unavailable ranges, per the health alerts:

W221214 08:43:45.259442 264 server/node.go:894 ⋮ [n4,summaries] 6357  health alerts detected: {Alerts:[{StoreID:0 Category:‹METRICS› Description:‹liveness.heartbeatfailures› Value:5} {StoreID:4 Category:‹METRICS› Description:‹ranges.unavailable› Value:57} {StoreID:4 Category:‹METRICS› Description:‹ranges.underreplicated› Value:319}]}

The health monitor then failed the roachtest, not the restore job itself.

health: 08:43:55 restore.go:101: health check terminated with read tcp 172.17.0.3:58058->34.23.138.96:26257: read: connection reset by peer
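For reference, the health monitor that tripped here is conceptually just a loop that issues a trivial query against the cluster on an interval and fails the test the moment a query errors out; a minimal sketch of that shape (not the actual restore.go health check, whose query and interval may differ):

```go
package sketch

import (
	"context"
	gosql "database/sql"
	"fmt"
	"time"
)

// monitorHealth pings the cluster periodically and returns an error, failing
// the test, as soon as a probe query fails. Illustrative sketch only.
func monitorHealth(ctx context.Context, db *gosql.DB) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// A trivial query doubles as a liveness probe: if the node dies or
			// its connection drops, this surfaces errors like the
			// "connection reset by peer" seen above.
			var one int
			if err := db.QueryRowContext(ctx, `SELECT 1`).Scan(&one); err != nil {
				return fmt.Errorf("health check terminated with %w", err)
			}
		}
	}
}
```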

Jira issue: CRDB-22497

Epic CRDB-20293

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/disaster-recovery

msbutler commented 1 year ago

Adding a release-blocker tag to this, as I'm unsure what the root cause of the range unavailability is.

cockroach-teamcity commented 1 year ago

roachtest.restore2TB/nodes=10/with-pause failed with artifacts on release-22.1 @ 0403676c25d7c9bd0cadc42211885d649085e499:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore2TB/nodes=10/with-pause/run_1
    assertions.go:262,assertions.go:695,require.go:1351,restore.go:798,test_runner.go:883: 
            Error Trace:    restore.go:798
                                        test_runner.go:883
                                        asm_amd64.s:1581
            Error:          Should not be: 3
            Test:           restore2TB/nodes=10/with-pause
            Messages:       the job should have paused at least once

    test_impl.go:259,require.go:1354,restore.go:798,test_runner.go:883: 
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)
See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #90023 roachtest: restore2TB/nodes=10/with-pause failed [C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-22.2]
- #85448 roachtest: restore2TB/nodes=10/with-pause failed [C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-21.2]

This test on roachdash | Improve this report!

msbutler commented 1 year ago

^latest failure is not relevant.

cockroach-teamcity commented 1 year ago

roachtest.restore2TB/nodes=10/with-pause failed with artifacts on release-22.1 @ 6f2abfb07a6bb8fc7c9d1e52c7b5e8aab2dac030:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore2TB/nodes=10/with-pause/run_1
    assertions.go:262,assertions.go:695,require.go:1351,restore.go:798,test_runner.go:883: 
            Error Trace:    restore.go:798
                                        test_runner.go:883
                                        asm_amd64.s:1581
            Error:          Should not be: 3
            Test:           restore2TB/nodes=10/with-pause
            Messages:       the job should have paused at least once

    test_impl.go:259,require.go:1354,restore.go:798,test_runner.go:883: 
Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)
See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #90023 roachtest: restore2TB/nodes=10/with-pause failed [C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-22.2]
- #85448 roachtest: restore2TB/nodes=10/with-pause failed [C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-21.2]

This test on roachdash | Improve this report!

msbutler commented 1 year ago

^unrelated flake. I wish the x-noreuse label worked on 22.1

tbg commented 1 year ago

teamcity-7952904-1670998741-21-n10cpu4-0007> I221214 08:42:01.243634 201 kv/kvserver/pkg/kv/kvserver/store_raft.go:522 ⋮ [n7,s7,r9/8:‹/Table/1{3-4}›,raft] 6241 raft ready handling: 17.18s [append=0.00s, apply=0.00s , commit-batch-sync=17.18s, other=0.00s], wrote 2.3 KiB sync [append-ent=2.2 KiB (1), ]; node might be overloaded

Whatever happened right around 08:42 really slammed the machine's storage subsystem: it took 17+s to fsync 2.3 KiB (to be fair, other writes might've been batched in with this one, but the other writes complaining about this are also very small).
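As a sanity check that this is the device and not something above it, one can time a small synced write directly on the store's volume, independent of pebble. A quick probe sketch (path and payload size are arbitrary assumptions):

```go
package sketch

import (
	"os"
	"time"
)

// probeSyncLatency writes a 4 KiB payload to a scratch file in dir and
// measures how long the fsync takes. A healthy disk returns in single-digit
// milliseconds; anything approaching the 17s+ seen above points at the device
// or hypervisor rather than at pebble.
func probeSyncLatency(dir string) (time.Duration, error) {
	f, err := os.CreateTemp(dir, "sync-probe-*")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if _, err := f.Write(make([]byte, 4096)); err != nil {
		return 0, err
	}
	start := time.Now()
	if err := f.Sync(); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}
```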


I poked at the tsdump a bit and unfortunately it's not obvious what was going on here. Storage got slow, yes, but why?

The hardware dashboard doesn't show any dramatic changes at 8:41:44, which is the timestamp at which I can first pin slow I/O from the logs.

The pebble logs, to my semi-trained eye, look OK. It's compacting quite a bit, at output rates of 50-100 MB/s, so the volume is fine. No write stalls, no inverted LSM best I can tell, and the compaction rate graphs also look "flat".

Meanwhile, in the main logs, likely as a result of the slow raft handling caused by slow pebble batch commit, receive queues are filling up, etc, and nothing really works any more. It's not surprising that we see liveness failures and unavailable ranges in that kind of situation.

Btw this can't be right, no? We're just running on a regular store. Just pasting in case this means anything or someone wants to fix it:

        "properties": {
          "encrypted": false,
          "read_only": false,
          "file_store_properties": {
            "path": "/mnt/data1/cockroach",
            "fs_type": "tmpfs",
            "block_device": "tmpfs",
            "mount_point": "/run/user/1000",
            "mount_options": "rw,nosuid,nodev,relatime,size=1536256k,mode=700,uid=1000,gid=1000"
          }
        }
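If someone does pick this up: a report like that usually means the mount-point lookup matched the wrong entry for the store path. The usual fix is to scan the mount table and take the mount whose mount point is the longest path-prefix of the store directory, matching on whole path components. A hedged sketch of that lookup (not the code cockroach actually uses to populate file_store_properties):

```go
package sketch

import (
	"os"
	"path/filepath"
	"strings"
)

// fsTypeForPath scans /proc/mounts and returns the mount point and filesystem
// type of the mount that is the longest path-prefix of path. Anything sloppier
// (e.g. raw string-prefix matching or taking the first/last match) is the kind
// of bug that can attribute /mnt/data1/cockroach to a tmpfs like /run/user/1000.
func fsTypeForPath(path string) (mountPoint, fsType string, err error) {
	data, err := os.ReadFile("/proc/mounts")
	if err != nil {
		return "", "", err
	}
	for _, line := range strings.Split(string(data), "\n") {
		// /proc/mounts format: device mountpoint fstype options dump pass
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue
		}
		mp, typ := fields[1], fields[2]
		// Match on whole path components, not raw string prefixes.
		rel, relErr := filepath.Rel(mp, path)
		if relErr != nil || strings.HasPrefix(rel, "..") {
			continue
		}
		if len(mp) > len(mountPoint) {
			mountPoint, fsType = mp, typ
		}
	}
	return mountPoint, fsType, nil
}
```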

I'm at the end of my rope here. Pebble isn't obviously doing poorly, but if pebbleBatch.Commit takes >17s it definitely is. Will ask the storage team to take a look.

Artifacts are here: https://drive.google.com/drive/folders/1fyKQkbEzRXL_Dw16Nlw-HRm8IXtI4SqQ?usp=sharing The interesting period starts at 8:41:44, which, judging from the logs, is when the first major stall happened (it's 17s before the "handle ready" message that reports a 17s delay).

tbg commented 1 year ago

Ok found more direct evidence of pebble slowness:

artifacts/logs/2.unredacted/cockroach.log:F221214 08:43:49.311778 285477 storage/pebble.go:915 ⋮ [n2] 7626 disk stall detected: pebble unable to write to ‹/mnt/data1/cockroach/auxiliary/sideloading/r0XXXX/r4825/i53.t6› in 20.00 seconds
artifacts/logs/2.unredacted/cockroach.log:F221214 08:43:49.444175 285507 storage/pebble.go:915 ⋮ [n2] 7631 disk stall detected: pebble unable to write to ‹/mnt/data1/cockroach/auxiliary/sideloading/r0XXXX/r4865/i29.t6› in 20.02 seconds
artifacts/logs/6.unredacted/cockroach.log:F221214 08:43:49.900875 307193 storage/pebble.go:915 ⋮ [n6] 6934 disk stall detected: pebble unable to write to ‹/mnt/data1/cockroach/auxiliary/sideloading/r0XXXX/r2994/i57.t7› in 20.00 seconds
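These fatals come from a disk-stall watchdog: each write is raced against a timer, and if it hasn't completed within the configured threshold (20s here) the process crashes loudly instead of letting raft and liveness time out quietly. A minimal sketch of the pattern (not pebble's actual disk-health-checking VFS):

```go
package sketch

import (
	"log"
	"os"
	"time"
)

// writeWithStallDetection races a file write+sync against a watchdog timer.
// If the write does not complete within maxDelay, treat the disk as stalled
// and terminate, mirroring the "disk stall detected" fatals above.
func writeWithStallDetection(f *os.File, p []byte, maxDelay time.Duration) error {
	done := make(chan error, 1)
	go func() {
		_, err := f.Write(p)
		if err == nil {
			err = f.Sync()
		}
		done <- err
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(maxDelay):
		log.Fatalf("disk stall detected: unable to write to %s in %s", f.Name(), maxDelay)
		return nil // unreachable; log.Fatalf exits the process
	}
}
```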

tbg commented 1 year ago

Here it is directly from the logs. This is n6; you can see how it's timing out.

https://gist.github.com/tbg/9aac8a33a6908096c8da36d3e356cc72

Same for n2 https://gist.github.com/tbg/377bb18baba611d99cb4a57d642f2781

tbg commented 1 year ago

Cluster name was teamcity-7952904-1670998741-21-n10cpu4-0001.

tbg commented 1 year ago

BTW I removed release-blocker since this is likely a cloud flake somehow, given that it stalled two nodes at around the same time. Slack thread

nicktrav commented 1 year ago

One thing to note is that nodes 7 and 10 were also seeing disk slowness around the same time as nodes 2 and 6. Nodes 7 and 10 seemed to recover, while nodes 2 and 6 did not, and ultimately crashed.

So we have four nodes seeing disk slowness around the same time. It's possible, though unlikely, that those four nodes were all sharing the same physical machine or storage infrastructure. Perhaps we could ask GCP Support to take a look?

nicktrav commented 1 year ago

We're still operating under the assumption that this was an infra flake. However, we decided to reach out to GCP support to see if there was anything they could glean about the correlated failures. Link to support case (internal).

I'm going to hang onto this one until we get word back on the support case.

nicktrav commented 1 year ago

The support case (linked above) is still open. There's no immediate action we can take here related to this test specifically, so I'm going to close this out.