cockroachdb / cockroach

crash looping -- sideloaded file not found #113135

Closed andrewbaptist closed 6 months ago

andrewbaptist commented 8 months ago

Describe the problem

In a customer environment, a restore caused the disks to fill up, which resulted in many of the nodes crashing due to full disks / disk stalls. (Note: this should probably be a separate issue, as it shouldn't cause a crash.) During one of the crashes we "lost" a sideloaded file. It wasn't clear whether the file was written and then deleted or never written at all, but either way there was an "unprocessed" Raft entry pointing to the file, and the file was gone. This caused the node to repeatedly crash loop until it was taken offline and replaced by a different node.

The panic error is: `loading sideloaded data: sideloaded file not found`.

To Reproduce

I have not attempted to reproduce this yet, but it would be worth running a database restore and repeatedly killing one of the nodes during the restore with `kill -9`. We should also test with a full OS reboot, as that may catch other issues as well.
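
For what it's worth, here is a rough sketch of such a crash-injection loop. This is not an existing harness; the binary, flags, store path, and timings are placeholders, and a real repro would target one node of a multi-node cluster while a RESTORE is in flight:

```go
// Rough crash-injection loop: start a node, let it run for a random interval,
// then kill it uncleanly (the equivalent of kill -9), and repeat.
package main

import (
	"log"
	"math/rand"
	"os/exec"
	"time"
)

func main() {
	for i := 0; i < 20; i++ {
		// Placeholder invocation; a real repro would restart one node of a
		// multi-node cluster while a RESTORE is running.
		cmd := exec.Command("cockroach", "start-single-node",
			"--insecure", "--store=/tmp/crash-repro")
		if err := cmd.Start(); err != nil {
			log.Fatal(err)
		}
		time.Sleep(time.Duration(5+rand.Intn(30)) * time.Second)
		// Process.Kill delivers an uncatchable SIGKILL on Unix.
		if err := cmd.Process.Kill(); err != nil {
			log.Fatal(err)
		}
		_ = cmd.Wait()
	}
}
```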

Expected behavior

The sideloaded files and the Raft log should be consistent.

Environment:

Jira issue: CRDB-32766

blathers-crl[bot] commented 8 months ago

cc @cockroachdb/replication

pav-kv commented 8 months ago

> It wasn't clear if the file was written and then deleted or never written at all

Good point regarding deletion. I explored this, and it seems like ~~one of the possible scenarios is~~ (a rough code sketch follows the numbered steps):

  1. Raft log has a few sideloaded entries.
  2. Raft gets a log write which overrides a tail of entries. https://github.com/cockroachdb/cockroach/blob/e8367041383e1123b2f71cfd4b54439ab4ac56b8/pkg/kv/kvserver/logstore/logstore.go#L174
  3. An async log write is sent to Pebble. https://github.com/cockroachdb/cockroach/blob/e8367041383e1123b2f71cfd4b54439ab4ac56b8/pkg/kv/kvserver/logstore/logstore.go#L252-L259
  4. In the meantime, execution proceeds and removes a bunch of overridden raft entries. https://github.com/cockroachdb/cockroach/blob/e8367041383e1123b2f71cfd4b54439ab4ac56b8/pkg/kv/kvserver/logstore/logstore.go#L291-L304
  5. Crash, before the async log write from step 3 completes.
  6. Restart sees the old log state from step 1, but some entries' sideloaded files are missing because they were removed in step 4.
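
A minimal Go sketch of this ordering, using raw Pebble and the filesystem as stand-ins for the logstore code (the key and filename helpers are invented, not the real encodings):

```go
// Illustrative only: a simplified model of the hypothesized ordering.
package sketch

import (
	"fmt"
	"os"

	"github.com/cockroachdb/pebble"
)

func raftLogKey(index uint64) []byte {
	return []byte(fmt.Sprintf("raftlog/%016d", index))
}

func sideloadedPath(index, term uint64) string {
	return fmt.Sprintf("sideloaded/i%d.t%d", index, term)
}

// overwriteTail models steps 2-4: the overwrite of the log tail is handed to
// the engine without waiting for durability (NoSync stands in for the async
// write), while the sideloaded payload of the overwritten entry is deleted
// from the filesystem immediately and unconditionally.
func overwriteTail(db *pebble.DB, index, oldTerm uint64, newEntry []byte) error {
	b := db.NewBatch()
	if err := b.Set(raftLogKey(index), newEntry, nil); err != nil {
		return err
	}
	if err := b.Commit(pebble.NoSync); err != nil {
		return err
	}
	// A crash after this point can lose the unsynced overwrite but not the
	// file removal, leaving the old entry at `index` pointing at a missing file.
	return os.Remove(sideloadedPath(index, oldTerm))
}
```
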
pav-kv commented 8 months ago

@erikgrinaker @nvanbenschoten WDYT, does this sound plausible?

pav-kv commented 8 months ago

Specifically, this assumption was broken by the introduction of async log writes: https://github.com/cockroachdb/cockroach/blob/e8367041383e1123b2f71cfd4b54439ab4ac56b8/pkg/kv/kvserver/logstore/logstore.go#L295-L296

blathers-crl[bot] commented 8 months ago

Hi @pavelkalinnikov, please add branch-* labels to identify which branch(es) this release-blocker affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

pav-kv commented 8 months ago

Ah no, `nonBlockingSync` can't be true when overwriting: https://github.com/cockroachdb/cockroach/blob/e8367041383e1123b2f71cfd4b54439ab4ac56b8/pkg/kv/kvserver/logstore/logstore.go#L248

False alarm.

pav-kv commented 8 months ago

There's still something weird here, and I would appreciate a second pair of eyes on the following:

Even though we don't use an async log write when there are log entries to be overwritten: https://github.com/cockroachdb/cockroach/blob/e8367041383e1123b2f71cfd4b54439ab4ac56b8/pkg/kv/kvserver/logstore/logstore.go#L239-L252

we may still commit without sync and then proceed to delete the sideloaded entries:

https://github.com/cockroachdb/cockroach/blob/e8367041383e1123b2f71cfd4b54439ab4ac56b8/pkg/kv/kvserver/logstore/logstore.go#L276

pav-kv commented 8 months ago

If any of these lines incorrectly assigns `false`, we may have a bug. https://github.com/cockroachdb/cockroach/blob/e8367041383e1123b2f71cfd4b54439ab4ac56b8/pkg/kv/kvserver/logstore/logstore.go#L236-L237

It's hard to tell exactly whether `len(m.Responses) > 0` is a safe check to rely on. It's assigned in etcd/raft as follows: https://github.com/etcd-io/raft/blob/f3fa38bee62ec85ed7c9ee36c1073348c845bf4f/rawnode.go#L248-L260.

Previously, we relied on the `MustSync` field, which was well defined: https://github.com/etcd-io/raft/blob/f3fa38bee62ec85ed7c9ee36c1073348c845bf4f/rawnode.go#L163.
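
For reference, `MustSync` in etcd-io/raft boils down to roughly the following (a paraphrase, not an exact copy; see the permalink above): new entries or a term/vote change force a synchronous write. Whether `len(m.Responses) > 0` lines up with that is exactly the open question.

```go
// Paraphrase of etcd-io/raft's MustSync: a ready must be written
// synchronously if it appends entries or changes the durable part of the
// HardState (term or vote).
package sketch

import "go.etcd.io/raft/v3/raftpb"

func mustSync(st, prevst raftpb.HardState, entsnum int) bool {
	return entsnum != 0 || st.Vote != prevst.Vote || st.Term != prevst.Term
}
```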

pav-kv commented 8 months ago

Another suspect for a sneaky deletion is https://github.com/cockroachdb/cockroach/blob/6a447621356ba0366cc0db15cb3dcf26327d1152/pkg/kv/kvserver/replica_application_result.go#L488-L495, although it promises to be safe.

It would be worth double-checking, though, that the truncated state is indeed synced before we get to this line.

Update: well, we don't sync. https://github.com/cockroachdb/cockroach/blob/6a447621356ba0366cc0db15cb3dcf26327d1152/pkg/kv/kvserver/replica_app_batch.go#L557-L568

pav-kv commented 8 months ago

So, here is a new scenario to try (a rough code sketch follows the steps):

  1. Batch with the log truncation is committed (but not synced).
  2. Sideloaded files are removed (as a side effect of this commit).
  3. Kill (before the new state from step 1 is synced).
  4. Restart briefly sees the pre-commit state, and panics on missing sideloaded entries.
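
A minimal Go sketch of this sequence, again with raw Pebble and the filesystem standing in for the real apply path (helper names are invented; the real code lives around the replica_app_batch.go permalink above):

```go
// Illustrative only: a simplified model of the apply-side sequence.
package sketch

import (
	"fmt"
	"os"

	"github.com/cockroachdb/pebble"
)

func truncatedStateKey(rangeID int64) []byte {
	return []byte(fmt.Sprintf("r%d/raft-truncated-state", rangeID))
}

// applyTruncation models steps 1-2: the new truncated state is committed
// without sync, and the sideloaded files below the new first index are removed
// right away as a post-apply side effect. A crash before the next engine sync
// (step 3) can discard the truncated state but not the file removals, so the
// restart in step 4 finds log entries whose sideloaded payloads are gone.
func applyTruncation(db *pebble.DB, rangeID int64, newTruncState []byte, purgedFiles []string) error {
	b := db.NewBatch()
	if err := b.Set(truncatedStateKey(rangeID), newTruncState, nil); err != nil {
		return err
	}
	if err := b.Commit(pebble.NoSync); err != nil { // mirrors the sync=false at the permalink
		return err
	}
	for _, f := range purgedFiles {
		if err := os.Remove(f); err != nil && !os.IsNotExist(err) {
			return err
		}
	}
	return nil
}
```
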
pav-kv commented 8 months ago

This issue is possibly related to #38566 (and #36414 linked from it).

pav-kv commented 8 months ago

Managed to reproduce this panic in #114191. Going to fix it and polish the test.

pav-kv commented 7 months ago

The panic in #114191 is due to a different issue: #114411. The latter should be fixed first.

pav-kv commented 7 months ago

Now #114191 reproduces the right failure.

If I change the sync behaviour here https://github.com/cockroachdb/cockroach/blob/e5570cb99a5221dd80bb7cae6f92714d2d082755/pkg/kv/kvserver/replica_app_batch.go#L561-L564

to `sync := true`, the test stops panicking. This proves the conjecture that the root cause is the lack of a state machine sync before removing the sideloaded files.

The fix will be to set `sync` to `true` if the truncation command touches any sideloaded files.
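
A sketch of the shape such a fix could take, with invented names (this is not the actual patch; the real change is in #114191):

```go
// Illustrative sketch: commit the apply batch with sync=true whenever the raft
// log truncation being applied covers sideloaded entries, so the truncated
// state is durable before their files are deleted.
package sketch

// applyBatch stands in for the storage batch used while applying commands.
type applyBatch interface {
	Commit(sync bool) error
}

// commitTruncation syncs only when the truncation's side effect will delete
// sideloaded files, since that deletion cannot be rolled back after a crash.
func commitTruncation(b applyBatch, truncatesSideloadedEntries bool) error {
	return b.Commit(truncatesSideloadedEntries /* sync */)
}
```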

pav-kv commented 7 months ago

#114191 now also has the fix. Hopefully I will be able to polish and merge the whole thing this coming week.

erikgrinaker commented 6 months ago

Downgrading to non-blocker; this is a pre-existing issue.