Closed andrewbaptist closed 6 months ago
cc @cockroachdb/replication
It wasn't clear if the file was written and then deleted or never written at all
Good point regarding deletion. I explored this, and it seems like ~one of the possible scenarios is~:
@erikgrinaker @nvanbenschoten WDYT, does this sound plausible?
Specifically, this assumption got broken after introduction of async log writes: https://github.com/cockroachdb/cockroach/blob/e8367041383e1123b2f71cfd4b54439ab4ac56b8/pkg/kv/kvserver/logstore/logstore.go#L295-L296
Hi @pavelkalinnikov, please add branch-* labels to identify which branch(es) this release-blocker affects.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Ah no, nonBlockingSync
can't be true when overwriting
:
https://github.com/cockroachdb/cockroach/blob/e8367041383e1123b2f71cfd4b54439ab4ac56b8/pkg/kv/kvserver/logstore/logstore.go#L248
False alarm.
There still something weird here. Would appreciate a second pair of eyes on the following:
Even though we don't use async log write when there are log entries to be overwritten https://github.com/cockroachdb/cockroach/blob/e8367041383e1123b2f71cfd4b54439ab4ac56b8/pkg/kv/kvserver/logstore/logstore.go#L239-L252
we still may commit without sync and proceed to deleting sideloaded entries:
If any of these lines incorrectly assign false, we may have a bug. https://github.com/cockroachdb/cockroach/blob/e8367041383e1123b2f71cfd4b54439ab4ac56b8/pkg/kv/kvserver/logstore/logstore.go#L236-L237
It's hard to tell exactly whether len(m.Responses) > 0
is a safe check to rely on. It's assigned in etcd/raft as follows: https://github.com/etcd-io/raft/blob/f3fa38bee62ec85ed7c9ee36c1073348c845bf4f/rawnode.go#L248-L260.
Previously, we relied on the MustSync
field which was well defined: https://github.com/etcd-io/raft/blob/f3fa38bee62ec85ed7c9ee36c1073348c845bf4f/rawnode.go#L163.
Another suspect for a sneaky deletion is https://github.com/cockroachdb/cockroach/blob/6a447621356ba0366cc0db15cb3dcf26327d1152/pkg/kv/kvserver/replica_application_result.go#L488-L495 although it promises that it's safe.
It would be worth double checking though that the truncated state is indeed synced before we get to this line.
Upd: well, we don't sync. https://github.com/cockroachdb/cockroach/blob/6a447621356ba0366cc0db15cb3dcf26327d1152/pkg/kv/kvserver/replica_app_batch.go#L557-L568
So, here is a new scenario to try:
This issue is possibly related to #38566 (and #36414 linked from it).
Managed to reproduce this panic in #114191. Going to fix it and polish the test.
The panic in #114191 is due to a different issue: #114411. The latter should be fixed first.
Now #114191 reproduces the right failure.
If I change the sync behaviour here https://github.com/cockroachdb/cockroach/blob/e5570cb99a5221dd80bb7cae6f92714d2d082755/pkg/kv/kvserver/replica_app_batch.go#L561-L564
to sync := true
, the test stops panicking. This proves the conjecture that the root cause is the lack of state machine sync before removing the sideloaded files.
The fix will be setting sync
to true if the truncation command touches any sideloaded files.
Downgrading to non-blocker, pre-existing issue.
Describe the problem
In a customer environment, a restore caused the disks to fill up which resulted in many of the nodes crashing due to full disks / disk stalls (Note: This should probably be a separate issue, as this shouldn't cause a crash). During one of the crashes, we "lost" a side-loaded file. It wasn't clear if the file was written and then deleted or never written at all, but either way, there was an "unprocessed" Raft entry pointing to the file, and the file was gone. This caused the system to repeatedly crash loop until the node was taken offline and replaced by a different file.
The panic error is:
loading sideloaded data: sideloaded file not found.
To Reproduce
I have not attempted to reproduce it yet, but it would be worth trying a db restore and repeatedly killing one of the nodes during the restore using
kill -9
. We should also test with a full reboot of the OS as it may catch other issues as well.Expected behavior The side loaded files and the Raft log should be consistent.
Environment:
Jira issue: CRDB-32766