@aphyr Did you manually & intentionally modify the db file?
Yup! That's what Jepsen is for--fault injection testing. You might recall etcd contracting me to do this same kind of work in 2019. :-)
Thanks for the feedback. The BoltDB file isn't supposed to be manually modified, even in tests. Just curious, had you ever seen such an issue previously when you did the same kind of test?
I mean yeah, sure, hardware is supposed to be perfect! However, non-ECC machines, disks, faulty network controllers, bad VM hypervisors, et al. do occasionally cause bit-flip errors. Given that etcd has already done some work to detect these kinds of errors (for instance, the CRC checks and the --experimental-initial-corrupt-check flag), I figure this might be within y'all's fault model. If nothing else, it points to a fruitful avenue for testing the corruption checker in the future.
And no, our previous tests didn't perform this type of fault injection. This is new work, motivated by faults that real systems have exhibited in the past.
Note that the CRC check currently only covers the WAL file, --experimental-initial-corrupt-check can't resolve data corruption, and the existing implementation also has flaws. Of course, we are working to improve it.
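As a rough illustration of what a CRC-guarded record check buys you (this is not etcd's actual WAL code; the record layout and names are made up for the example):

```go
package main

import (
	"errors"
	"fmt"
	"hash/crc32"
)

// record is a toy stand-in for a WAL entry: a payload plus its stored checksum.
// etcd's real WAL format differs; this only illustrates the verification idea.
type record struct {
	data []byte
	crc  uint32
}

var errCorrupt = errors.New("crc mismatch: record is corrupt")

// verify recomputes the checksum over the payload and compares it to the
// stored value; a single flipped bit in the payload is reliably caught.
func verify(r record) error {
	if crc32.ChecksumIEEE(r.data) != r.crc {
		return errCorrupt
	}
	return nil
}

func main() {
	r := record{data: []byte("put foo bar")}
	r.crc = crc32.ChecksumIEEE(r.data)

	fmt.Println(verify(r)) // <nil>

	r.data[0] ^= 0x01      // simulate a single bit flip on disk
	fmt.Println(verify(r)) // crc mismatch: record is corrupt
}
```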
BoltDB manages data page by page via a B+ tree, and that structure is delicate. The data file should only ever be updated by BoltDB itself.
We did see a couple of data file corruption issues previously, e.g. 13406. I may consider delivering a tool to automatically recover & fix a corrupted data file in the future.
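In the meantime, bbolt already exposes a consistency check over its page structure, which is useful for inspecting a possibly-damaged file. A minimal sketch (the data-file path is just a placeholder):

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open the member's bbolt file read-only; the path here is a placeholder.
	db, err := bolt.Open("/var/lib/etcd/member/snap/db", 0o400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Tx.Check walks the B+ tree pages and reports structural inconsistencies.
	err = db.View(func(tx *bolt.Tx) error {
		for e := range tx.Check() {
			fmt.Println("inconsistency:", e)
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

This is roughly what the bbolt command-line check subcommand does under the hood.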
What happened?
With etcd 3.5.3, running with --experimental-initial-corrupt-check, starting etcd with disk files that have had a few (p(flip) =~ 0.001) random bit flips can cause all kinds of exciting behavior. Sometimes the corruption check detects the checksum mismatch and panics properly:
However, we can also induce SIGSEGV and SIGBUS crashes. Here's an example Jepsen run with full logs and tarballs of the data dir available if you'd like to see for yourself. Here's a segfault:
And a SIGBUS:
Other times we get a panic because of an index out of range:
Or, in this test, a panic involving "invalid type: meta":
Or messages about illegal tag 0:
Or panic: cannot use none as id:
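For context, the bit-flip injection described above is conceptually simple; the actual nemesis lives in the Jepsen test repo linked under "How can we reproduce it". A rough Go equivalent (the path and function name are illustrative, not the test's) would be:

```go
package main

import (
	"log"
	"math/rand"
	"os"
)

// corruptFile flips each bit of the file independently with probability p,
// approximating the kind of damage faulty hardware might do to a data file.
// This is an illustrative sketch, not the actual Jepsen nemesis code.
func corruptFile(path string, p float64) error {
	buf, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	for i := range buf {
		for bit := 0; bit < 8; bit++ {
			if rand.Float64() < p {
				buf[i] ^= 1 << bit
			}
		}
	}
	return os.WriteFile(path, buf, 0o600)
}

func main() {
	// The path is a placeholder; point it at a copy of an etcd member's db file.
	if err := corruptFile("./db-copy", 0.001); err != nil {
		log.Fatal(err)
	}
}
```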
What did you expect to happen?
I know that corruption checks are experimental, so this bug is likely a low-priority issue, but it does feel like a corrupt data file should result in something a little less spooky than a SIGBUS or SIGSEGV.
How can we reproduce it (as minimally and precisely as possible)?
Check out https://github.com/jepsen-io/etcd at commit 9624a6cebb051856622b27bd3878b1b2797d9fe6 and run, e.g.:
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Each node is started without a config file, using CLI flags like:
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response