Open serathius opened 1 year ago
cc @ahrtr @ptabor
Please attach the db file if possible.
No db file, only the mentioned dump with data redacted.
This issue seems similar to https://github.com/etcd-io/bbolt/issues/402
Under bucket branch page, in keys bucket there was a branch page linking to another bucket branch page.
It doesn't help to provide such vague info, please provide at least all related page IDs next time.
It's also most likely incorrect info. I do not see any dedicated alarm pages at all, since the pageID is 0, which means there is no alarm or it's an inline page.
One branch item somehow pointed to an old root page. The other abnormal point is that two meta pages pointed to the same root page (690).
@ahrtr Just wondering, how did you draw that diagram? Is there any bbolt specific tool for that or did you use a general tool?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
What happened?
Etcd started crashlooping with
When analysing the db file I found an invalid etcd db file layout. Under the bucket branch page, in the `keys` bucket there was a branch page linking to another bucket branch page. This resulted in bbolt returning key `alarm` when reading the whole `keys` bucket. This is a correct layout for bbolt, but not for etcd.
From etcd's point of view this is invalid as it assumes that all keys in the `keys` bucket are revision numbers. The panic from above comes from the `bytesToRev` function that parses a revision. It failed as the main rev has 8 bytes, while key "alarm" has only 5 bytes.
This means that at some point bbolt either:
We can't exclude a hardware issue that resulted in memory corruption.
Providing the dump.txt for further investigation
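To illustrate the length mismatch described above, here is a minimal Go sketch of a defensive revision parser. It is not etcd's actual `bytesToRev`. The key layout (8-byte big-endian main revision, a `_` separator, 8-byte sub revision) and the names `parseRev`, `encodeRev` and `revBytesLen` are assumptions made for the example. The point is only that a 5-byte key such as "alarm" can be rejected with an error instead of a panic.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// revBytesLen assumes this key layout: 8-byte main revision,
// a '_' separator, then an 8-byte sub revision.
const revBytesLen = 8 + 1 + 8

// revision is a simplified stand-in for the revision type.
type revision struct {
	main int64
	sub  int64
}

// parseRev is a hypothetical defensive parser: it validates the key
// before decoding and returns an error rather than panicking.
func parseRev(b []byte) (revision, error) {
	if len(b) < revBytesLen {
		return revision{}, fmt.Errorf("key %q has %d bytes, want at least %d", b, len(b), revBytesLen)
	}
	if b[8] != '_' {
		return revision{}, fmt.Errorf("key %q: missing '_' separator", b)
	}
	return revision{
		main: int64(binary.BigEndian.Uint64(b[0:8])),
		sub:  int64(binary.BigEndian.Uint64(b[9:17])),
	}, nil
}

// encodeRev builds a key in the assumed layout, for demonstration only.
func encodeRev(r revision) []byte {
	b := make([]byte, revBytesLen)
	binary.BigEndian.PutUint64(b[0:8], uint64(r.main))
	b[8] = '_'
	binary.BigEndian.PutUint64(b[9:17], uint64(r.sub))
	return b
}
```

Calling `parseRev([]byte("alarm"))` returns an error because the key is shorter than `revBytesLen`, while `parseRev(encodeRev(revision{main: 42, sub: 7}))` round-trips. Whether etcd should surface such an error or still treat it as fatal is a separate design question.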
What did you expect to happen?
Want to report the issue to start the discussion of etcd handling potential memory corruptions.
Assuming that this was indeed a memory corruption, I expect that etcd should avoid writing a corrupted page to disk. Running with mmapped memory comes with the risk of memory stamping; etcd should have a mechanism that prevents corruption from being persisted.
I was discussing with @ptabor the idea of a protected mode for bbolt, where it would verify every write to ensure corruptions are not persisted.
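One possible shape of such a check, as a rough Go sketch: remember a checksum whenever a page is legitimately modified, then recompute it before flushing and refuse to write if anything changed in between. This is not how bbolt works today and not necessarily the design that was discussed. `guardedPages`, `pageID`, `put` and `flush` are hypothetical names, and CRC32 is an arbitrary choice for the example.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// pageID and guardedPages are hypothetical types for this sketch;
// they are not bbolt types.
type pageID uint64

type guardedPages struct {
	data map[pageID][]byte // dirty page buffers awaiting flush
	sums map[pageID]uint32 // checksum recorded at the last tracked update
}

func newGuardedPages() *guardedPages {
	return &guardedPages{
		data: make(map[pageID][]byte),
		sums: make(map[pageID]uint32),
	}
}

// put records a legitimate modification and remembers its checksum.
func (g *guardedPages) put(id pageID, buf []byte) {
	cp := append([]byte(nil), buf...)
	g.data[id] = cp
	g.sums[id] = crc32.ChecksumIEEE(cp)
}

// flush verifies every dirty page against its recorded checksum and
// only then hands the pages to write. If any page fails verification,
// nothing is written.
func (g *guardedPages) flush(write func(pageID, []byte) error) error {
	for id, buf := range g.data {
		if crc32.ChecksumIEEE(buf) != g.sums[id] {
			return fmt.Errorf("page %d changed after its last tracked update, refusing to persist", id)
		}
	}
	for id, buf := range g.data {
		if err := write(id, buf); err != nil {
			return err
		}
	}
	return nil
}
```

A check like this would only catch stray writes that land between the tracked update and the flush. Corruption that happens while the page is being built, or in structures the checksum does not cover, would still go unnoticed.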
How can we reproduce it (as minimally and precisely as possible)?
Don't think so.
Anything else we need to know?
No response
Etcd version (please run commands below)
v3.4.21
Etcd configuration (command line flags or environment variables)
Nothing unusual
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
N/A
Relevant log output
No response