Open ishan16696 opened 1 year ago
I tried to reproduce this test failure manually by corrupting the db file of an etcd cluster member, but I did not observe the BoltDB panic in my local testing, and single member restoration in the etcd cluster was successful.
Describe the bug: This etcd-druid e2e test fails because it leaves etcd and backup-restore in an unrecoverable state.
Root cause analysis: When the db file in the etcd data-dir is corrupted and backup-restore tries to verify the DB by opening it with BoltDB, a panic occurs in BoltDB. BoltDB does not expect a corrupted DB file, because the file contains its page information and other metadata. This leaves etcd and backup-restore in an unrecoverable state.
Expected behavior: If the db file in the etcd data-dir is corrupted, backup-restore should detect this. For cluster size=1 it should be able to restore the etcd data-dir, and for cluster size>1 it should be able to trigger the single member restoration scenario.
How To Reproduce (as minimally and precisely as possible):
Logs: Backup-restore container logs:
full logs of etcd-backup-restore:
Logs
Environment (please complete the following information):
Anything else we need to know?: