facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org
GNU General Public License v2.0
27.85k stars 6.2k forks

Fix DB failed to resume after "no space left on device" error #12767

Open YadongWang-a opened 3 weeks ago

YadongWang-a commented 3 weeks ago

Fixes #11643

Summary: The cause of this issue is that, after recovery from a "no space" error, the seen_error flag in the WritableFileWriter was not reset. As I understand it, the seen_error flag exists to prevent repeated write retries while an error is present. There is a precedent in SyncWalImpl, where the flag is also reset when error_recovery_in_prog is true, so it should be acceptable to reset it in ResumeImpl as well. The reset is only valid after a successful resume, and it must happen before 'OnErrorRecoveryCompleted' is called; the changes are as follows.
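
To illustrate the failure mode described above, here is a minimal, self-contained sketch of a "sticky" error flag that blocks writes until it is explicitly reset on resume. It is a toy model and not RocksDB code: `ToyStatus`, `ToyWriter`, `ToyDB`, and the `reset_flag` parameter are hypothetical names invented for this example, and the real WritableFileWriter and ResumeImpl are considerably more involved.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-ins. These are NOT RocksDB's real classes.
enum class ToyStatus { kOk, kNoSpace, kIOError };

class ToyWriter {
 public:
  // Appends data unless an earlier failure left the sticky flag set.
  ToyStatus Append(const std::string& data) {
    if (seen_error_) return ToyStatus::kIOError;  // refuse further writes
    if (device_full_) {
      seen_error_ = true;  // remember that a write failed
      return ToyStatus::kNoSpace;
    }
    contents_ += data;
    return ToyStatus::kOk;
  }
  void set_device_full(bool full) { device_full_ = full; }
  bool device_full() const { return device_full_; }
  void reset_seen_error() { seen_error_ = false; }
  const std::string& contents() const { return contents_; }

 private:
  bool device_full_ = false;  // simulated "no space left on device"
  bool seen_error_ = false;   // sticky flag, cleared only by reset
  std::string contents_;
};

class ToyDB {
 public:
  ToyStatus Put(const std::string& record) { return wal_.Append(record); }

  // Assumed resume path: succeeds once space is available again.
  // reset_flag == false models the behavior before the fix.
  bool Resume(bool reset_flag) {
    if (wal_.device_full()) return false;  // still out of space
    if (reset_flag) wal_.reset_seen_error();
    return true;  // a real DB would notify its listeners after this point
  }
  ToyWriter& wal() { return wal_; }

 private:
  ToyWriter wal_;
};

// Walks through the scenario: fill the disk, free it, resume, write again.
void DemoNoSpaceRecovery() {
  ToyDB db;
  assert(db.Put("a") == ToyStatus::kOk);
  db.wal().set_device_full(true);
  assert(db.Put("b") == ToyStatus::kNoSpace);
  db.wal().set_device_full(false);          // space has been freed
  assert(db.Resume(/*reset_flag=*/false));  // resume reports success...
  assert(db.Put("c") == ToyStatus::kIOError);  // ...but writes still fail
  assert(db.Resume(/*reset_flag=*/true));   // resume with the flag reset
  assert(db.Put("c") == ToyStatus::kOk);    // writes work again
}
```

Calling `DemoNoSpaceRecovery()` from a `main` function walks through both variants. The point of the sketch is only that a resume which reports success without clearing the flag leaves the writer unusable.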

Test Plan: Added a test case 'NoSpaceOnWriteWalAndRecovery' in 'db_io_failure_test.cc' that exercises the "no space" error and the subsequent recovery when writing the WAL. Modified 'db_test_util.h' to simulate the "no space" error when appending to the WAL.
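
As a rough picture of how such an error can be simulated in a test, the sketch below wraps an in-memory file with a switch that makes appends fail. It is an assumption about the general technique, not the code added to 'db_test_util.h': `FakeStatus`, `FakeWalFile`, and the `no_space` switch are hypothetical names used only here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for the sketch; not a RocksDB type.
struct FakeStatus {
  bool ok;
  bool no_space;
};

// A fake WAL file whose appends fail while the shared switch is on.
class FakeWalFile {
 public:
  explicit FakeWalFile(const std::atomic<bool>* no_space)
      : no_space_(no_space) {}

  FakeStatus Append(const std::string& data) {
    if (no_space_->load(std::memory_order_acquire)) {
      return {false, true};  // simulated "no space left on device"
    }
    bytes_written_ += data.size();
    return {true, false};
  }
  std::size_t bytes_written() const { return bytes_written_; }

 private:
  const std::atomic<bool>* no_space_;  // owned by the test, not the file
  std::size_t bytes_written_ = 0;
};

// Test-style walkthrough: inject the error, then lift it and retry.
void DemoInjectedNoSpace() {
  std::atomic<bool> no_space{false};
  FakeWalFile wal(&no_space);

  assert(wal.Append("rec1").ok);
  no_space.store(true, std::memory_order_release);  // start injecting
  FakeStatus s = wal.Append("rec2");
  assert(!s.ok && s.no_space);
  no_space.store(false, std::memory_order_release);  // "free" the space
  assert(wal.Append("rec3").ok);
  assert(wal.bytes_written() == 8);  // "rec1" + "rec3", 4 bytes each
}
```

Keeping the switch outside the file object lets the test flip it at any point, which is what allows a single test to cover both the failure and the recovery.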

jsteemann commented 3 weeks ago

Side note: I tried the change to check if it would fix https://github.com/facebook/rocksdb/issues/9762 as well, but it doesn't.