Behavior after "no space left on device" - auto-recovery not working

jsteemann commented 2 years ago

It looks to me that RocksDB's auto-recovery functionality after hitting "no space left on device" does not work with the PessimisticTransactionDB and write policy WRITE_COMMITTED. When the disk runs full and RocksDB gets an error on a filesystem operation, it records that there has been a background error and makes all subsequent write operations fail. That is fine so far. However, RocksDB stays in background error mode forever, even after space is made available on the underlying filesystem. Even calling Resume() on the db object manually does not fix the problem. I have observed this issue in many different versions of RocksDB, including the very latest state of main (which recently had some fixes for auto-recovery). I am very sure that this is not a filesystem issue, as I have observed the problem on many filesystems over the years.

It seems to me that the only possible way to convince RocksDB to reset the background error is to remove the allow_2pc flag from the options. With that change made, auto-recovery actually works fine. Unfortunately the allow_2pc flag is hard-coded when using PessimisticTransactionDB, as https://github.com/facebook/rocksdb/blob/main/utilities/transactions/pessimistic_transaction_db.cc#L293 unconditionally sets the flag to true.

Expected behavior

RocksDB should auto-recover after space has been made available in the underlying filesystem.

Actual behavior

It doesn't. Even calling Resume() on the db object manually does not help.

Steps to reproduce the behavior

The issue can be easily reproduced in many versions of RocksDB, including the very latest state of main. To test, I created a 2GB local tmpfs mount and used it as RocksDB's data directory. I also created a few files containing garbage in that directory, which can be deleted later, after RocksDB reports the ENOSPC error. That way it is easy to test whether RocksDB comes back after space has been made available again:

rm -rf ./tmpfs-dir
mkdir ./tmpfs-dir
chmod -R 777 ./tmpfs-dir
sudo mount -t tmpfs -o size=2048m tmpfs ./tmpfs-dir

dd if=/dev/urandom of=./tmpfs-dir/garbage1 bs=1048576 count=256
dd if=/dev/urandom of=./tmpfs-dir/garbage2 bs=1048576 count=256
dd if=/dev/urandom of=./tmpfs-dir/garbage3 bs=1048576 count=256
dd if=/dev/urandom of=./tmpfs-dir/garbage4 bs=1048576 count=256

After that, start up RocksDB and use a PessimisticTransactionDB and write policy WRITE_COMMITTED. Write enough data into RocksDB that the 2GB tmpfs directory runs full and RocksDB starts reporting ENOSPC errors. Once that has happened, remove the files tmpfs-dir/garbage1 to tmpfs-dir/garbage2. Check that du -hs tmpfs-dir correctly reports the free space, but RocksDB actually keeps reporting the ENOSPC error forever.
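The fill / free / resume / retry sequence described above can be sketched as a small probe. This is only an illustration, not part of the original report: `ProbeResult`, `ProbeRecovery` and the `Store` adapter are hypothetical names, and the probe is written against a minimal interface so that it compiles without RocksDB. For a real run, the adapter's `Put` would presumably forward to `db->Put(rocksdb::WriteOptions(), key, value).ok()` and its `Resume` to `db->Resume().ok()` on a TransactionDB opened with WRITE_COMMITTED.

```cpp
#include <cstddef>
#include <string>

// Outcome of one probe run.
enum class ProbeResult { kNeverFilled, kRecovered, kStillFailing };

// `Store` is a hypothetical adapter type (not a RocksDB class). It must offer:
//   bool Put(const std::string& key, const std::string& value);  // false on failure
//   bool Resume();                                                // false on failure
// `free_space` is any callable that frees disk space, for example by deleting
// the garbage files created in the tmpfs directory.
template <typename Store, typename FreeSpaceFn>
ProbeResult ProbeRecovery(Store& store, std::size_t max_writes,
                          FreeSpaceFn free_space) {
  const std::string value(1 << 20, 'x');  // 1 MiB payload per write

  // Step 1: write until the first failure (expected to be ENOSPC).
  std::size_t i = 0;
  for (; i < max_writes; ++i) {
    if (!store.Put("key" + std::to_string(i), value)) break;
  }
  if (i == max_writes) return ProbeResult::kNeverFilled;

  // Step 2: make space available again.
  free_space();

  // Step 3: ask the store to leave background error mode. The return value
  // is ignored on purpose; the retry below decides the outcome.
  store.Resume();

  // Step 4: a single write must succeed if recovery worked.
  return store.Put("after-recovery", value) ? ProbeResult::kRecovered
                                            : ProbeResult::kStillFailing;
}
```

With the behavior reported in this issue, such a probe would be expected to end in `kStillFailing` for a PessimisticTransactionDB, and in `kRecovered` once allow_2pc is no longer forced to true.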

Auto-recovery can be fixed by applying the following patch, which removes the hard-coding of the allow_2pc flag:

diff --git a/utilities/transactions/pessimistic_transaction_db.cc b/utilities/transactions/pessimistic_transaction_db.cc
index c1e3a2ab2ec..e06b6c73376 100644
--- a/utilities/transactions/pessimistic_transaction_db.cc
+++ b/utilities/transactions/pessimistic_transaction_db.cc
@@ -249,6 +249,10 @@ Status TransactionDB::Open(
   DBOptions db_options_2pc = db_options;
   PrepareWrap(&db_options_2pc, &column_families_copy,
               &compaction_enabled_cf_indices);
+  if (txn_db_options.write_policy == WRITE_PREPARED ||
+      txn_db_options.write_policy == WRITE_UNPREPARED) {
+    db_options_2pc.allow_2pc = true;
+  }
   const bool use_seq_per_batch =
       txn_db_options.write_policy == WRITE_PREPARED ||
       txn_db_options.write_policy == WRITE_UNPREPARED;
@@ -290,7 +294,7 @@ void TransactionDB::PrepareWrap(
       compaction_enabled_cf_indices->push_back(i);
     }
   }
-  db_options->allow_2pc = true;
+//  db_options->allow_2pc = true;
 }

 namespace {

I don't know if using that patch is actually safe (probably it isn't). For now I was just interested in what caused the problem.

Having a working auto-recovery with the PessimisticTransactionDB would be great, because it is very confusing to see RocksDB report ENOSPC errors when there is actually a lot of free disk space. The current behavior has caused lots of operational issues for us over the past few years, so getting rid of it would be a great step forward!

ltamasi commented 2 years ago

Cc @anand1976

jsteemann commented 2 years ago

This bug is still causing us a lot of operational issues and a lot of "unnecessary" support cases, e.g.

customer's disk runs full
RocksDB starts to report "no space left on device" errors (correctly)
monitoring goes off
free space is made available, by customer
RocksDB still reports "no space left on device" although disk again has enough free space
customer gets confused if not angry

Is there any way forward to get out of this bad situation at some point? I would love to see this fixed or at least mitigated somehow. Thanks!

jsteemann commented 2 years ago

Any idea if/when this issue can be investigated or even fixed? Thanks!

anand1976 commented 2 years ago

@jsteemann Sorry for the delay! I'll get back to you early next week.

neunhoef commented 2 years ago

@anand1976 Is there any progress on this? It is bothering us a lot with customer databases in ArangoDB.

anand1976 commented 2 years ago

The error recovery behavior when 2pc is enabled seems overly restrictive - https://github.com/facebook/rocksdb/blob/main/db/error_handler.cc#L513. For the short-term, we could relax it so certain types of errors, such as errors when writing an SST file or MANIFEST can be recovered from.

For WAL write errors when allow_2pc is true, recovery may be a bit more tricky. Cc @riversand963 for his opinion on how it should be handled. How should we handle transactions with prepare records in the WAL, but haven't committed yet and we cannot guarantee durability of the WAL?
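To make the short-term proposal above concrete, here is a toy decision table. It is a sketch under stated assumptions, not code from error_handler.cc: `ErrorSource`, `CanAutoRecoverObserved` and `CanAutoRecoverRelaxed` are hypothetical names, and the "observed" function only encodes what this thread reports for no-space errors (no recovery while allow_2pc is set, working recovery without it).

```cpp
// Hypothetical classification of where a background write error occurred.
enum class ErrorSource { kSstWrite, kManifestWrite, kWalWrite };

// Behavior as reported in this thread for no-space errors: with allow_2pc
// set, the background error is never cleared, whatever its source.
bool CanAutoRecoverObserved(bool allow_2pc, ErrorSource /*source*/) {
  return !allow_2pc;
}

// Short-term relaxation suggested above: SST and MANIFEST write errors become
// recoverable even with allow_2pc; WAL write errors under 2PC stay
// unrecoverable until the durability question for prepared transactions
// is settled.
bool CanAutoRecoverRelaxed(bool allow_2pc, ErrorSource source) {
  if (!allow_2pc) return true;
  return source != ErrorSource::kWalWrite;
}
```

The only difference between the two functions is the SST/MANIFEST case under allow_2pc, which is the part described above as recoverable in the short term.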

jsteemann commented 1 year ago

Hi everyone, it has been a while, but the issue is still unresolved/uncommented. We would really appreciate it if this could be addressed for one of the upcoming RocksDB releases. Thanks!

asifkazi commented 1 year ago

Hey guys circling back on this issue, it's just a painful and fixable issue I would think. Is there any planned resolution and ETA on this?

jsteemann commented 1 year ago

Hi everyone, is there anything that can be done about this issue? It is still causing us lots of problems that auto-recovery does not work with PessimisticTransactionDB even in WRITE_COMMITTED mode. Thanks!

anmolmore commented 3 months ago

Faced same issue, two years and no solution

arangodb-tom commented 1 month ago

This issue keeps reoccurring. Is there an update when it can be investigated and hopefully resolved? Thank you!
