Lendfating / leveldb

Automatically exported from code.google.com/p/leveldb
BSD 3-Clause "New" or "Revised" License

Possible bug: Missing a fsync() on the ".log" file before compaction #187

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
This bug is about leveldb not behaving "as expected" after a power failure 
on a well-configured Linux machine (ext4 filesystem, barriers on, and the 
hard drive cache disabled).

I'm not sure if the "expectation" is correct though, so I'm mentioning it here. 
Consider an application that calls Put() twice, serially: { 
Put(key-value1); Put(key-value2); }. If a power loss happens at any point in 
time, then after recovery, if key-value2 exists in the database, key-value1 
SHOULD also exist.

If the expectation I assumed is wrong, please ignore the rest of this bug 
report.
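
For concreteness, here is a minimal sketch of that scenario against the leveldb C++ API (the database path, keys, and values are illustrative):

```cpp
#include <cassert>
#include "leveldb/db.h"

int main() {
  leveldb::DB* db = nullptr;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/testdb", &db);
  assert(s.ok());

  // Two serial writes with the default (asynchronous) WriteOptions.
  db->Put(leveldb::WriteOptions(), "key1", "value1");
  db->Put(leveldb::WriteOptions(), "key2", "value2");

  // Expectation under test: if, after a power loss and recovery, "key2"
  // is present in the database, then "key1" should be present as well.
  delete db;
  return 0;
}
```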

Background to the problem: 

When the log file grows beyond a threshold size, it seems like LevelDB creates 
a new log file and continues adding newer Put() requests to the new log file. 
The old log file is compacted into an ".sst" file in the background. The log 
files are sync()-ed only if the Put() requests are synchronous. If requests are 
asynchronous, the logs are never flushed explicitly by LevelDB.
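
In API terms, "synchronous" here refers to the WriteOptions::sync flag. A small sketch of the distinction (error handling omitted):

```cpp
#include "leveldb/db.h"

// Assumes `db` is an already-open leveldb::DB*.
void write_examples(leveldb::DB* db) {
  // Asynchronous write: returns once the data is in the OS buffer cache;
  // leveldb does not sync the log file for this request.
  db->Put(leveldb::WriteOptions(), "k1", "v1");

  // Synchronous write: sync = true makes leveldb sync the current log file
  // before the call returns.
  leveldb::WriteOptions sync_opts;
  sync_opts.sync = true;
  db->Put(sync_opts, "k2", "v2");
}
```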

Given the right timing, if a power loss happens, it is possible that the older 
log file was not flushed to the disk (and had still not been compacted), but 
the newer log file was flushed. Now, in practice, if all Put() requests are 
asynchronous, I'm not sure if this would happen given current Linux dirty page 
flushing policies. However, if all Put()-s to the older log file are 
asynchronous, but the ones to the newer file are synchronous, this will happen.

If the old log file never goes to disk, but the new one ends up on disk, after 
recovery from the power-failure, leveldb will end up silently restoring only 
the Put()-s that went to the new file (while discarding those to the old file). 
This violates my expectation.

I'm not sure if a similar bug would affect other levels of compaction. I have 
only looked at level-0. But, I don't expect it would.

What steps will reproduce the problem?
1. Use a Linux machine with ext4 (default mount options). Modify the leveldb 
source code so that the background compaction thread does a big sleep() call 
before updating the MANIFEST file.
2. Create a LevelDB database on a partition that is unused by other 
applications. Design a workload that issues a lot of asynchronous Put() 
requests, until the current log file gets filled up, and then issues one 
synchronous Put() request, such that the request goes to a new log file (a 
sketch of such a workload follows these steps). Run the workload on the 
created database.
3. As soon as the workload finishes running, within the next 5 seconds (I think 
you can actually do it within the next 30 seconds), switch off the machine by 
pulling the power cord. After rebooting the machine, make LevelDB open the 
database and list all key-value pairs in the database.
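
A sketch of the workload from step 2 and the post-reboot listing from step 3 (paths, keys, and the fill loop are illustrative; the default write buffer is roughly 4 MB, so ~10 MB of writes should force a log switch):

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include "leveldb/db.h"

int main(int argc, char** argv) {
  leveldb::DB* db = nullptr;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status s = leveldb::DB::Open(options, "/mnt/test/db", &db);
  assert(s.ok());

  if (argc > 1 && std::string(argv[1]) == "--list") {
    // Step 3: after the reboot, list every key-value pair that survived.
    leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
    for (it->SeekToFirst(); it->Valid(); it->Next()) {
      std::printf("%s -> %s\n", it->key().ToString().c_str(),
                  it->value().ToString().c_str());
    }
    delete it;
  } else {
    // Step 2: many asynchronous writes, enough to fill the current log
    // file, ...
    std::string value(1000, 'x');
    for (int i = 0; i < 10000; i++) {
      db->Put(leveldb::WriteOptions(), "async-key-" + std::to_string(i), value);
    }
    // ... followed by one synchronous write that should land in a new log file.
    leveldb::WriteOptions sync_opts;
    sync_opts.sync = true;
    db->Put(sync_opts, "sync-key", "sync-value");
  }

  delete db;
  return 0;
}
```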

What is the expected output? What do you see instead?
Expected output: LevelDB lists all the key-value pairs, including that of the 
last synchronous Put() operation.

Observed output: Leveldb does list the pair corresponding to the last 
synchronous operation, but does not list older pairs.

What version of the product are you using? On what operating system?
LevelDB-1.12.0. Fedora 17.

Please provide any additional information below.
This bug is not a Linux issue, although I have described it that way. It 
probably occurs on many file systems and OS configurations.

Original issue reported on code.google.com by madthanu@gmail.com on 17 Jul 2013 at 3:12

GoogleCodeExporter commented 9 years ago
The easy answer here is to say that if a client doesn't want this to happen, it 
shouldn't request an async write followed by a sync write. Further, there could 
be scenarios where the described behavior is desired: the first write is 
considered "unimportant" and the client prefers to risk losing it rather than 
waiting for the disk sync, but the second write is "important" and worth the 
performance hit to ensure it is not lost.

Who knows how common such a scenario is, but I think it's worth pointing out. 
There's also the question of whether offering clients such flexibility is worth 
the confusion that you encountered.

Original comment by dgrogan@chromium.org on 17 Jul 2013 at 6:58

GoogleCodeExporter commented 9 years ago
I'm usually wary of the exact guarantees given by any storage library when it 
comes to things like these, but here's the statement in the documentation that 
led me astray: "A hybrid scheme is also possible where every Nth write is 
synchronous, and in the event of a crash, the bulk load is restarted just after 
the last synchronous write finished by the previous run. (The synchronous write 
can update a marker that describes where to restart on a crash.)"

As I understand it, that "scheme" just won't work unless it is guaranteed that 
the async writes are ordered before the synchronous writes. 
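
For reference, the documented hybrid scheme would look roughly like this (a sketch; the progress-marker key, batch size, and function name are illustrative). The concern above is that the asynchronous writes before a marker may not be on disk even though the synchronous marker write is:

```cpp
#include <string>
#include <utility>
#include <vector>
#include "leveldb/db.h"
#include "leveldb/write_batch.h"

// Every Nth write is synchronous and carries a progress marker in the same
// batch, so a crashed bulk load can be restarted just after the position
// recorded by the last synchronous write.
void bulk_load(leveldb::DB* db,
               const std::vector<std::pair<std::string, std::string>>& records,
               size_t start, size_t N) {
  for (size_t i = start; i < records.size(); i++) {
    if ((i + 1) % N == 0) {
      leveldb::WriteBatch batch;
      batch.Put(records[i].first, records[i].second);
      batch.Put("bulk-load-progress", std::to_string(i));  // restart marker
      leveldb::WriteOptions sync_opts;
      sync_opts.sync = true;
      db->Write(sync_opts, &batch);
    } else {
      db->Put(leveldb::WriteOptions(), records[i].first, records[i].second);
    }
  }
}
```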

Also, I assumed (I have no idea where I got that assumption from; it was 
probably from some old emails on the mailing list) that:
(1) given a couple of asynchronous Put()s that are all going to the same log 
file, and,
(2) if a power-loss happens and some of them hadn't actually reached the disk, 
then,
(3) during recovery LevelDB detects those Put()s that have not gone to the 
disk, and *discards* all future Put()s, to ensure that the ordering of the Put 
requests is maintained.

Is that not true?

I also have some opinion on what the default guarantee should be, particularly 
about performance, but I guess I should rather post that in the mailing list, 
right? (Or is this the right place to do it?)

Original comment by madthanu@gmail.com on 17 Jul 2013 at 8:23

GoogleCodeExporter commented 9 years ago
I had forgotten about that suggestion.

A nonsurgical fix for this issue would be to add a `logfile_->Sync()` just 
before `delete logfile_` near the end of DBImpl::MakeRoomForWrite.
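
As a sketch of where that call would sit (this approximates the log-switch path of DBImpl::MakeRoomForWrite in leveldb 1.12; the surrounding member names and structure may differ slightly from the actual source):

```cpp
// Inside DBImpl::MakeRoomForWrite, on the path that switches to a new log
// file and hands the old memtable to background compaction (approximate):
uint64_t new_log_number = versions_->NewFileNumber();
WritableFile* lfile = NULL;
s = env_->NewWritableFile(LogFileName(dbname_, new_log_number), &lfile);
if (s.ok()) {
  logfile_->Sync();             // proposed fix: persist the old log file ...
  delete log_;
  delete logfile_;              // ... before dropping our handle to it
  logfile_ = lfile;
  logfile_number_ = new_log_number;
  log_ = new log::Writer(lfile);
  imm_ = mem_;                  // old memtable becomes immutable,
  has_imm_.Release_Store(imm_);
  mem_ = new MemTable(internal_comparator_);
  mem_->Ref();
  MaybeScheduleCompaction();    // and is compacted in the background
}
```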

As for your hypothetical steps, I don't think that's how it works. If those 
Put()s hadn't made it to the log file on disk, how would LevelDB know about 
them during recovery after a reboot?

Original comment by dgrogan@chromium.org on 18 Jul 2013 at 12:18

GoogleCodeExporter commented 9 years ago
As far as I understand the source code, that should probably be the easiest fix.

The hypothetical steps: Well, I was assuming that, since there are checksums 
for records in the log file, they'd be checked when you recover. If a particular user 
record hadn't gone to disk fully, the checksum wouldn't match, and you can 
detect and discard that record. Just out of curiosity, is at least this much of 
my assumption correct? Or are there chances that partly-formed records might be 
recovered as well? I was also assuming that you'd discard all future records, 
but I am obviously wrong on that (based on your views).

I should note here that if my hypothetical "discard every future record" steps 
are wrong, and somehow LevelDB does recover every checksum-valid record, there 
is a chance (albeit very tiny in the real world) that the "scheme" from the 
documentation does not work. Not sure ... just mentioning. I can imagine the 
Nth synchronous write fully going to disk (even before the msync completely 
returns), but the previous asynchronous record (in the same log file) not 
having persisted during the power-failure.

Original comment by madthanu@gmail.com on 18 Jul 2013 at 12:40

GoogleCodeExporter commented 9 years ago
In version 1.13, a SyncDirIfManifest() method was added and is called by Sync().

SyncDirIfManifest() fsyncs the current directory if the target file's name 
starts with "MANIFEST".
As I understand it, the manifest file is updated at every compaction.
This means that, with this change, the old log file would not go missing after 
a power loss.
I think this change can at least prevent losing the old log file entirely.

Original comment by yongilj...@gmail.com on 18 Jul 2014 at 12:30

GoogleCodeExporter commented 9 years ago
If I understand it correctly, SyncDirIfManifest() does an fsync() on the 
directory. This issue is not about such an fsync(); it is about an fsync() on 
the log file itself. Particularly, when switching log files, the new log file 
might be written to (and pushed to the disk by the OS) without the old log file 
ever being pushed to disk (since no fsync() was done on it).
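
For clarity, these are the two different fsync() targets being discussed (a POSIX sketch; paths are illustrative and error handling is omitted):

```cpp
#include <fcntl.h>
#include <unistd.h>

void sync_examples() {
  // 1. fsync() on the log file itself: makes the *contents* of
  //    000123.log durable. This is what the issue asks for before the
  //    old log file is handed off to compaction.
  int log_fd = open("/mnt/test/db/000123.log", O_WRONLY);
  fsync(log_fd);
  close(log_fd);

  // 2. fsync() on the directory (what SyncDirIfManifest() does when a
  //    MANIFEST file is synced): makes the *directory entries* (newly
  //    created or removed file names) durable, but not the bytes inside
  //    those files.
  int dir_fd = open("/mnt/test/db", O_RDONLY);
  fsync(dir_fd);
  close(dir_fd);
}
```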

That said, the issue might be fixed in the current version; I am not sure. 
When a Put() results in a log compaction, can it return before the old log file 
gets compacted? Just looking at a simple strace, it does seem to.

Original comment by madthanu@gmail.com on 23 Jul 2014 at 3:20

GoogleCodeExporter commented 9 years ago
@madthanu
Yes, you are right.

My point was that we can still lose some of the data in a log file if we don't 
use the "sync: true" option, but not all of the data in the log file.

My comment was about losing files whose directory entries have not been 
flushed. During compaction, the old log file is removed and a new .sst file is 
created. But if we don't call fsync() on the directory, files can go missing 
from the directory, because this changed information is never flushed.

In this issue, you mentioned losing the contents of old log files. However, I 
think that is caused by the new .sst file: the .sst file is created but not 
flushed to the directory, so it can be lost on power-off.
(Why is the old log file also removed? Hmmmm, I don't know...)

Original comment by yongilj...@gmail.com on 10 Aug 2014 at 11:31