cberner / redb

An embedded key-value database in pure Rust
https://www.redb.org
Apache License 2.0

Fix theoretical corruption after fsync failure #450

Closed: cberner closed this issue 1 year ago

cberner commented 1 year ago

See #445 and #299

Summary of my understanding of the issue:

1) read() and reads from mmap()'ed memory may return any value that was written since the last successful fsync()/msync(), and cannot be assumed to return the latest value. The granularity is an OS page, so values that cross pages may be torn.
2) If fsync() reports a failure (even in a different process), all pages written since the last successful fsync() must be assumed to have become poisoned, and they must be explicitly re-initialized by writing to them.

Corollaries:

1) Constructing a slice from an mmap'ed pointer is only safe if that memory is clean (successfully fsync()'ed); otherwise it is immediately UB.
2) Reads of dirty data (not yet fsync'ed) must be served from a buffer in userspace.
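The second corollary can be sketched in Python (a hypothetical `BufferedFile` helper for illustration, not redb's actual code): writes keep a userspace copy, reads of dirty pages are served from that copy rather than trusting the page cache, and the copies are only forgotten once fsync() succeeds.

```python
import os

PAGE = 4096

class BufferedFile:
    """Sketch of the corollary above: never read not-yet-fsynced data
    back through the kernel; serve it from a userspace buffer instead.
    Hypothetical helper name, assumed page-granularity I/O."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        self.dirty = {}  # page number -> bytes written since the last good fsync

    def write_page(self, page_no, data):
        assert len(data) == PAGE
        os.pwrite(self.fd, data, page_no * PAGE)
        self.dirty[page_no] = data              # keep our own copy

    def read_page(self, page_no):
        if page_no in self.dirty:               # dirty: don't trust the page cache
            return self.dirty[page_no]
        return os.pread(self.fd, PAGE, page_no * PAGE)  # clean: kernel is fine

    def commit(self):
        os.fsync(self.fd)                       # raises OSError on failure
        self.dirty.clear()                      # only now is the data durable
```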

Using O_DIRECT might get around some of these issues, because it bypasses the page cache. However, combining it with mmap likely has undefined behavior.
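For illustration, a Linux-specific Python sketch of an O_DIRECT read (`read_direct` is a hypothetical helper): O_DIRECT requires block-aligned buffers, offsets, and lengths, and an anonymous mmap provides a page-aligned buffer. Some filesystems (e.g. tmpfs) reject O_DIRECT entirely, which the sketch reports as unsupported.

```python
import mmap
import os

def read_direct(path, length=4096, offset=0):
    """Sketch: read a file bypassing the page cache with O_DIRECT (Linux).
    Returns the bytes read, or None where O_DIRECT is unsupported or the
    alignment constraints are not met. Hypothetical helper name."""
    buf = mmap.mmap(-1, length)                 # anonymous mmap is page-aligned
    try:
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    except OSError:
        return None                             # e.g. tmpfs: O_DIRECT not supported
    try:
        n = os.preadv(fd, [buf], offset)        # read directly into aligned buffer
        return buf[:n]
    except OSError:
        return None                             # e.g. EINVAL on misalignment
    finally:
        os.close(fd)
```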

casey commented 1 year ago

From #445:

For example, all three file systems mark pages clean after fsync fails, rendering techniques such as application-level retry ineffective.

This is quite weird to me. If the pages were still marked dirty, the application could just retry the fsync and if it succeeded all would be well. This makes me wonder if marking the pages clean somehow simplifies the implementation, makes it faster, etc.

However, the content in said clean pages varies depending on the file system; ext4 and XFS contain the latest copy in memory, while btrfs reverts to the previous consistent state.

I suppose the on-disk state is totally undefined, so you have to assume it's some mix of pre- and post-write blocks. For 2PC, can't you just return an error from the commit after a failed fsync? But also, if there's an fsync in the middle of a transaction (not just on commit), then any outstanding references in Rust code would be invalid, so that would be insta UB.

I'm pretty sure this means the 1PC+C strategy won't work, because if fsync can flush some blocks but roll back others, it could flush the transaction block but roll back some or all of the data blocks :/

Yah, yikes.

cberner commented 1 year ago

This is quite weird to me. If the pages were still marked dirty, the application could just retry the fsync and if it succeeded all would be well. This makes me wonder if marking the pages clean somehow simplifies the implementation, makes it faster, etc.

Ya, the Postgres devs thought that too :) I watched this talk about it, which is pretty good. I found the answer to this: the most common failure case is people pulling out USB drives, so keeping the dirty pages around would lead to a permanent memory leak, since the pages can't be flushed without the drive present.

The thing that doesn't make sense to me though is why the pages are marked clean and kept in the page cache. It seems like they should be evicted from the page cache to avoid poisoning it.

But also, if there's an fsync in the middle of a transaction (not just on commit) then any outstanding references in rust code would be invalid, so that would be insta UB

redb only fsync's at the end of a transaction, but ya I'm starting to wonder whether mmap'ed memory can be arbitrarily rolled back if there's memory pressure and flushing to disk fails. From that LWN article it looks like the Postgres devs were encouraged to use O_DIRECT and never use the page cache.

For 2PC, can't you just return an error from the commit after a failed fsync?

I don't think that's sufficient. If the on-disk data and the page cache are silently out of sync, you have to assume all written pages are poisoned and need to be explicitly rewritten so that the page cache is back in sync with the on-disk data. Even worse, there's discussion about how you can't even assume the page cache is working correctly when you first open() a file, because some other program could have had a failure and poisoned it. For example, using cp to copy a redb database file, hitting a failure, and then opening it with redb could be UB :exploding_head:
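The recovery rule described above can be sketched like this (a hypothetical `commit` helper, assuming the caller keeps a userspace copy of every dirty page): retrying fsync alone is useless because the kernel has already marked the pages clean, so the pages must be explicitly rewritten first.

```python
import os

def commit(fd, dirty_pages, page_size=4096):
    """Sketch: on fsync failure, rewrite every page touched since the last
    successful fsync from a userspace copy, then retry. `dirty_pages` maps
    page number -> bytes (hypothetical bookkeeping, not redb's API)."""
    try:
        os.fsync(fd)
    except OSError:
        # The page cache contents are now suspect: force the cache and the
        # on-disk data back in sync by rewriting from our own copies.
        for page_no, data in dirty_pages.items():
            os.pwrite(fd, data, page_no * page_size)
        os.fsync(fd)   # a second failure here should abort the transaction
    dirty_pages.clear()
```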

cberner commented 1 year ago

Ah, it looks like posix_fadvise(POSIX_FADV_DONTNEED) can be used to evict a file's pages from the page cache. The Linux implementation calls into invalidate_mapping_pages here: https://elixir.bootlin.com/linux/v6.0.10/source/mm/fadvise.c#L168
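A minimal sketch of that call in Python (`drop_page_cache` is a hypothetical helper; the advice is purely that, advisory, and dirty pages are not dropped, so you'd fsync first in practice):

```python
import os

def drop_page_cache(path):
    """Sketch: ask the kernel to evict a file's clean pages from the page
    cache via posix_fadvise(POSIX_FADV_DONTNEED). Linux-specific behavior;
    guarded because os.posix_fadvise is not available everywhere."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            # offset=0, length=0 means "the whole file"
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Eviction is invisible to correctness for clean data: subsequent reads simply refill the cache from disk.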

casey commented 1 year ago

but ya I'm starting to wonder whether mmap'ed memory can be arbitrarily rolled back if there's memory pressure and flushing to disk fails.

That seems like it just can't be true. If it was, how could using mmapped memory ever be safe? (On the other hand, so much is broken and weird so who knows.)

cberner commented 1 year ago

fuse doesn't seem to support mmap (or maybe the fuser crate just doesn't), so this doesn't 100% prove it, but the attached example shows that you can definitely poison the page cache so that read() and a read() through an O_DIRECT descriptor return different values, and that you can change the value returned by read() by flushing the page cache. Since mmap uses the page cache, it seems highly likely that mmap'ed memory can mutate during a page flush.

page_cache_poisoning.zip