internal/record: use direct I/O for the WAL and MANIFEST

jbowens commented 3 years ago

WAL and MANIFEST might be good candidates for using direct I/O. The LogWriter already handles organizing writes into contiguous blocks. I'm not sure what impact, if any, on performance direct I/O would have.
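For context on what "contiguous blocks" would have to satisfy: direct I/O generally requires that the buffer address and the write length be multiples of the device's logical block size. Below is a minimal Go sketch of that alignment step. The 4096-byte `blockSize` and the helper names (`alignedBuffer`, `roundUp`, `padToBlock`, `isAligned`) are assumptions for illustration and not Pebble's API; real code would need to determine the alignment for the actual device.

```go
package main

import "unsafe"

// blockSize is an assumed alignment requirement. Real code would need to
// determine the correct value for the underlying device and filesystem.
const blockSize = 4096

// alignedBuffer returns a slice of length n whose first byte sits on a
// blockSize boundary. It over-allocates by one block and slices forward
// to the first aligned offset.
func alignedBuffer(n int) []byte {
	raw := make([]byte, n+blockSize)
	addr := uintptr(unsafe.Pointer(&raw[0]))
	off := int((blockSize - addr%blockSize) % blockSize)
	return raw[off : off+n : off+n]
}

// roundUp rounds n up to the next multiple of blockSize.
func roundUp(n int) int {
	return (n + blockSize - 1) / blockSize * blockSize
}

// padToBlock copies payload into an aligned buffer whose length is a
// multiple of blockSize. The tail is zero-filled.
func padToBlock(payload []byte) []byte {
	buf := alignedBuffer(roundUp(len(payload)))
	copy(buf, payload)
	return buf
}

// isAligned reports whether b starts on a blockSize boundary.
func isAligned(b []byte) bool {
	if len(b) == 0 {
		return true
	}
	return uintptr(unsafe.Pointer(&b[0]))%blockSize == 0
}
```

A writer that already emits fixed-size blocks would mostly need the buffer-address half of this; the padding only matters for the final, partially filled block.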

I do think it would allow us to retry failed syncs during WAL and MANIFEST writes (see the Fatalf in DB.Apply and the Fatalfs in logAndApply).

Currently, errors in these codepaths are fatal because fsyncs of OS-buffered files cannot be retried. After a failed fsync the OS marks the errored buffers as clean, so a retried fsync will not write them out and the file's on-disk contents remain unchanged regardless of the retry. See https://wiki.postgresql.org/wiki/Fsync_Errors
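To make the failure mode concrete, here is a toy Go model of it. This is a simulation only: `toyFile` and its fields are invented for illustration and do not reflect real kernel internals or any Pebble type. It shows how a second `Sync` can return nil while the record never becomes durable.

```go
package main

import "errors"

// errSync is the injected failure used by the toy model.
var errSync = errors.New("simulated fsync failure")

// toyFile is a toy model of an OS-buffered file. It is not real kernel
// behavior, only an illustration of the semantics described above.
type toyFile struct {
	dirty    [][]byte // pages written but not yet durable
	durable  [][]byte // pages that reached "disk"
	failNext bool     // inject a failure on the next Sync
}

// Write buffers a copy of p as a dirty page.
func (f *toyFile) Write(p []byte) {
	f.dirty = append(f.dirty, append([]byte(nil), p...))
}

// Sync flushes dirty pages. On an injected failure the dirty pages are
// marked clean (dropped here), so a later Sync has nothing left to flush.
func (f *toyFile) Sync() error {
	if f.failNext {
		f.failNext = false
		f.dirty = nil
		return errSync
	}
	f.durable = append(f.durable, f.dirty...)
	f.dirty = nil
	return nil
}
```

In this model, writing one record with `failNext` set, then calling `Sync` twice, yields an error followed by nil, with `durable` still empty. That is why treating the retry's success as durability would be unsafe.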

This was motivated by thinking about @sumeerbhola's automated ballast file suggestion for detecting out-of-disk conditions. It's a really nice solution. The one sticking point is that an ENOSPC may occur during fsync. I'm not sure under what conditions ENOSPC may surface from fsync rather than the preceding write, but I suspect it may happen when the filesystem needs to allocate new metadata blocks. I'm not sure, but maybe on copy-on-write filesystems all block allocations happen during fsync?
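As a rough sketch of how a retry could combine with the ballast idea, assuming the ENOSPC surfaces from an operation that is safe to retry (which is the open question above): the Go snippet below retries once after freeing the ballast. `writeWithBallast`, `errNoSpace`, and the two callbacks are hypothetical names for illustration, not existing Pebble functions.

```go
package main

import (
	"errors"
	"syscall"
)

// errNoSpace is the out-of-disk error this sketch reacts to.
var errNoSpace error = syscall.ENOSPC

// writeWithBallast is a hypothetical helper. It runs write once; if that
// fails with ENOSPC it calls releaseBallast (for example, deleting a
// preallocated ballast file) and retries the write a single time.
// Any other error is returned unchanged.
func writeWithBallast(write func() error, releaseBallast func() error) error {
	err := write()
	if !errors.Is(err, errNoSpace) {
		return err
	}
	if rerr := releaseBallast(); rerr != nil {
		return rerr
	}
	return write()
}
```

This only makes sense when the failed operation can be repeated safely, which is the property direct I/O is hoped to provide here and which a failed fsync on a buffered file does not.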

Jira issue: PEBBLE-211

petermattis commented 3 years ago

> WAL and MANIFEST might be good candidates for using direct I/O. The LogWriter already handles organizing writes into contiguous blocks. I'm not sure what impact, if any, on performance direct I/O would have.

I did testing a while ago which showed that direct I/O writes had similar sync latency to recycled log fsyncs. The latency was actually a few percent better for direct I/O, so there would be a perf win, but not a dramatic one. Direct I/O also uses less CPU. "Direct I/O writes: the best way to improve your credit score" is an interesting recent blog post on this topic.

petermattis commented 3 years ago

https://github.com/cockroachdb/pebble/issues/41#issuecomment-484259680 is a useful comment on the intricacies of direct I/O.

jbowens commented 2 years ago

I've seen it mentioned a few times that in some systems an fsync on a file waits for all dirty pages to be flushed, not just those of the file that the fsync was requested on. There's some reference to it here. Direct I/O for the WAL seems like it would insulate WAL commits from latency spikes due to a glut of dirty pages elsewhere.

sumeerbhola commented 1 year ago

Also see the discussion, and the links to related discussions, in https://github.com/cockroachdb/cockroach/issues/88442#issuecomment-1302716873

jbowens commented 1 year ago

There's an interaction with disk stall detection that may have been obvious to others but eluded me. When we write through the page cache, I/O may occur outside the context of a Cockroach syscall. If a background kernel thread is performing the writeback and stalls, Cockroach is oblivious. Direct I/O would ensure all of Cockroach's I/O is directly timed.

Maybe this distinction is insignificant, because eventually Cockroach should always issue a timed fsync which should block on the in-progress writeback.
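For illustration, "timed" here means something along the lines of the Go sketch below: a timer is armed around the caller's own operation and a callback fires if the operation has not returned within a threshold. `timedOp` and its parameters are hypothetical and this is not the actual stall detector. The point it illustrates is that only work done inside `op` is covered, so kernel writeback that happens outside any such call is invisible to it.

```go
package main

import "time"

// timedOp is a hypothetical helper. It runs op and, if op has not
// returned within threshold, invokes onStall once from a separate
// goroutine. It does not cancel op; it only reports the stall.
func timedOp(threshold time.Duration, op func() error, onStall func()) error {
	timer := time.AfterFunc(threshold, onStall)
	// Stop the timer so onStall does not fire after op has completed.
	defer timer.Stop()
	return op()
}
```

A caller might wrap a write or sync in `timedOp` with a threshold of a few seconds and log or escalate from `onStall`.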

cockroachdb / pebble

internal/record: use direct I/O for the WAL and MANIFEST #1159