Open jbowens opened 3 years ago
WAL and MANIFEST might be good candidates for using direct I/O. The LogWriter already handles organizing writes into contiguous blocks. I'm not sure what impact, if any, direct I/O would have on performance.
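For context on why block-organized writes matter here: direct I/O generally requires the buffer address, length, and file offset to be aligned to the device's logical block size. The following is a minimal sketch of preparing such a buffer in Go. The 4096-byte `blockSize` and the helper names (`alignedBuf`, `padToBlock`, `isAligned`) are assumptions for illustration and are not taken from Pebble.

```go
package main

import (
	"fmt"
	"unsafe"
)

// blockSize is an assumed alignment; real devices may require 512 or 4096 bytes.
const blockSize = 4096

// alignedBuf returns a slice of length n whose starting address is a
// multiple of blockSize, by over-allocating and slicing at an offset.
func alignedBuf(n int) []byte {
	raw := make([]byte, n+blockSize)
	off := 0
	if rem := int(uintptr(unsafe.Pointer(&raw[0])) % blockSize); rem != 0 {
		off = blockSize - rem
	}
	return raw[off : off+n : off+n]
}

// padToBlock rounds n up to the next multiple of blockSize.
func padToBlock(n int) int {
	return (n + blockSize - 1) / blockSize * blockSize
}

// isAligned reports whether b starts on a blockSize boundary.
func isAligned(b []byte) bool {
	return len(b) == 0 || uintptr(unsafe.Pointer(&b[0]))%blockSize == 0
}

func main() {
	payload := []byte("wal record")
	buf := alignedBuf(padToBlock(len(payload)))
	copy(buf, payload)
	// On Linux, a buffer like this could be passed to a write on a file
	// opened with syscall.O_DIRECT; that step is omitted for portability.
	fmt.Println(len(buf), isAligned(buf))
}
```

A writer that already emits fixed-size contiguous blocks only needs the padding step for the final partial block, which is part of why a block-oriented log writer looks like a natural fit.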
I did testing a while ago which showed that direct I/O writes had similar sync latency to recycled log fsyncs. The latency was actually a few percent better for direct I/O, so there would be a perf win, but not a dramatic one. Direct I/O also uses less CPU. "Direct I/O writes: the best way to improve your credit score" is an interesting recent blog post on this topic.
https://github.com/cockroachdb/pebble/issues/41#issuecomment-484259680 is a useful comment on the intricacies of direct I/O.
I've seen it mentioned a few times that in some systems an fsync on a file waits for all dirty pages to be flushed, not just those of the file that the fsync was requested on. There's some reference to it here. Direct I/O for the WAL seems like it would insulate WAL commits from latency spikes due to a glut of dirty pages elsewhere.
Also see the discussion and related links to other discussions on https://github.com/cockroachdb/cockroach/issues/88442#issuecomment-1302716873
There's an interaction with disk stall detection that may have been obvious to others but eluded me. When we write through the page cache, I/O may occur outside the context of a Cockroach syscall. If a background kernel thread is performing the writeback and stalls, Cockroach is oblivious. Direct I/O would ensure all of Cockroach's I/O is directly timed.
Maybe this distinction is insignificant, because eventually Cockroach should always issue a timed fsync, which should block on the in-progress writeback.
I do think direct I/O would allow us to retry failed syncs during WAL and MANIFEST writes: DB.Apply Fatalf, logAndApply Fatalfs
Currently, errors in these codepaths are fatal because fsyncs of OS-buffered files cannot be retried. The OS marks errored buffers as clean, meaning a retried fsync will not sync the buffer and the file's contents remain unchanged regardless of a retry. https://wiki.postgresql.org/wiki/Fsync_Errors
This was motivated by thinking about @sumeerbhola's automated ballast file suggestion for detecting out-of-disk conditions. It's a really nice solution. The one sticking point is that an ENOSPC may occur during fsync. I'm not sure under what conditions ENOSPC may surface from fsync rather than the preceding write, but I suspect it may happen when the filesystem needs to allocate new metadata blocks. I'm not sure, but maybe on copy-on-write file systems all block allocations happen during fsync?
Jira issue: PEBBLE-211