axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Ability to pass file range to IORING_OP_FSYNC is undocumented #990

Closed: RedBeard0531 closed this 1 month ago

RedBeard0531 commented 11 months ago

While looking through the implementation in the kernel, I noticed that the code takes the off and len fields from the SQE and passes them to vfs_fsync_range. I'm not an expert in the Linux kernel, but since it calls vfs_fsync_range rather than sync_file_range, I'm hopeful that it avoids the pitfall that makes the latter completely unusable for durability. This is, however, completely undocumented, so I'd be a bit nervous relying on it. Was it a documentation oversight, or was it intentionally undocumented because it shouldn't be used? If intentional, should the code be removed from the kernel? If an oversight, should it also be exposed from liburing?
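To make the question concrete, here is a minimal sketch of what filling in such an SQE could look like. It uses only the kernel's uapi header, and `prep_fsync_range` is a hypothetical helper name, not something liburing provides. With liburing, I assume the equivalent is calling `io_uring_prep_fsync()` and then assigning `sqe->off` and `sqe->len` afterwards, since the prep helper zeroes both. As far as I can tell from the kernel source, leaving both at 0 syncs the whole file.

```c
#include <assert.h>
#include <string.h>
#include <linux/io_uring.h>

/* Hypothetical helper (not part of liburing): fill an SQE for a ranged fsync.
 * Assumption: the kernel reads sqe->off and sqe->len for IORING_OP_FSYNC and
 * hands the resulting range to vfs_fsync_range(), as described above. */
void prep_fsync_range(struct io_uring_sqe *sqe, int fd,
                      unsigned fsync_flags, __u64 off, __u32 len)
{
    assert(sqe != NULL);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_FSYNC;
    sqe->fd = fd;
    sqe->fsync_flags = fsync_flags; /* 0, or IORING_FSYNC_DATASYNC */
    sqe->off = off;                 /* start of the range to sync */
    sqe->len = len;                 /* length of the range; a 32-bit field */
}
```

Note that `len` is a 32-bit field in the SQE, so a single ranged fsync built this way cannot describe more than 4 GiB.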

For context, I'm interested in this for forcing durability in an append-only* write-ahead log. The application is constantly pushing more data onto the file using IORING_OP_WRITEV, and also tries to always have an IORING_OP_FSYNC in flight (as long as there are unsynced writes). As soon as the CQE for one fsync comes in, I submit another one with IOSQE_IO_DRAIN and remember the ending offset of the last submitted write. Because I am holding up notification of durability for all ops written before that offset until the CQE comes back, I want it to come back as quickly as possible, and ideally not wait for any writes past that offset (which will still be submitted while the fsync is in flight), since they can't be acknowledged anyway until the next fsync completes. Is this a valid use case for setting off and len on the SQE?

* Technically it is append-only at the logical level, but not at the FS layer. We preallocate the file by writing zero blocks before storing any actual data because that seems to be the only way to get fsyncs to complete with a latency of 1 disk write rather than 2 (at least on XFS and ext4). Even a file preallocated with fallocate still needs metadata writes on fsync.
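The bookkeeping described above could be sketched roughly as follows. All names here (`wal_sync_state`, `wal_next_fsync_range`, `wal_fsync_done`) are hypothetical and only illustrate the idea; they are not taken from any real code. The sketch assumes that everything below `durable_end` was covered by earlier fsyncs, and that an `off`/`len` of 0/0 means "sync the whole file", which is used as a fallback when the unsynced span does not fit in the 32-bit `len` field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical WAL bookkeeping, for illustration only. */
struct wal_sync_state {
    uint64_t durable_end;   /* every byte below this offset has been fsynced */
    uint64_t submitted_end; /* end offset of the last submitted write */
};

struct fsync_range {
    uint64_t off; /* value for sqe->off */
    uint32_t len; /* value for sqe->len */
};

/* Compute the range for the next fsync. Returns false if nothing is unsynced. */
bool wal_next_fsync_range(const struct wal_sync_state *s, struct fsync_range *out)
{
    assert(s->submitted_end >= s->durable_end);
    uint64_t pending = s->submitted_end - s->durable_end;
    if (pending == 0)
        return false;
    if (pending > UINT32_MAX) {
        /* Span does not fit in sqe->len: fall back to a whole-file fsync. */
        out->off = 0;
        out->len = 0;
    } else {
        out->off = s->durable_end;
        out->len = (uint32_t)pending;
    }
    return true;
}

/* Call when the fsync CQE reports success; synced_end is the value of
 * submitted_end that was captured when that fsync was submitted. */
void wal_fsync_done(struct wal_sync_state *s, uint64_t synced_end)
{
    if (synced_end > s->durable_end)
        s->durable_end = synced_end;
}
```

The resulting `off` and `len` would be copied into the fsync SQE, which would still be submitted with IOSQE_IO_DRAIN so that it does not start before the earlier writes have completed.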

axboe commented 11 months ago

It should be supported like that, it's just the documentation that is lacking. If you are so inclined, please submit a man page update for it!