Open calebsander opened 3 months ago
Implementing registered buffer support for zc sendmsg is at the top of my todo list. And you're right that replacing it with multiple small sends is a bad idea for performance. FWIW, one thing it doesn't help with is DMA: there were early prototypes before plumbing dma buf into the path, and it hasn't been upstreamed.
As for mixing copy and zerocopy, I'll need to think about how to pass a hint and whether we can do that. It was mentioned before privately; it makes sense when you have a small header + payload.
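To make the "small header + payload" case concrete, here is a minimal userspace sketch of how a caller might decide which iovecs are worth sending zero-copy. It is only an illustration: per this thread, io_uring has no per-iovec hint today, and the `ZC_MIN_BYTES` cutoff and the `mark_zerocopy_candidates()` helper are hypothetical names, not part of liburing or the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical cutoff; a real value would have to come from measurement. */
#define ZC_MIN_BYTES 4096

/*
 * For each iovec, record whether it is large enough to be a zero-copy
 * candidate (true) or should simply be copied (false).
 * Returns the number of zero-copy candidates.
 */
static size_t mark_zerocopy_candidates(const struct iovec *iov, size_t n,
                                       bool *want_zc)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        want_zc[i] = iov[i].iov_len >= ZC_MIN_BYTES;
        if (want_zc[i])
            count++;
    }
    return count;
}
```

With a 24-byte header followed by a 512 KB payload, only the payload would be marked; how such a mark could be passed down to the socket layer is exactly the open question.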
It's a bit aside from the original topic but still could be considered related. From man io_uring_enter I can see the following:
EINVAL IORING_OP_READV or IORING_OP_WRITEV was specified in the submission queue entry, but the io_uring instance has fixed buffers registered.
Is this limitation real (I don't see a check in the kernel 6.1 source, but I may easily have missed it)? And if so, is this a fundamental limitation?
It's a bit aside from the original topic but still could be considered related. From man io_uring_register I can see following: EINVAL IORING_OP_READV or IORING_OP_WRITEV was specified in the submission queue entry, but the io_uring instance has fixed registered.
I have a hard time reading it, even more so understanding what it means. Where did you get it from? I can't find any mention of IORING_OP_READV in io_uring_register.2.
Sorry, while trying to make it readable from a formatting perspective I dropped the key word from the quotation (fixed).
As an example, see the EINVAL cases in https://man7.org/linux/man-pages/man2/io_uring_enter.2.html.
And surely it's about io_uring_enter, not io_uring_register.
I see the line, but I don't understand where it came from. I don't remember such a restriction at any point in time, and it's surely not true: you can freely mix e.g. READV and READ_FIXED within a single ring. Registering buffers changes nothing for requests not using the feature.
@axboe was this implemented? Or are you closing it as "won't fix"?
Sorry, was probably a bit liberal in terms of closing old issues. We can keep this one open. It's not implemented.
I'll take care of what I mentioned. I hoped I could delegate, but it's not going as planned.
I would like to register my interest in an io_uring_prep_writev_fixed as well, or, if I may be so bold, to request io_uring_prep_writev_zc_fixed. We have a scenario where a writev is desired to reduce the overhead of sending parts of an RPC protocol over a socket, avoiding the cost of concatenating buffers before the send (or of sending multiple buffers).
Are either of these functions on the roadmap?
The writev variant should be easy enough to do, Pavel is doing the sendmsg variant right now. For zc, that's just O_DIRECT when it's for files, nothing needed beyond writev fixed.
"Doing right now" means the patchset is ready, but might get brushed up here and there. Here it is:
https://lore.kernel.org/io-uring/527e3fa3-cfcb-437a-80b1-1526358abcd6@gmail.com/T/
The caveat is, as Jens mentioned, that it's only useful for zero copy, i.e. SEND_ZC and O_DIRECT read/write. If you writev to a socket, the data will be copied anyway and registered buffers won't be of any help to performance.
We are using io_uring for TCP socket I/O in our application and have been experimenting with using io_uring's zero-copy send support. First, thank you all for the hard work put into io_uring! The performance improvements are truly impressive, and it's neat to see how many diverse use cases io_uring can support.
For some background, our application sends short headers (24 bytes) optionally followed by data (up to 512 KB) on TCP sockets. The headers and data live in different regions of memory. Since a header and data are ready to send at the same time, we find using a vectorized socket send operation (i.e. io_uring_prep_writev()) to send them together improves performance. We also use the vectorized send operations to coalesce multiple headers (and possibly data) that become ready to send in a short time span; we find this saves a lot of CPU overhead over sending each one individually. On workloads sending a lot of data, we see significant CPU time spent copying the userspace buffers into the kernel socket buffers, which motivates us to switch to zero-copy sends. Enabling zero-copy for all sends (i.e. switching to io_uring_prep_sendmsg_zc()) provides a great performance improvement on those workloads. However, it slightly increases the CPU usage of workloads sending only headers. (And it is likely limiting the improvement we can get on workloads sending data, since every data buffer has a corresponding header that also incurs the zero-copy overhead.)
We would love to be able to selectively enable zero-copy for the long data buffers and disable it for the headers. As far as I'm aware, io_uring doesn't currently have an interface for a vectorized send where some iovecs use zero-copy and some don't. Is that correct? Does the kernel's socket layer support an operation like that, or would it require significant changes to plumb it down?
In a similar vein, we would like to pre-register the data buffers to avoid the DMA mapping cost on every send. But it doesn't look like registered buffers are supported with vectorized operations (liburing doesn't have a _fixed variant for io_uring_prep_writev() or io_uring_prep_sendmsg_zc() like it does for io_uring_prep_write() and io_uring_prep_send_zc()). Is this because a vectorized operation can have multiple iovecs but there's only space for one registered buffer index in the SQE? Although a general solution to be able to specify a per-iovec buffer index (or not use fixed buffers) would be great, our use case doesn't quite require that. All the data iovecs come from a single contiguous memory region, so we could probably get by with a single buffer index in the SQE and having io_uring use the registered buffer for all the iovecs inside it.
I did try replacing the vectorized send operations with linked io_uring_prep_send() operations so the zero-copy and registered buffer settings could be separately configured on each one. (This requires the MSG_WAITALL flag, and I also sent the MSG_MORE flag for all SQEs before the last as a hint that the socket layer should wait for the subsequent sends before sending the data.) Unfortunately, the performance was dismal, much worse than using io_uring_prep_writev() without zero-copy. From the CPU profile, it looks like making more calls into the socket layer (even if not immediately calling into the driver to send the data) burns a lot of CPU time. And I also see the linked send SQEs not getting kicked off until returning from the io_uring_enter() syscall, so it seems like they are not all being issued synchronously.
Please let me know your thoughts--is it already possible to use different settings to send each iovec? Is there an alternate approach you would recommend? If this use case seems too specific to be worth supporting upstream, we could also roll our own kernel patch to support it in io_uring.
Thanks!
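To illustrate the "single buffer index for all the iovecs inside it" idea from the post, here is a small sketch of the containment check such a scheme would rely on: every data iovec must lie entirely within the one registered region. The `iovecs_within_region()` helper is a hypothetical userspace function written for this example; it is not an existing liburing or kernel API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Return true if every iovec lies entirely inside [region, region + region_len).
 * This is the precondition for describing all of them with one registered
 * buffer index.
 */
static bool iovecs_within_region(const struct iovec *iov, size_t n,
                                 const void *region, size_t region_len)
{
    uintptr_t start = (uintptr_t)region;
    uintptr_t end = start + region_len;

    for (size_t i = 0; i < n; i++) {
        uintptr_t base = (uintptr_t)iov[i].iov_base;

        if (base < start || base > end)
            return false;
        /* Compare against the remaining space to avoid overflow in base + len. */
        if (iov[i].iov_len > end - base)
            return false;
    }
    return true;
}
```

In the scenario described, the headers live elsewhere, so only the data iovecs would pass this check; the headers would still need the regular (copied) path.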