axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Per-iovec zero-copy and fixed buffer settings for writev/sendmsg #1191

Open calebsander opened 1 month ago

calebsander commented 1 month ago

We are using io_uring for TCP socket I/O in our application and have been experimenting with using io_uring's zero-copy send support. First, thank you all for the hard work put into io_uring! The performance improvements are truly impressive, and it's neat to see how many diverse use cases io_uring can support.

For some background, our application sends short headers (24 bytes) optionally followed by data (up to 512 KB) on TCP sockets. The headers and data live in different regions of memory. Since a header and its data are ready to send at the same time, we find that using a vectorized socket send operation (i.e. io_uring_prep_writev()) to send them together improves performance. We also use the vectorized send operations to coalesce multiple headers (and possibly data) that become ready to send in a short time span; we find this saves a lot of CPU overhead over sending each one individually.

On workloads sending a lot of data, we see significant CPU time spent copying the userspace buffers into the kernel socket buffers, which motivates us to switch to zero-copy sends. Enabling zero-copy for all sends (i.e. switching to io_uring_prep_sendmsg_zc()) provides a great performance improvement on those workloads. However, it slightly increases the CPU usage of workloads sending only headers. (And it is likely limiting the improvement we can get on workloads sending data, since every data buffer has a corresponding header that also incurs the zero-copy overhead.)

We would love to be able to selectively enable zero-copy for the long data buffers and disable it for the headers. As far as I'm aware, io_uring doesn't currently have an interface for a vectorized send where some iovecs use zero-copy and some don't. Is that correct? Does the kernel's socket layer support an operation like that, or would it require significant changes to plumb it down?

In a similar vein, we would like to pre-register the data buffers to avoid the DMA mapping cost on every send. But it doesn't look like registered buffers are supported with vectorized operations (liburing doesn't have a _fixed variant for io_uring_prep_writev() or io_uring_prep_sendmsg_zc() like it does for io_uring_prep_write() and io_uring_prep_send_zc()). Is this because a vectorized operation can have multiple iovecs but there's only space for one registered buffer index in the SQE? Although a general solution to be able to specify a per-iovec buffer index (or not use fixed buffers) would be great, our use case doesn't quite require that. All the data iovecs come from a single contiguous memory region, so we could probably get by with a single buffer index in the SQE and having io_uring use the registered buffer for all the iovecs inside it.

I did try replacing the vectorized send operations with linked io_uring_prep_send() operations so the zero-copy and registered buffer settings could be separately configured on each one. (This requires the MSG_WAITALL flag, and I also set the MSG_MORE flag for all SQEs before the last as a hint that the socket layer should wait for the subsequent sends before transmitting the data.) Unfortunately, the performance was dismal, much worse than using io_uring_prep_writev() without zero-copy. From the CPU profile, it looks like making more calls into the socket layer (even if not immediately calling into the driver to send the data) burns a lot of CPU time. And I also see the linked send SQEs not getting kicked off until returning from the io_uring_enter() syscall, so it seems like they are not all being issued synchronously.

Please let me know your thoughts: is it already possible to use different settings to send each iovec? Is there an alternate approach you would recommend? If this use case seems too specific to be worth supporting upstream, we could also roll our own kernel patch to support it in io_uring.

Thanks!

isilence commented 1 month ago

Implementing registered buffer support for zc sendmsg is at the top of my todo list. And you're right that replacing it with multiple small sends is a bad idea for performance. FWIW, one thing it doesn't help with is DMA: there were early prototypes plumbing dma-buf into the path, but that hasn't been upstreamed.

As for mixing copy and zerocopy, I'll need to think about how to pass a hint and whether we can do that. It was mentioned privately before; it makes sense when you have a small header + payload.

vsolontsov-ll commented 1 month ago

It's a bit aside from the original topic but could still be considered related. From man io_uring_enter I can see the following:

   EINVAL         IORING_OP_READV or IORING_OP_WRITEV was specified in the submission queue entry, 
   but the io_uring instance has fixed buffers registered.

Is this limitation real (I don't see a check in the kernel 6.1 source, but I may easily have missed it)? And if so, is it a fundamental limitation?

isilence commented 1 month ago

> It's a bit aside from the original topic but still could be considered related. From man io_uring_register I can see following:
>
>     EINVAL IORING_OP_READV or IORING_OP_WRITEV was specified in the submission queue entry, but the io_uring instance has fixed registered.

I have a hard time reading it, even more so understanding what it means. Where did you get it from? I can't find any mention of IORING_OP_READV in io_uring_register.2.

vsolontsov-ll commented 1 month ago

Sorry, while trying to make it readable from a formatting perspective I dropped the key word ("fixed") from the quotation. As an example, see the EINVAL cases in https://man7.org/linux/man-pages/man2/io_uring_enter.2.html.

And surely it's about io_uring_enter, not io_uring_register.

isilence commented 1 month ago

> Sorry, while trying to make it readable from a formatting perspective I dropped the key word ("fixed") from the quotation. As an example, see the EINVAL cases in https://man7.org/linux/man-pages/man2/io_uring_enter.2.html.

I see the line, but I don't understand where it came from. I don't remember such a restriction at any point in time, and it's surely not true: you can freely mix e.g. READV and READ_FIXED within a single ring. Registering buffers changes nothing for requests not using the feature.