axboe / liburing

Library providing helpers for the Linux kernel io_uring support

Write atomicity of `IORING_OP_WRITEV` #1282

Open gootorov opened 1 day ago

gootorov commented 1 day ago

Previous discussion: https://github.com/axboe/liburing/issues/1001#issuecomment-2480618839 https://github.com/axboe/liburing/issues/1001#issuecomment-2480735310

Consider sending messages over a single TCP socket. For simplicity, let's say each message is a string consisting of a header followed by a body.

To send the messages, I'm issuing vectored writes via io_uring_prep_writev(), where the iovec for each message points to the header and the body. In pseudocode:

char* FIRST_MSG_HEADER = "HEADER";
char* FIRST_MSG_BODY = "AAAA";

char* SECOND_MSG_HEADER = "HEADER";
char* SECOND_MSG_BODY = "BBBBBBBB";

// First message
struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
struct iovec first_msg_iov[2] = {};
first_msg_iov[0].iov_base = FIRST_MSG_HEADER;
first_msg_iov[0].iov_len = ...;
first_msg_iov[1].iov_base = FIRST_MSG_BODY;
first_msg_iov[1].iov_len = ...;
io_uring_prep_writev(sqe, fd, first_msg_iov, 2, 0);
io_uring_sqe_set_data(sqe, first_msg_iov);

// Second message
sqe = io_uring_get_sqe(ring);
struct iovec second_msg_iov[2] = {};
second_msg_iov[0].iov_base = SECOND_MSG_HEADER;
second_msg_iov[0].iov_len = ...;
second_msg_iov[1].iov_base = SECOND_MSG_BODY;
second_msg_iov[1].iov_len = ...;
io_uring_prep_writev(sqe, fd, second_msg_iov, 2, 0);
io_uring_sqe_set_data(sqe, second_msg_iov);

// Submit SQEs
// Assume the iovecs and message contents are not on the stack and live until we get them back in CQEs
io_uring_submit(ring);

The pseudocode example does not call io_uring_enter() explicitly - let's say SQPOLL is used (I'm not sure if it changes anything, but let's say it is). Additional invariants:

On the receive side of this connection, let's say I'm issuing recv(). There are several possibilities that are fine:

// This is fine
recv() // returns HEADERAAAAHEADERBBBBBBBB

// This is also fine
recv() // returns HEADERBBBBBBBBHEADERAAAA

// This is also fine. Several short reads
recv() // returns HEA
recv() // returns DERAAA
recv() // returns AHEADERBBBBBBBB

// And this is also fine, again - short reads, but different order of messages
recv() // returns HEADERBB
recv() // returns BBBBBBHEADERAAAA

But there are possibilities that I'd call not fine:

// Not fine
recv() // returns HEADERHEADERBBBBBBBBAAAA

// Also not fine
recv() // returns HEADERHEADERAAAABBBBBBBB

If I understood Pavel correctly, the "not fine" possibilities may happen with io_uring. Is this correct?

I think io_uring's behavior with writev SQEs makes it quite difficult to use, and also very non-intuitive. Man pages do mention atomicity (only at the process/thread level, but still...). From man writev:

The data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes; analogously, readv() is guaranteed to read a contiguous block of data from the file, regardless of read operations performed in other threads or processes that have file descriptors referring to the same open file description (see open(2)).

Would it be possible to have more atomic behavior for writev SQEs in io_uring? Either by default or as an SQE flag.

gootorov commented 1 day ago

I'm looking at io_uring_prep_send() documentation:

  Both of the above send variants may be used with provided
  buffers, where rather than pass a buffer in directly with the
  request, IOSQE_BUFFER_SELECT is set in the SQE flags field, and
  additionally a buffer group ID is set in the SQE buf_group field.
  By using provided buffers with send requests, the application can
  prevent any kind of reordering of the outgoing data which can
  otherwise occur if the application has more than one send request
  inflight for a single socket. This provides better pipelining of
  data, where previously the app needed to manually serialize
  sends.

To double confirm - I copy "HEADER" into one buffer, and "AAAA" into another (both of these buffers belong to the same buffer group). This guarantees I'd get either "HEADERAAAAHEADERBBBBBBBB" or "HEADERBBBBBBBBHEADERAAAA" on the wire. Is this correct?

Also, once the kernel starts executing that send request, is the data copied from the provided buffer into some other socket buffer, or is it DMA'd to the NIC?

In other words, I'm curious if I'd be making two copies (my data into provided buffer, then provided buffer into socket buffer, then DMA) or only one (my data into provided buffer, then DMA).

Asmod4n commented 1 day ago

Since you are only issuing one submit, IOSQE_IO_LINK should help here.
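
For reference, a rough sketch of that idea, reusing the iovecs from the pseudocode above. Note that a link only orders the two requests relative to each other, and a short first write still counts as a successful completion of the link:

// Sketch: chain the two writev SQEs so the second is not started
// until the first has completed.
struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
io_uring_prep_writev(sqe, fd, first_msg_iov, 2, 0);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);   // link to the next SQE
io_uring_sqe_set_data(sqe, first_msg_iov);

sqe = io_uring_get_sqe(ring);
io_uring_prep_writev(sqe, fd, second_msg_iov, 2, 0);  // runs only after the first completes
io_uring_sqe_set_data(sqe, second_msg_iov);

io_uring_submit(ring);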

isilence commented 1 day ago

first_msg_iov[0].iov_base = FIRST_MSG_HEADER; first_msg_iov[1].iov_base = FIRST_MSG_BODY;

FWIW, the way it's split into iovec changes nothing in terms of races, atomicity, etc.

// Not fine
recv() // returns HEADERHEADERBBBBBBBBAAAA

// Also not fine
recv() // returns HEADERHEADERAAAABBBBBBBB

If I understood Pavel correctly, the "not fine" possibilities may happen with io_uring. Is this correct?

Correct, it can happen - still the same answer. By default there is no ordering guarantee for multiple concurrent requests, and data can get intermingled. You can even get something more bizarre, like an "...ABABBA..." pattern.

I think io_uring behavior with writev SQEs makes it quite difficult to use, and also very non intuitive.

I believe it is intuitive though: you're queuing two racing concurrent requests, and they should be thought of as such.

Man pages do mention atomicity (only at the process/thread level, but still...). From man writev:

The data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes; analogously, readv() is guaranteed to read a contiguous block of data from the file, regardless of read operations performed in other threads or processes that have file descriptors referring to the same open file description (see open(2)).

I have serious doubts about that, and opening the writev man page I see "(but see pipe(7) for an exception)". As far as I remember the kernel's TCP stack, it's not true in this case either.

Would it be possible to have more atomic behavior for writev SQEs in io_uring? Either by default or as an SQE flag.

That can easily be done in user space, and it'll even be more efficient than submitting two requests. Yes, that would take a bit more code, but io_uring is a low-level interface and it's up to upper layers / libraries to add niceness. Doing it in the kernel would very likely add overhead for all users of io_uring, which is not a good idea when there are better alternatives.

You can also use provided buffers for ordering.

isilence commented 1 day ago

To double confirm - I copy "HEADER" into one buffer, and "AAAA" into another (both of these buffers belong to the same buffer group). This guarantees I'd get either "HEADERAAAAHEADERBBBBBBBB" or "HEADERBBBBBBBBHEADERAAAA" on the wire. Is this correct?

It doesn't order multiple requests with each other. With provided buffers you can push all your data (both "B" and "A") into a (single) provided buffer ring in the order you wish it to be sent, and queue just one request, which will try to send all the data in the order it was added to the ring.
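
For concreteness, a rough sketch of that approach (assuming a liburing recent enough to have io_uring_setup_buf_ring() and a kernel that supports provided buffers for send; BGID, NR_BUFS, sockfd and the message buffers/lengths are placeholders, and the exact flags needed for one request to drain multiple buffers may depend on the kernel version):

#define BGID    0              /* arbitrary buffer group id */
#define NR_BUFS 8              /* ring entries, must be a power of two */

int err;
struct io_uring_buf_ring* br = io_uring_setup_buf_ring(ring, NR_BUFS, BGID, 0, &err);

/* Add the messages in the order they should go out on the wire. */
io_uring_buf_ring_add(br, first_msg, first_len, 0, io_uring_buf_ring_mask(NR_BUFS), 0);
io_uring_buf_ring_add(br, second_msg, second_len, 1, io_uring_buf_ring_mask(NR_BUFS), 1);
io_uring_buf_ring_advance(br, 2);

/* One send that selects from the buffer group instead of taking an address. */
struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
io_uring_prep_send(sqe, sockfd, NULL, 0, 0);
io_uring_sqe_set_flags(sqe, IOSQE_BUFFER_SELECT);
sqe->buf_group = BGID;
io_uring_submit(ring);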

Also, once the kernel starts executing that send request, is the data copied from the provided buffer into some another socket buffer, or is it DMA'd to the NIC?

The data is copied, that's how the networking stack works. If you want to avoid copies you can look up IORING_OP_SEND_ZC or generic zero copy features Linux supports.
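
For what it's worth, a minimal sketch of the zero-copy variant (assuming a kernel and liburing new enough to have io_uring_prep_send_zc(); sockfd, buf and len are placeholders). As far as I know, a zero-copy send normally posts two CQEs: the result CQE with IORING_CQE_F_MORE set, and a later notification CQE once the buffer may be reused:

struct io_uring_sqe* sqe = io_uring_get_sqe(ring);

/* Zero-copy send: the kernel references buf instead of copying it into
 * socket memory, so buf must stay untouched until the notification CQE
 * arrives. */
io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
io_uring_sqe_set_data(sqe, buf);
io_uring_submit(ring);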

gootorov commented 1 day ago

first_msg_iov[0].iov_base = FIRST_MSG_HEADER; first_msg_iov[1].iov_base = FIRST_MSG_BODY;

FWIW, the way it's split into iovec changes nothing in terms of races, atomicity, etc.

True. I only wanted to point out that I wanted to use vectored I/O. Let me share a little bit more about what I'm actually trying to achieve.

So, I'm actually working on a DPDK application. I receive raw packets from the NIC, but then I'd like to use io_uring as a "send to another application" mechanism (which could be on the same host, or on another one) as a "DPDK bypass".

Apologies if you're already familiar with DPDK (nvm - just saw your and David's work on zero-copy RX, very interesting), but as a short description - it's a kernel bypass. The packets are DMA'd directly into userspace memory (1G hugepages in my case), and there's an rx_burst() function I'm calling in a loop.

rx_burst() gives me a batch (typically up to 64) of mbuf*'s - each of which, for simplicity, let's say is just a pointer directly to the Ethernet header (in reality, there's also a metadata block, but it doesn't matter too much).

The payload I'm extracting may be scattered across several mbufs (for example, if there's IP fragmentation). And I also need my own "Header". Vectored I/O seems like a very nice fit for this: the first element points to my header, then the next elements point to payload fragment(s), which are scattered in memory.

I then pass this iovec, and only one copy is ever made - into the socket buffer (in my understanding).

In my naive view, I thought it would be possible to combine this "burstiness" with io_uring's batching. Basically, my core application loop looks like this (in pseudocode):

while (true) {
    auto mbufs = rx_burst();
    for mbuf in mbufs {
        struct header* hdr = serialize_header(mbuf);
        struct iovec* iov = build_iovec(mbuf, hdr);
        struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
        io_uring_prep_writev(sqe, fd, iov, iov_len, ...);
        io_uring_sqe_set_data(sqe, my_metadata_structure);
    }
    io_uring_submit(ring);

    // process CQEs; this gives me back pointers to headers, iovecs, and mbufs, so I can return them to a memory pool
    io_uring_for_each_cqe(ring, head, cqe) {
        // process CQE
    }
}

There may be a single fd I'm writing to, or several I'm load balancing across. After your explanations, I can see why this approach wouldn't work. Perhaps I'm asking for too much :).

I believe it is intuitive though, you're queuing two racing concurrent requests, it should be thought as such.

It makes sense now, thank you for your explanations. This was a misunderstanding on my side - my initial impression of io_uring was that I put elements on a ring, and the kernel takes them one by one and executes/schedules them internally depending on device/socket/etc. readiness. I didn't expect that it would execute them concurrently and that it's possible to have racy writes.

That can easily be done in user space, and it'll even be more efficient than submitting two requests. Yes, that would take a bit more code, but io_uring is a low-level interface and it's up to upper layers / libraries to add niceness.

Apologies if that's been answered before or if I'm not seeing it. But what's the condition for the kernel to post a CQE, in the case of writing to a TCP socket specifically? Is it just finishing the copy into the socket buffer, or is it getting an ACK? Or something else? I'm just afraid it would be very slow if an ACK were required to post a CQE. One other thing I expected from the kernel - the writes I'm submitting vary greatly in size. Could be 60 bytes, could be 15k. In my understanding, the kernel would split the data in the socket buffer nicely depending on the MTU. In my (data-corrupting) initial attempt I saw TCP segments of up to 65k when traveling over loopback. I wouldn't get this, of course, if the completion condition were an ACK.

I will try to serialize write requests from user space as you suggested. Thank you.
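
A minimal sketch of one way such user-space serialization could look (struct msg, struct conn, submit_next() and on_write_cqe() are hypothetical names; at most one write is kept in flight per fd, and short-write handling is only stubbed out):

#include <stdbool.h>
#include <liburing.h>

/* Hypothetical queued message: iovecs plus a link for a per-connection FIFO. */
struct msg {
    struct iovec* iov;
    int iovcnt;
    struct msg* next;
};

/* Hypothetical per-connection state: at most one write in flight per fd. */
struct conn {
    int fd;
    bool write_inflight;
    struct msg* pending_head;
};

/* Submit the next queued message, but only if nothing is currently in flight. */
static void submit_next(struct io_uring* ring, struct conn* c)
{
    if (c->write_inflight || !c->pending_head)
        return;

    struct msg* m = c->pending_head;
    struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
    io_uring_prep_writev(sqe, c->fd, m->iov, m->iovcnt, 0);
    io_uring_sqe_set_data(sqe, c);
    c->write_inflight = true;
}

/* Called from the CQE loop when this connection's write completes. */
static void on_write_cqe(struct io_uring* ring, struct conn* c, int res)
{
    c->write_inflight = false;
    if (res >= 0) {
        /* On a short write, adjust m->iov / iov_len instead of popping. */
        c->pending_head = c->pending_head->next;
    }
    submit_next(ring, c);
}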

gootorov commented 12 hours ago

That can easily be done in user space, and it'll even be more efficient than submitting two requests. Yes, that would take a bit more code, but io_uring is a low-level interface and it's up to upper layers / libraries to add niceness. Doing it in the kernel would very likely add overhead for all users of io_uring, which is not a good idea when there are better alternatives.

One more thing. If I have multiple TCP connections, is my understanding correct that I can have multiple write requests concurrently, but no more than one write request per file descriptor?

isilence commented 7 hours ago

That can easily be done in user space, and it'll even be more efficient than submitting two requests. Yes, that would take a bit more code, but io_uring is a low-level interface and it's up to upper layers / libraries to add niceness.

Apologies if that's been answered before or if I'm not seeing it. But what's the condition for the kernel to post a CQE, in the case of writing to a TCP socket specifically? Is it just finishing the copy into the socket buffer, or is it getting an ACK? Or something else?

It behaves the same way send(2) / etc. do. So in the common case the kernel will copy data into kernel-allocated memory, at which point it returns back and reports the result. That data in kernel buffers will be sent some time later. IOW, no waiting for an ACK. That's the reason send CQEs are usually already posted by the time the submission part of the syscall ends.

Short writes are always possible as well, and the likelihood depends on the protocol / etc. and the flags you pass (NOWAIT, MSG_WAITALL, etc.).

I'm just afraid it would be very slow if an ACK were required to post a CQE. One other thing I expected from the kernel - the writes I'm submitting vary greatly in size. Could be 60 bytes, could be 15k. In my understanding, the kernel would split the data in the socket buffer nicely depending on the MTU. In my (data-corrupting) initial attempt I saw TCP

The networking stack hides all these details from you. You can try zero-copy (IORING_OP_SEND[MSG]_ZC, MSG_ZEROCOPY), but then there might be more variance depending on your iov entry sizes. E.g. in this case the driver can get a bunch of partially filled pages, in which case the DMA transfer can be somewhat more expensive.

isilence commented 7 hours ago

That can easily be done in user space, and it'll even be more efficient than submitting two requests. Yes, that would take a bit more code, but io_uring is a low-level interface and it's up to upper layers / libraries to add niceness. Doing it in the kernel would very likely add overhead for all users of io_uring, which is not a good idea when there are better alternatives.

One more thing. If I have multiple TCP connections, is my understanding correct that I can have multiple write requests concurrently, but no more than one write request per file descriptor?

It's perfectly allowed to have concurrent requests targeting different connections. For what it's worth, they should be different connections, i.e. dup(2) wouldn't do.

Also FWIW, it might be ok to have concurrent writes if the file / socket / protocol supports it and gives some guarantees, though you'd need to take care that io_uring doesn't internally retry a short write.