axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

zero-copy send with IOSQE_CQE_SKIP_SUCCESS is broken #824

Open illiliti opened 1 year ago

illiliti commented 1 year ago

I noticed that zero-copy send can't be used with IOSQE_CQE_SKIP_SUCCESS. It took quite a while to find that out (I thought the problem was on my end), so I think it should at least be documented.
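
For reference, a minimal sketch of the combination in question (sockfd/buf/len are illustrative names; error handling elided):

// Sketch: zero-copy send with IOSQE_CQE_SKIP_SUCCESS set.
// Expectation was: no CQEs on success. In practice notification
// CQEs still arrive, since the flag isn't supported for send-zc.
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
io_uring_sqe_set_flags(sqe, IOSQE_CQE_SKIP_SUCCESS);
io_uring_submit(&ring);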

isilence commented 1 year ago

It's rather that it's not yet supported; I need to add it, and I agree that the man pages should be adjusted in the meantime.

Btw, let me ask what kind of behaviour you were expecting from it, considering that there are two CQEs including the notification?

illiliti commented 1 year ago

I was expecting no CQEs at all. I'm sending a static buffer; I don't need notifications.

bnbajwa commented 1 year ago

Agreed with illiliti: if I used IOSQE_CQE_SKIP_SUCCESS with sendzc, I imagine it would only be for buffers that remain "alive" for the entire duration of the program, i.e. I would not be freeing them upon notif CQEs.

More precisely, if IOSQE_CQE_SKIP_SUCCESS is set: the first CQE of sendzc should only show up if the send failed or was short, and the second CQE (i.e. the notif CQE) should never show up.

isilence commented 1 year ago

It doesn't seem viable apart from benchmarks and highly specific cases. The user should not be rewriting buffers before getting a notification, so the user needs a way to know when the buffer has been released. My plan is to:

1) Make IOSQE_CQE_SKIP_SUCCESS affect only the first CQE according to the generic rules. That gives consistency.

2) Add a send-zc flag that would copy the return code into the notification CQE and skip the completion (1st) CQE. IOW, for TCP you would get it roughly along with an ACK.

And allow them to be freely mixed. How does that sound?
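
A sketch of how the pair might be used together, with a hypothetical IORING_SEND_ZC_COMBINE_CQE standing in for the proposed (not yet existing) flag:

// Hypothetical flag: combines the completion and notification into
// one CQE carrying the return code (name and value are placeholders).
#define IORING_SEND_ZC_COMBINE_CQE (1U << 15)

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, IORING_SEND_ZC_COMBINE_CQE);
// Proposal (1): generic IOSQE_CQE_SKIP_SUCCESS rules then apply to
// that single CQE: posted on error, skipped on success.
io_uring_sqe_set_flags(sqe, IOSQE_CQE_SKIP_SUCCESS);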

illiliti commented 1 year ago

The buffer is constant. An example of such a buffer is a file embedded (#embed) at compile time or mmap'ed at program startup. It never changes, the application has no need to know when it's freed, and thus it needs no (from the application/user perspective) useless notifications about its state.

@bnbajwa's proposal is exactly what is needed: basically, make IOSQE_CQE_SKIP_SUCCESS work as it works for a normal non-zero-copy send, plus discard the second CQE. Your proposal voids the whole point of IOSQE_CQE_SKIP_SUCCESS because the user would still have to handle a CQE.

bnbajwa commented 1 year ago
> 2) Add a send-zc flag that would copy the return code into the notification CQE and skip the completion (1st) CQE. IOW, for TCP you would get it roughly along with an ACK.
>
> And allow them to be freely mixed. How does that sound?

If IOSQE_CQE_SKIP_SUCCESS and the send-zc flag can be freely mixed together, why does the send-zc flag need to implicitly skip the completion (1st) CQE? The send-zc flag should just copy the return code from the completion CQE into the notification CQE; it should not automatically skip the first CQE. If the user wants to skip the completion (1st) CQE, they can just add the IOSQE_CQE_SKIP_SUCCESS flag too (i.e. IOSQE_CQE_SKIP_SUCCESS + send-zc).

The send-zc flag would be useful; it would simplify my TCP sendzc implementation. This is because, currently, when we get a notif CQE we can't be sure whether we should free the buffer, as we can get notif CQEs even on failures and short sends, in which case I will want to retry the sendzc request with the same buffer. So knowing the return code in the notif CQE is important. I've implemented this in userspace by simply copying the return code from the 1st CQE into allocated memory which the notif CQE handler then reads from. So it's simple to implement in userspace; thus the flag itself is not essential, but it would be nice.
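
A sketch of that userspace workaround (zc_req, retry_send_zc() and release_buf() are illustrative names; both CQEs carry the same user_data):

// Per-request state shared by both CQEs via user_data.
struct zc_req {
    void *buf;
    int   result;  // filled in from the first (completion) CQE
};

void handle_zc_cqe(struct io_uring_cqe *cqe)
{
    struct zc_req *req = io_uring_cqe_get_data(cqe);

    if (cqe->flags & IORING_CQE_F_NOTIF) {
        // Notification CQE: its res carries no send result, so read
        // the stored one. A short-send check would additionally
        // compare req->result against the requested length.
        if (req->result < 0)
            retry_send_zc(req);
        else
            release_buf(req->buf);
    } else {
        // First (completion) CQE: stash the return code.
        req->result = cqe->res;
    }
}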

bnbajwa commented 1 year ago

> The user should not be rewriting buffers before getting a notification, so the user needs a way to know when the buffer has been released.

As with @illiliti, my imagined use case would also be constant buffers that are never modified, e.g. a file loaded in memory.

> 2) Add a send-zc flag that would copy the return code into the notification CQE and skip the completion (1st) CQE. IOW, for TCP you would get it roughly along with an ACK.

I've been thinking about this further: is your proposal that the completion (1st) CQE would always be skipped with this flag, even for failures/short sends? So essentially, it would turn sendzc from a 2-CQE request into an always-1-CQE request? If so, that seems reasonable, but I'm not sure it would be widely applicable.

> It doesn't seem viable apart from benchmarks and highly specific cases.

To be honest, I don't think any of the proposals in this thread for IOSQE_CQE_SKIP_SUCCESS are general enough. They are perhaps applicable only to quite specific cases.

> Make IOSQE_CQE_SKIP_SUCCESS affect only the first CQE according to the generic rules. That gives consistency.

I think IOSQE_CQE_SKIP_SUCCESS affecting only the first CQE seems fine/reasonable. However, I think we do need a flag to allow the 2nd (notif) CQE to be skipped (always?) as well.

Being able to control exactly which CQEs to skip seems like a more general proposal, i.e. we should be able to skip any of:

* first CQE only

* second CQE only

* both CQEs

For example, suppose I had a constant buffer loaded in memory that I wanted to send in chunks. I would want to skip the second CQE because I won't be modifying the buffer. However, I would want to receive the first CQE so I immediately know when I can send the next chunk (high throughput).
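
Under the current API that pattern can only be approximated by reaping and discarding the notif CQEs (sketch; send_next_chunk() is an illustrative helper):

struct io_uring_cqe *cqe;
unsigned head, seen = 0;

io_uring_for_each_cqe(&ring, head, cqe) {
    // First CQE: data was accepted, queue the next chunk right away.
    if (!(cqe->flags & IORING_CQE_F_NOTIF) && cqe->res > 0)
        send_next_chunk(cqe->res);
    // Notification CQEs are drained and ignored: the buffer is constant.
    seen++;
}
io_uring_cq_advance(&ring, seen);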

isilence commented 1 year ago

> The buffer is constant. An example of such a buffer is a file embedded (#embed) at compile time or mmap'ed at program startup. It never changes, the application has no need to know when it's freed, and thus it needs no (from the application/user perspective) useless notifications about its state.

Again, that's too limited a use case and seems more like a benchmark to me. We're not going to tailor it to one very specific case.

> @bnbajwa's proposal is exactly what is needed: basically, make IOSQE_CQE_SKIP_SUCCESS work as it works for a normal non-zero-copy send, plus discard the second CQE. Your proposal voids the whole point of IOSQE_CQE_SKIP_SUCCESS because the user would still have to handle a CQE.

How so? Would you elaborate? Set the two of them and you get the usual IOSQE_CQE_SKIP_SUCCESS behaviour: one CQE on failure or none on success, which is, to be honest, pretty useless unless linked with other requests.

isilence commented 1 year ago
> > 2) Add a send-zc flag that would copy the return code into the notification CQE and skip the completion (1st) CQE. IOW, for TCP you would get it roughly along with an ACK.
>
> > And allow them to be freely mixed. How does that sound?
>
> If IOSQE_CQE_SKIP_SUCCESS and the send-zc flag can be freely mixed together, why does the send-zc flag need to implicitly skip the completion (1st) CQE? The send-zc flag should just copy the return code from the completion CQE into the notification CQE; it should not automatically skip the first CQE. If the user wants to skip the completion (1st) CQE, they can just add the IOSQE_CQE_SKIP_SUCCESS flag too (i.e. IOSQE_CQE_SKIP_SUCCESS + send-zc).

It seems you don't understand the idea. You can also think of it as combining the two CQEs into one, which would be posted at the time the notification would previously have been posted.

> The send-zc flag would be useful; it would simplify my TCP sendzc implementation. This is because, currently, when we get a notif CQE we can't be sure whether we should free the buffer, as we can get notif CQEs even on failures and short sends, in which case I will want to retry the sendzc request with the same buffer. So knowing the return code in the notif CQE is important. I've implemented this in userspace by simply copying the return code from the 1st CQE into allocated memory which the notif CQE handler then reads from. So it's simple to implement in userspace; thus the flag itself is not essential, but it would be nice.

A notification CQE is posted strictly after the corresponding completion CQE. If that's not the case, please report a bug.

isilence commented 1 year ago

> > The user should not be rewriting buffers before getting a notification, so the user needs a way to know when the buffer has been released.
>
> As with @illiliti, my imagined use case would also be constant buffers that are never modified, e.g. a file loaded in memory.
>
> > 2) Add a send-zc flag that would copy the return code into the notification CQE and skip the completion (1st) CQE. IOW, for TCP you would get it roughly along with an ACK.
>
> I've been thinking about this further: is your proposal that the completion (1st) CQE would always be skipped with this flag, even for failures/short sends? So essentially, it would turn sendzc from a 2-CQE request into an always-1-CQE request? If so, that seems reasonable, but I'm not sure it would be widely applicable.

No, the idea is to combine the two into one, and then generic IOSQE_CQE_SKIP_SUCCESS rules apply on top of that one, i.e. a CQE on error or none otherwise.

> > It doesn't seem viable apart from benchmarks and highly specific cases.
>
> To be honest, I don't think any of the proposals in this thread for IOSQE_CQE_SKIP_SUCCESS are general enough. They are perhaps applicable only to quite specific cases.
>
> > Make IOSQE_CQE_SKIP_SUCCESS affect only the first CQE according to the generic rules. That gives consistency.
>
> I think IOSQE_CQE_SKIP_SUCCESS affecting only the first CQE seems fine/reasonable. However, I think we do need a flag to allow the 2nd (notif) CQE to be skipped (always?) as well.
>
> Being able to control exactly which CQEs to skip seems like a more general proposal, i.e. we should be able to skip any of:
>
> * first CQE only
>
> * second CQE only
>
> * both CQEs
>
> For example, suppose I had a constant buffer loaded in memory that I wanted to send in chunks. I would want to skip the second CQE because I won't be modifying the buffer. However, I would want to receive the first CQE so I immediately know when I can send the next chunk (high throughput).

In the general case the user needs to do something with the buffer afterwards (free / reuse) and so should have a way to infer the state of the buffer; otherwise it can't be reused or even normally freed, only munmap()'ed at the end. Just skipping is not going to work.
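
For completeness, a sketch of that general lifecycle, assuming a hypothetical buffer pool (buf_pool_get()/buf_pool_put() are illustrative):

// A buffer may only be reused or freed once its notification CQE
// has arrived; user_data links the CQEs back to the buffer.
void *buf = buf_pool_get(pool, len);
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
io_uring_sqe_set_data(sqe, buf);
io_uring_submit(&ring);

// ... later, in the CQE loop:
if (cqe->flags & IORING_CQE_F_NOTIF)
    buf_pool_put(pool, io_uring_cqe_get_data(cqe));  // safe to reuse now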

pyhd commented 2 weeks ago

@isilence Excuse me, how's the progress? Because any error might lead wait(nr) for send_zc into a deadlock, a single CQE per request is still needed for wait(nr) to work.

isilence commented 2 weeks ago

I drafted something similar some time ago and can rebase and merge it, but I wonder if it even fits your app; that depends on your workload pattern. Notifications come much later than the main completions, and by merging them together you'll get a large latency before the result comes back. Do you use TCP? Does your program look like send(); recv(); send(); recv(); ...? Or can you send multiple times before getting anything back?

With how it is now, you can wait (in terms of how you choose that "nr") in two steps: first for the main completions and then for the notifications. Moreover, you don't want to eagerly wait for the notifications unless you're under memory pressure.

compl_expected = 0;
notif_expected = 0;

while (1) {
    while (send_more()) {
        queue_send_zc();
        compl_expected++;
    }

    to_wait = compl_expected;
    if (memory_pressure)
        to_wait += notif_expected;
    wait_cqes(nr=to_wait);

    while (cqe = peek_cqe()) {
        if (is_notification(cqe)) {
            notif_expected--;
        } else { // sendzc main completion
            compl_expected--;
            // F_MORE: a notification CQE will follow later
            if (cqe->flags & F_MORE)
                notif_expected++;
        }
    }
}

pyhd commented 2 weeks ago

> Notifications come much later than the main completions, and by merging them together you'll get a large latency before the result comes back. Do you use TCP? Does your program look like send(); recv(); send(); recv(); ...? Or can you send multiple times before getting anything back?

If non-zc can skip the completion CQE, why can't zc? Personally, notification CQEs are just used to reclaim buffers if necessary. Also, it is almost a free lunch for UDP. So IMHO it is just a complement, and meanwhile there could be another enhanced variant to skip the notification CQE.

> With how it is now, you can wait (in terms of how you choose that "nr") in two steps: first for the main completions and then for the notifications. Moreover, you don't want to eagerly wait for the notifications unless you're under memory pressure.

I still have to collect at least twice. My temporary workaround is io_uring_submit_and_wait_timeout(nr*2), so there's mostly no need to worry about rare error cases. However, wait(nr) would be preferable for precise reclaiming.
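
Roughly, that workaround looks like this sketch (assuming two CQEs per send_zc; the timeout bounds the wait when errors mean fewer CQEs arrive):

struct __kernel_timespec ts = { .tv_sec = 1 };
struct io_uring_cqe *cqe;

// Wait for up to nr*2 CQEs (completion + notification per send),
// falling back to the timeout if some of them never arrive.
int ret = io_uring_submit_and_wait_timeout(&ring, &cqe, nr * 2, &ts, NULL);
if (ret < 0 && ret != -ETIME)
    handle_error(ret);  // illustrative error path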

isilence commented 1 week ago

> > Notifications come much later than the main completions, and by merging them together you'll get a large latency before the result comes back. Do you use TCP? Does your program look like send(); recv(); send(); recv(); ...? Or can you send multiple times before getting anything back?
>
> If non-zc can skip the completion CQE, why can't zc? Personally, notification CQEs are just used to reclaim buffers if necessary. Also, it is almost a free lunch for UDP. So IMHO it is just a complement, and meanwhile there could be another enhanced variant to skip the notification CQE.

Right, which is why I'm separately asking about your use case; it's for outlining the pitfalls.

> > With how it is now, you can wait (in terms of how you choose that "nr") in two steps: first for the main completions and then for the notifications. Moreover, you don't want to eagerly wait for the notifications unless you're under memory pressure.
>
> I still have to collect at least twice. My temporary workaround is io_uring_submit_and_wait_timeout(nr*2), so there's mostly no need to worry about rare error cases. However, wait(nr) would be preferable for precise reclaiming.

Exactly why I'm asking about your use case. First, it's not right, as you're not guaranteed to get 2 CQEs, e.g. when something bad happens early on like an allocation failure, so that can wait indefinitely. It could also be a MASSIVE performance / latency problem depending on the use case. Users generally don't/shouldn't want to eagerly wait for notifications unless memory pressure has built up; I've seen people fall into similar traps and find performance miserable.

I'd wait for nr, i.e. one CQE per send, but because that wait might have left more than nr CQEs in the ring, I'd try to opportunistically reap more CQEs.

struct io_uring_cqe *cqe;
unsigned head, nr_processed = 0;

io_uring_wait_cqe_nr(ring, &cqe, nr);
io_uring_for_each_cqe(ring, head, cqe) {
    handle_cqe(cqe);
    nr_processed++;
}
io_uring_cq_advance(ring, nr_processed);