axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

I/O response goes missing during a multi-threaded workload #998

Closed: psarkar24 closed this issue 9 months ago

psarkar24 commented 9 months ago

I'm using liburing 2.1-2build1 with Linux kernel 5.15.0. I have a multi-threaded program, though in practice only two threads are doing I/O at any given point in time.

Each thread has its own ring and initializes it as follows:

const int ret = io_uring_queue_init(maxevents, ring, 0);
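A minimal per-thread setup along these lines might look like the sketch below (simplified; the function name, error handling, and treating maxevents as a small per-thread constant are illustrative, not the real program):

#include <liburing.h>
#include <stdio.h>
#include <string.h>

// Sketch only: each worker thread owns its own struct io_uring, initialized
// with no flags, matching the io_uring_queue_init() call above.
static int setup_thread_ring(struct io_uring *ring, unsigned maxevents)
{
    // io_uring_queue_init() returns 0 on success, -errno on failure.
    const int ret = io_uring_queue_init(maxevents, ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return ret;
    }
    return 0;
}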

Read I/Os are submitted with the following code, driven by the parameter num_requests:

for (int64_t i = 0; i < num_requests; ++i) {
      struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
      // Check for sqe being NULL omitted
      // io_uring_prep_readv/writev returns void
      // Need to support kernel 5.4.0 as well...
      io_uring_prep_readv(
            sqe, io_details->fd, io_details->iovec, 1, io_details->offset);
      // io_uring_sqe_set_data returns void
      io_uring_sqe_set_data(sqe, io_details);
      // io_uring_submit returns the number of sqes successfully submitted;
      // as we are submitting only 1 sqe, the expected return value on
      // success is 1, and a negative error code on failure
      const int ret = io_uring_submit(ring);
      // Check for ret == 1 omitted
}

I/O completions are tracked with the following code, driven by two parameters, num_requests and min_num_requests:

  for (int64_t i = 0; i < num_requests; i++) {
    struct io_uring_cqe *cqe;
    struct IoDetails *io_details;
    int ret = 0;
    // For the minimum number of requests, we wait till completion
    if (i < min_num_requests) {
      ret = io_uring_wait_cqe(ring, &cqe);
    } else {  // Beyond this, we do not wait, just return if something completed
      ret = io_uring_peek_cqe(ring, &cqe);
    }
    if (!ret) {
      io_details = (struct IoDetails *)io_uring_cqe_get_data(cqe);
      io_uring_cqe_seen(ring, cqe);
    } else {
      break;
    }
  }

This works well for small files, but for files >= 100 GB an I/O goes missing every so often (once every 3000-10000 requests). Here is an example where a read at offset 2951741440 never gets a completion. In this run of the program, all invocations of io_uring_peek_cqe return -11 (EAGAIN). The log format is operation(ring_id, fd, offset, size):

io_uring_submit(139659743424736,98,2950692864,1048576)
io_uring_submit(139659743424736,98,2951741440,1048576)
io_uring_submit(139659743424736,98,2952790016,1048576)
io_uring_submit(139659743424736,98,2953838592,1048576)
io_uring_submit(139659743424736,98,2954887168,1048576)
io_uring_wait_cqe(139659743424736,98,2952790016,1048576)
io_uring_wait_cqe(139659743424736,98,2953838592,1048576)
io_uring_wait_cqe(139659743424736,98,2954887168,1048576)
io_uring_wait_cqe(139659743424736,98,2950692864,1048576)
io_uring_submit(139659743424736,98,2955935744,1048576)
io_uring_submit(139659743424736,98,2956984320,1048576)
io_uring_submit(139659743424736,98,2958032896,1048576)
io_uring_submit(139659743424736,98,2959081472,1048576)
io_uring_wait_cqe(139659743424736,98,2955935744,1048576)
io_uring_wait_cqe(139659743424736,98,2956984320,1048576)
io_uring_wait_cqe(139659743424736,98,2958032896,1048576)
io_uring_wait_cqe(139659743424736,98,2959081472,1048576)

Are there multi-threading issues I should be aware of? Even if each thread has its own ring, is any synchronization required? Could this be caused by the underlying filesystem? Are there any debugging logs I can turn on?

axboe commented 9 months ago

If each thread is using its own ring, no, there should not be any multi-threading issues that you need to be aware of. The kernel side is always fine; any locking concern would only be on the liburing side, and with separate rings that is not really a concern.

Do you have a reproducer for this? I don't think it's a threading issue, but it could be a bug in the 5.15-stable kernel, potentially.

psarkar24 commented 9 months ago

I'll work on creating a reproducer and attach to this issue.

andyg24 commented 9 months ago

Unless I misunderstand something, OP's code doesn't reap completions correctly, as it only waits for min_num_requests to complete.

However, this doesn't explain why request 2951741440 is not reaped the next time io_uring_wait_cqe() is called.

psarkar24 commented 9 months ago

Unless I misunderstand something, OP's code doesn't reap completions correctly, as it only waits for min_num_requests to complete.

Could you please clarify your concern? The intent is to wait for min_num_requests completions only; anything beyond that is reaped opportunistically with io_uring_peek_cqe. Is the call to io_uring_peek_cqe incorrect?

andyg24 commented 9 months ago

io_uring_peek_cqe() doesn't enter the kernel and only fetches the completions that were already added to the ring buffer by the last io_uring_wait_cqe().

Does your code work OK when you set min_num_requests to a very large number?

As written, your code should eventually overflow the CQ buffer, as you leave some completions behind with each iteration.
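One way to avoid leaving completions behind is to wait for the minimum you need and then drain whatever the kernel has already posted. A rough sketch using liburing's batch helpers (the function name, the batch size of 64, and the error handling are illustrative, not code from this issue):

// Sketch only: block until at least `min_wait` completions are available,
// then drain every CQE currently in the CQ ring so none are left behind.
static int reap_completions(struct io_uring *ring, unsigned min_wait)
{
    struct io_uring_cqe *cqe;
    struct io_uring_cqe *cqes[64];
    unsigned reaped = 0, n;

    // Waits for min_wait CQEs but does not consume any; the CQ head is
    // only advanced below.
    int ret = io_uring_wait_cqe_nr(ring, &cqe, min_wait);
    if (ret < 0)
        return ret;

    // io_uring_peek_batch_cqe() never enters the kernel; it only returns
    // CQEs that have already been posted to the ring.
    while ((n = io_uring_peek_batch_cqe(ring, cqes, 64)) > 0) {
        for (unsigned i = 0; i < n; i++) {
            // Process cqes[i]->res and io_uring_cqe_get_data(cqes[i]) here.
        }
        io_uring_cq_advance(ring, n);
        reaped += n;
    }
    return (int)reaped;
}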

psarkar24 commented 9 months ago

I have not provided the entire code (it was primarily presented to highlight possible threading issues), but to give a summary of the test driver: max events = 8, num_requests = 8 and min_num_requests = 4. Reading a large file involves thousands of requests, and the test driver breaks them into chunks based on the parameters above.

The invariant is that the test driver keeps track and issues as many io_uring_wait_cqe requests as there are io_uring_submit requests. Also, if any of the io_uring_peek_cqe requests succeed, the number of outstanding io_uring_wait_cqe requests is deducted accordingly. I just finished a 100 GB read with 1 MB I/O size and only 3 missing I/Os. So based on the assertion that all requests are accounted for, and from the experimental data, I don't see how there could be a CQE overflow.
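Roughly, the bookkeeping looks like this (simplified sketch, not the actual driver code):

// Sketch only: every successful io_uring_submit() increments `submitted`,
// and every CQE consumed (whether via wait or via peek) increments `reaped`.
struct io_accounting {
    int64_t submitted;
    int64_t reaped;
};

// At the end of a chunk, block until everything still owed has been reaped,
// so no request can be left unaccounted for.
static int drain_remaining(struct io_uring *ring, struct io_accounting *acct)
{
    while (acct->reaped < acct->submitted) {
        struct io_uring_cqe *cqe;
        const int ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
            return ret;
        // Process io_uring_cqe_get_data(cqe) and cqe->res here.
        io_uring_cqe_seen(ring, cqe);
        acct->reaped++;
    }
    return 0;
}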

Hope this helps clarify your concerns.

andyg24 commented 9 months ago

Got it. What you describe is different from the snippets in your original message. Happy to look at your working example once you have it.

psarkar24 commented 9 months ago

I'm unable to reproduce this outside of the environment where I am observing this. Will reopen this when I have more data, closing this issue for now.