axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Read SQEs appear to be processed in the reverse order #1000

Closed. andyg24 closed this issue 11 months ago.

andyg24 commented 11 months ago

I only have access to 6.5 kernel and liburing 2.4. Apologies if this issue has already been fixed in a later release.

When I queue a bunch of read SQEs to an NVMe device, they appear to be "processed" in reverse order. I understand that this is an async API and there are no guarantees on the order in which operations complete, but on average I would expect that an operation that is queued first is also more likely to complete first.

Two scenarios demonstrated by the code below:

1) Submitting 100 sequential reads to a file. These are coalesced and processed atomically, but the CQEs always come back in reverse order (i.e. 99, 98, ... 0). I would expect the order to be 0 ... 99.

2) 100 reads that are a few blocks apart (int request_spacing = 4096 * 3). These are not coalesced, and they always complete starting from the last submitted SQEs:

cqe: 86 cqe: 87 cqe: 88 cqe: 89 cqe: 90 cqe: 91 cqe: 92 cqe: 93 cqe: 94 cqe: 95 cqe: 96 cqe: 97 cqe: 98 cqe: 99

The code for both cases (triggerable by request_spacing) is attached.

// dd if=/dev/zero of=test.dat bs=1M count=1
// gcc -o read_order read_order.cc -luring
// ./read_order

#define _GNU_SOURCE  // for O_DIRECT

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>  // close()

#include "liburing.h"

int main() {
  struct io_uring ring;

  if (io_uring_queue_init(128, &ring, 0 /* flags */) != 0) {
    perror("io_uring_queue_init");
    exit(1);
  }

  int fd = open("test.dat", O_RDONLY | O_DIRECT, 0644);

  if (fd < 0) {
    perror("Run dd if=/dev/zero of=test.dat bs=1M count=1");
    exit(1);
  }

  char* buffer = (char*) aligned_alloc(4096, 128 * 4096);

  if (!buffer) {
    perror("aligned_alloc");
    exit(1);
  }

  // Prepare read requests
  // int request_spacing = 4096;
  int request_spacing = 4096 * 3;  // for non-sequential reads
  int num_requests = 100;

  struct io_uring_sqe *sqe;

  for (int i = 0; i < num_requests; i++) {
    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
      perror("io_uring_get_sqe");
      exit(1);
    }
    io_uring_prep_read(sqe, fd, buffer + 4096 * i,
                       4096 /* size */,
                       request_spacing * i /* file offset */);
    io_uring_sqe_set_data64(sqe, i);
  }

  // Submit requests and reap completions
  if (io_uring_submit_and_wait(&ring, 1) < 0) {
    perror("io_uring_submit_and_wait");
    exit(1);
  }

  struct io_uring_cqe *cqe;
  unsigned head;
  unsigned i = 0;

  io_uring_for_each_cqe(&ring, head, cqe) {
    printf("cqe: %lld\n", cqe->user_data);
    i++;
  }

  io_uring_cq_advance(&ring, i);

  io_uring_queue_exit(&ring);

  close(fd);
  free(buffer);

  return 0;
}
axboe commented 11 months ago

This is due to how completions are handled. It's just the posting of ring completions that can get reversed, depending on timing. It doesn't mean they are issued out of order (and in fact, if they were, the coalescing you observed would not have occurred).

andyg24 commented 11 months ago

If it's only the completions that are reversed, wouldn't I be more likely to see completions for requests 0..10 than 89..99 in the second case with request_spacing = 4096 * 3?

axboe commented 11 months ago

It depends on the frequency at which they come in; some of them are likely batched by the drive as they complete at the same time. I can absolutely guarantee you that it's only CQE posting that may be reversed. You can check with blktrace if you wish.

Since the SQ side is a ring as well, entries can only be retrieved and issued in the order in which they were added. Generally things will either complete or queue inline; for some cases we punt to io-wq if we can't issue non-blocking, and for those cases issue may be reordered as well. But I'd be surprised if that's happening here. For O_DIRECT, everything should issue inline unless you run out of requests on the given nvme queue, and you'd need more pending IO for that to happen.
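
As an aside: if strict ordering is actually required, liburing can chain SQEs with IOSQE_IO_LINK so each one only starts after the previous one completes. A minimal sketch, reusing the variables from the reproducer above; note that linking serializes the reads (giving up parallelism), and a failed or short read cancels the rest of the chain:

for (int i = 0; i < num_requests; i++) {
  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  if (!sqe) {
    perror("io_uring_get_sqe");
    exit(1);
  }
  io_uring_prep_read(sqe, fd, buffer + 4096 * i, 4096, request_spacing * i);
  io_uring_sqe_set_data64(sqe, i);
  // Chain to the next SQE; the last one is left unlinked to end the chain.
  if (i + 1 < num_requests)
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);
}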

andyg24 commented 11 months ago

Thank you for the explanation. I will close for now and will reopen if I see any firm evidence that anything is submitted out of order.

I came across https://github.com/torvalds/linux/commit/34e0a279a993debaff03158fc2fbf6a00c093643, but I am guessing my 6.5 kernel already contains the patch.

andyg24 commented 11 months ago

Not super familiar with blktrace, but it does appear that requests are actually sent in reverse order to the driver:

259,0 4 310 0.000456102 3396 G R 546690944 + 8 [read_order]
259,0 4 312 0.000460408 3396 Q R 546690968 + 8 [read_order]
259,0 4 313 0.000460903 3396 G R 546690968 + 8 [read_order]
259,0 4 315 0.000464757 3396 Q R 546690992 + 8 [read_order]
259,0 4 316 0.000465256 3396 G R 546690992 + 8 [read_order]
259,0 4 318 0.000469170 3396 Q R 546691016 + 8 [read_order]
259,0 4 319 0.000469664 3396 G R 546691016 + 8 [read_order]
259,0 4 321 0.000473531 3396 Q R 546691040 + 8 [read_order]
259,0 4 322 0.000474020 3396 G R 546691040 + 8 [read_order]
259,0 4 324 0.000478126 3396 Q R 546691064 + 8 [read_order]
259,0 4 325 0.000478623 3396 G R 546691064 + 8 [read_order]
259,0 4 326 0.000496487 3396 D R 546691064 + 8 [read_order]
259,0 4 327 0.000496860 3396 D R 546691040 + 8 [read_order]
259,0 4 328 0.000497142 3396 D R 546691016 + 8 [read_order]
259,0 4 329 0.000497419 3396 D R 546690992 + 8 [read_order]
259,0 4 330 0.000497685 3396 D R 546690968 + 8 [read_order]
259,0 4 331 0.000497961 3396 D R 546690944 + 8 [read_order]
259,0 4 332 0.000498224 3396 D R 546690920 + 8 [read_order]

Note how the Q and G events (queuing) come in increasing sector order, while the D events (dispatch to the driver) come in decreasing order.

From the man page:

D -- issued: A request that previously resided on the block layer queue or in the i/o scheduler has been sent to the driver.

Am I reading this wrong? This is the read_order code above with request_spacing = 4096 * 3.

axboe commented 11 months ago

Here's a full trace on the current kernel: https://pastebin.com/icfMDfhx

I didn't dive into this yet, but an nvme device has a hook for dispatching a plug list directly. What happens when you queue a lot of IO is that a local per-task plug context is used, and the block layer adds requests to that. When it reaches a certain length, we flush it out before adding a new one. You can see that when user_data 0x20 is queued, there are already 32 requests in the plug; we push them out and you get a bunch of issue trace points. Since nvme is generally flash and doesn't care about ordering, they are dispatched in the opposite order from that plug. The commit you found is for drivers that don't have a queue_list handler.

That is just the issue side; completions come in in the order listed. The first completion we see is user_data 0x56 in that trace. As mentioned, completion posting is processed via task_work, which is also a LIFO list.
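
To make the LIFO point concrete, here is a minimal userspace sketch (illustrative only, not the kernel code) of a singly linked list where every element is pushed at the head, which is effectively what both the plug list and the task_work list do; walking it then visits entries in reverse insertion order:

#include <stdio.h>

struct node {
  struct node *next;
  int id;
};

// Push at the head: O(1) and needs no tail pointer, but it makes the list LIFO.
static void push(struct node **head, struct node *n) {
  n->next = *head;
  *head = n;
}

int main(void) {
  struct node nodes[4];
  struct node *head = NULL;

  for (int i = 0; i < 4; i++) {
    nodes[i].id = i;
    push(&head, &nodes[i]);
  }
  // The walk prints 3, 2, 1, 0 -- the same reversal seen on the issue side.
  for (struct node *n = head; n; n = n->next)
    printf("%d\n", n->id);
  return 0;
}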

andyg24 commented 11 months ago

As always, thank you for your thorough and patient explanation.

Mind me asking what flags you used for that trace? It's a lot more readable than the default blktrace output.

I wonder why the kernel folks fixed ordering for drivers that don't have a queue_list handler but not for nvme. It's not completely true that flash doesn't care about ordering -- there is a lot of parallelism, but if some requests happen to land on the same die, they will be queued internally. Ideally, the order specified by the user should be respected all the way to the SSD, and no reversals should happen even within a plug.

Also, from a user's perspective, it would be best (at least most intuitive) if completions were returned in the order they came from the SSD. Have you considered using something like a ring buffer instead of a LIFO list for task_work?

axboe commented 11 months ago

Mind me asking what flags you used for that trace? It's a lot more readable than the default blktrace output.

Basically just turned on tracing for block and io_uring:

# echo 1 > /sys/kernel/debug/tracing/events/io_uring/enable
# echo 1 > /sys/kernel/debug/tracing/events/block/enable
[...]
run workload
cat /sys/kernel/debug/tracing/trace

I agree that nvme isn't entirely indifferent to ordering, but given that the re-ordering happens only within a limited scope of the IO, it's probably not a huge concern. At least I haven't had anyone complain about it!

I also agree that it'd be nice not to reorder unnecessarily; it's only really done because of the data structures used. It's hard to do a lockless list that isn't LIFO. As completions reorder naturally with an async API anyway, I don't think it's a big deal. It's only if you start pondering issue ordering that you would really notice or pay attention to it, and perhaps question what is going on.
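
A sketch of why a lockless list naturally ends up LIFO (C11 atomics, illustrative only, not the kernel's implementation): the head pointer is the only word that can be swapped with a single compare-and-swap, so producers prepend, and a consumer that takes the whole list sees entries newest-first. Keeping FIFO order would mean maintaining a tail as well, which is the extra pointer discussed for the plug below.

#include <stdatomic.h>
#include <stddef.h>

struct work {
  struct work *next;
  // payload ...
};

// Lock-free push: retry the CAS until this entry becomes the new head.
static void work_push(_Atomic(struct work *) *head, struct work *w) {
  struct work *old = atomic_load_explicit(head, memory_order_relaxed);
  do {
    w->next = old;
  } while (!atomic_compare_exchange_weak_explicit(head, &old, w,
                                                  memory_order_release,
                                                  memory_order_relaxed));
}

// Lock-free take-all: grab the whole list at once, leaving it empty.
// The returned chain is in reverse (LIFO) order of the pushes.
static struct work *work_take_all(_Atomic(struct work *) *head) {
  return atomic_exchange_explicit(head, NULL, memory_order_acquire);
}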

andyg24 commented 11 months ago

It's just that the type of people who use io_uring are more likely to care about every last bit of performance. Sure, executing a batch of 32 requests on an SSD in reverse order isn't a huge deal, but if the fix is as simple as replacing rq_list_add() with rq_list_add_tail() as was done in https://github.com/torvalds/linux/commit/34e0a279a993debaff03158fc2fbf6a00c093643, perhaps worth the trouble?

axboe commented 11 months ago

We can't just use rq_list_add_tail(); that only works within the scope of that single function, as it does all the list manipulation there. We could obviously add a list tail pointer instead and do it, at the cost of storing an extra pointer in the plug. Not sure that's worth it. One of the goals of sending a list of requests to issue on the nvme side is that we can fill in all the command slots, then ring the SQ doorbell on that side for the device to start them. So while we add them in the reverse order, the device gets all of them available at the same time anyway.

andyg24 commented 11 months ago

The device gets all of them available at the same time, but it will still prioritize earlier requests if there is contention.

As I am not familiar with the code where request reversal happens, I will defer to your judgement that the cost may not be worth the benefit of keeping the requests in order. Closing the issue.

axboe commented 11 months ago

You could do something like this:

diff --git a/block/blk-core.c b/block/blk-core.c
index fdf25b8d6e78..8a71b3ac1aca 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1050,6 +1050,7 @@ void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
        return;

    plug->mq_list = NULL;
+   plug->mq_list_last = NULL;
    plug->cached_rq = NULL;
    plug->nr_ios = min_t(unsigned short, nr_ios, BLK_MAX_REQUEST_COUNT);
    plug->rq_count = 0;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900c1be1fee1..bb0a18068ffb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1284,7 +1284,7 @@ static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)

 static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
 {
-   struct request *last = rq_list_peek(&plug->mq_list);
+   struct request *last = plug->mq_list_last;

    if (!plug->rq_count) {
        trace_block_plug(rq->q);
@@ -1304,8 +1304,14 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
     */
    if (!plug->has_elevator && (rq->rq_flags & RQF_SCHED_TAGS))
        plug->has_elevator = true;
+
    rq->rq_next = NULL;
-   rq_list_add(&plug->mq_list, rq);
+   if (plug->mq_list_last) {
+       plug->mq_list_last->rq_next = rq;
+       plug->mq_list_last = rq;
+   } else {
+       plug->mq_list = plug->mq_list_last = rq;
+   }
    plug->rq_count++;
 }

@@ -2760,6 +2766,8 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 {
    struct request *rq;

+   plug->mq_list_last = NULL;
+
    /*
     * We may have been called recursively midway through handling
     * plug->mq_list via a schedule() in the driver's queue_rq() callback.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 51fa7ffdee83..dfcdda3a95a9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -962,6 +962,7 @@ void blk_mark_disk_dead(struct gendisk *disk);
  */
 struct blk_plug {
    struct request *mq_list; /* blk-mq requests */
+   struct request *mq_list_last;

    /* if ios_left is > 1, we can batch tag/rq allocations */
    struct request *cached_rq;
andyg24 commented 11 months ago

Thank you for doing this. It would be awesome if this patch could make it into the kernel.

andyg24 commented 11 months ago

This is a bit out of my depth, but couldn't you use mq_lastp / rq_list_add_tail() for consistency with the other fix?

axboe commented 11 months ago

Yeah you could use that helper for this case too. The downside here, imho, isn't as much the added 8 bytes or the conditional branch that we now have, it's that we have to pull in the cacheline of the previous request when adding the next one. That's rather unfortunate...

Probably won't pursue this any further until such time as we have rotational drives on nvme or similar, unless other reasons pop up.

andyg24 commented 11 months ago

https://www.neowin.net/news/ocp-22-worlds-first-seagate-nvme-20-hard-disk-hdd-is-now-in-flesh-and-blood/

:)

axboe commented 11 months ago

Oh I know, that's why I brought it up. But so far I haven't seen any in production, and until that happens, it's just vaporware imho.