axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

Benchmark results for storage to SSD with/without liburing #1123

Open boxerab opened 7 months ago

boxerab commented 7 months ago

Hello, here are some benchmark results I have compiled for disk storage with/without liburing testing both buffered and direct IO.

Kernel: 6.7, file system: XFS, disk: NVMe SSD, CPU: 48-thread AMD

The benchmark essentially takes a series of small buffers, does some work on each buffer, then stores the results to disk; fsync is called at the end. Unfortunately, liburing is slightly slower than blocking I/O.

The benchmark project itself can be found here.
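
For context, the blocking (non-uring) store path being timed is essentially the usual write-then-fsync pattern; the sketch below illustrates that pattern only and is not the benchmark's actual code (fd, buf, len and off are placeholder names):

#include <stdio.h>
#include <unistd.h>

/* Each worker thread does some work on a small buffer, then stores it.
 * fsync() runs once after all buffers have been written. */
static int store_blocking(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t ret = pwrite(fd, buf, len, off);

    if (ret < 0 || (size_t)ret != len) {
        perror("pwrite");
        return -1;
    }
    return 0;
}

/* ... once every buffer is stored: fsync(fd); */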

Note: 0 or 1 below represents false or true

Run with concurrency = 2, store to disk = 0, direct = 0, use uring = 0
 : 6213.034517 ms
Run with concurrency = 2, store to disk = 1, direct = 0, use uring = 0
 : 6194.979951 ms
Run with concurrency = 2, store to disk = 1, direct = 0, use uring = 1
 : 6635.473980 ms
Run with concurrency = 2, store to disk = 1, direct = 1, use uring = 0
 : 5723.335242 ms
Run with concurrency = 2, store to disk = 1, direct = 1, use uring = 1
 : 5159.495408 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 4, store to disk = 0, direct = 0, use uring = 0
 : 2668.155356 ms
Run with concurrency = 4, store to disk = 1, direct = 0, use uring = 0
 : 3611.529825 ms
Run with concurrency = 4, store to disk = 1, direct = 0, use uring = 1
 : 3696.223504 ms
Run with concurrency = 4, store to disk = 1, direct = 1, use uring = 0
 : 2901.635248 ms
Run with concurrency = 4, store to disk = 1, direct = 1, use uring = 1
 : 2663.072939 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 6, store to disk = 0, direct = 0, use uring = 0
 : 1723.227584 ms
Run with concurrency = 6, store to disk = 1, direct = 0, use uring = 0
 : 2666.758831 ms
Run with concurrency = 6, store to disk = 1, direct = 0, use uring = 1
 : 2632.156388 ms
Run with concurrency = 6, store to disk = 1, direct = 1, use uring = 0
 : 1981.299622 ms
Run with concurrency = 6, store to disk = 1, direct = 1, use uring = 1
 : 1807.754762 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 8, store to disk = 0, direct = 0, use uring = 0
 : 1330.179655 ms
Run with concurrency = 8, store to disk = 1, direct = 0, use uring = 0
 : 2095.117032 ms
Run with concurrency = 8, store to disk = 1, direct = 0, use uring = 1
 : 2122.957382 ms
Run with concurrency = 8, store to disk = 1, direct = 1, use uring = 0
 : 1523.161476 ms
Run with concurrency = 8, store to disk = 1, direct = 1, use uring = 1
 : 1367.067723 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 10, store to disk = 0, direct = 0, use uring = 0
 : 1061.972147 ms
Run with concurrency = 10, store to disk = 1, direct = 0, use uring = 0
 : 1899.936051 ms
Run with concurrency = 10, store to disk = 1, direct = 0, use uring = 1
 : 1877.989199 ms
Run with concurrency = 10, store to disk = 1, direct = 1, use uring = 0
 : 1244.832314 ms
Run with concurrency = 10, store to disk = 1, direct = 1, use uring = 1
 : 1099.402518 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 12, store to disk = 0, direct = 0, use uring = 0
 : 874.523286 ms
Run with concurrency = 12, store to disk = 1, direct = 0, use uring = 0
 : 1701.024333 ms
Run with concurrency = 12, store to disk = 1, direct = 0, use uring = 1
 : 1649.511243 ms
Run with concurrency = 12, store to disk = 1, direct = 1, use uring = 0
 : 1064.506254 ms
Run with concurrency = 12, store to disk = 1, direct = 1, use uring = 1
 : 932.916314 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 14, store to disk = 0, direct = 0, use uring = 0
 : 778.916140 ms
Run with concurrency = 14, store to disk = 1, direct = 0, use uring = 0
 : 1538.574320 ms
Run with concurrency = 14, store to disk = 1, direct = 0, use uring = 1
 : 1497.650033 ms
Run with concurrency = 14, store to disk = 1, direct = 1, use uring = 0
 : 946.505283 ms
Run with concurrency = 14, store to disk = 1, direct = 1, use uring = 1
 : 872.751521 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 16, store to disk = 0, direct = 0, use uring = 0
 : 669.623079 ms
Run with concurrency = 16, store to disk = 1, direct = 0, use uring = 0
 : 1424.179659 ms
Run with concurrency = 16, store to disk = 1, direct = 0, use uring = 1
 : 1458.929794 ms
Run with concurrency = 16, store to disk = 1, direct = 1, use uring = 0
 : 879.752639 ms
Run with concurrency = 16, store to disk = 1, direct = 1, use uring = 1
 : 987.317027 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 18, store to disk = 0, direct = 0, use uring = 0
 : 611.685649 ms
Run with concurrency = 18, store to disk = 1, direct = 0, use uring = 0
 : 2418.678168 ms
Run with concurrency = 18, store to disk = 1, direct = 0, use uring = 1
 : 1459.609819 ms
Run with concurrency = 18, store to disk = 1, direct = 1, use uring = 0
 : 865.681722 ms
Run with concurrency = 18, store to disk = 1, direct = 1, use uring = 1
 : 884.537595 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 20, store to disk = 0, direct = 0, use uring = 0
 : 572.201385 ms
Run with concurrency = 20, store to disk = 1, direct = 0, use uring = 0
 : 2631.481062 ms
Run with concurrency = 20, store to disk = 1, direct = 0, use uring = 1
 : 1380.879326 ms
Run with concurrency = 20, store to disk = 1, direct = 1, use uring = 0
 : 875.950846 ms
Run with concurrency = 20, store to disk = 1, direct = 1, use uring = 1
 : 872.418497 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 22, store to disk = 0, direct = 0, use uring = 0
 : 518.975461 ms
Run with concurrency = 22, store to disk = 1, direct = 0, use uring = 0
 : 1281.144535 ms
Run with concurrency = 22, store to disk = 1, direct = 0, use uring = 1
 : 1341.479703 ms
Run with concurrency = 22, store to disk = 1, direct = 1, use uring = 0
 : 861.209592 ms
Run with concurrency = 22, store to disk = 1, direct = 1, use uring = 1
 : 890.015725 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 24, store to disk = 0, direct = 0, use uring = 0
 : 486.925149 ms
Run with concurrency = 24, store to disk = 1, direct = 0, use uring = 0
 : 1262.528753 ms
Run with concurrency = 24, store to disk = 1, direct = 0, use uring = 1
 : 1267.014048 ms
Run with concurrency = 24, store to disk = 1, direct = 1, use uring = 0
 : 873.210672 ms
Run with concurrency = 24, store to disk = 1, direct = 1, use uring = 1
 : 908.068466 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 26, store to disk = 0, direct = 0, use uring = 0
 : 466.603771 ms
Run with concurrency = 26, store to disk = 1, direct = 0, use uring = 0
 : 1252.972816 ms
Run with concurrency = 26, store to disk = 1, direct = 0, use uring = 1
 : 1282.536363 ms
Run with concurrency = 26, store to disk = 1, direct = 1, use uring = 0
 : 858.094422 ms
Run with concurrency = 26, store to disk = 1, direct = 1, use uring = 1
 : 889.864092 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 28, store to disk = 0, direct = 0, use uring = 0
 : 454.630329 ms
Run with concurrency = 28, store to disk = 1, direct = 0, use uring = 0
 : 1201.286920 ms
Run with concurrency = 28, store to disk = 1, direct = 0, use uring = 1
 : 1232.661758 ms
Run with concurrency = 28, store to disk = 1, direct = 1, use uring = 0
 : 875.786470 ms
Run with concurrency = 28, store to disk = 1, direct = 1, use uring = 1
 : 922.097025 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 30, store to disk = 0, direct = 0, use uring = 0
 : 495.070279 ms
Run with concurrency = 30, store to disk = 1, direct = 0, use uring = 0
 : 1206.026119 ms
Run with concurrency = 30, store to disk = 1, direct = 0, use uring = 1
 : 1223.569305 ms
Run with concurrency = 30, store to disk = 1, direct = 1, use uring = 0
 : 874.067966 ms
Run with concurrency = 30, store to disk = 1, direct = 1, use uring = 1
 : 929.383958 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 32, store to disk = 0, direct = 0, use uring = 0
 : 435.855607 ms
Run with concurrency = 32, store to disk = 1, direct = 0, use uring = 0
 : 1216.080516 ms
Run with concurrency = 32, store to disk = 1, direct = 0, use uring = 1
 : 1223.707928 ms
Run with concurrency = 32, store to disk = 1, direct = 1, use uring = 0
 : 871.715323 ms
Run with concurrency = 32, store to disk = 1, direct = 1, use uring = 1
 : 936.795530 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 34, store to disk = 0, direct = 0, use uring = 0
 : 427.075910 ms
Run with concurrency = 34, store to disk = 1, direct = 0, use uring = 0
 : 1216.118478 ms
Run with concurrency = 34, store to disk = 1, direct = 0, use uring = 1
 : 1227.662183 ms
Run with concurrency = 34, store to disk = 1, direct = 1, use uring = 0
 : 870.027497 ms
Run with concurrency = 34, store to disk = 1, direct = 1, use uring = 1
 : 947.928060 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 36, store to disk = 0, direct = 0, use uring = 0
 : 436.021254 ms
Run with concurrency = 36, store to disk = 1, direct = 0, use uring = 0
 : 1222.277052 ms
Run with concurrency = 36, store to disk = 1, direct = 0, use uring = 1
 : 1226.627501 ms
Run with concurrency = 36, store to disk = 1, direct = 1, use uring = 0
 : 874.256454 ms
Run with concurrency = 36, store to disk = 1, direct = 1, use uring = 1
 : 949.608522 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 38, store to disk = 0, direct = 0, use uring = 0
 : 426.193849 ms
Run with concurrency = 38, store to disk = 1, direct = 0, use uring = 0
 : 1133.525446 ms
Run with concurrency = 38, store to disk = 1, direct = 0, use uring = 1
 : 1247.804088 ms
Run with concurrency = 38, store to disk = 1, direct = 1, use uring = 0
 : 878.324956 ms
Run with concurrency = 38, store to disk = 1, direct = 1, use uring = 1
 : 937.863404 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 40, store to disk = 0, direct = 0, use uring = 0
 : 431.698457 ms
Run with concurrency = 40, store to disk = 1, direct = 0, use uring = 0
 : 1131.224651 ms
Run with concurrency = 40, store to disk = 1, direct = 0, use uring = 1
 : 1246.924182 ms
Run with concurrency = 40, store to disk = 1, direct = 1, use uring = 0
 : 874.366975 ms
Run with concurrency = 40, store to disk = 1, direct = 1, use uring = 1
 : 947.108277 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 42, store to disk = 0, direct = 0, use uring = 0
 : 422.130728 ms
Run with concurrency = 42, store to disk = 1, direct = 0, use uring = 0
 : 1130.728607 ms
Run with concurrency = 42, store to disk = 1, direct = 0, use uring = 1
 : 1243.905110 ms
Run with concurrency = 42, store to disk = 1, direct = 1, use uring = 0
 : 897.637893 ms
Run with concurrency = 42, store to disk = 1, direct = 1, use uring = 1
 : 914.732763 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 44, store to disk = 0, direct = 0, use uring = 0
 : 420.690524 ms
Run with concurrency = 44, store to disk = 1, direct = 0, use uring = 0
 : 1275.922281 ms
Run with concurrency = 44, store to disk = 1, direct = 0, use uring = 1
 : 1285.821973 ms
Run with concurrency = 44, store to disk = 1, direct = 1, use uring = 0
 : 885.540083 ms
Run with concurrency = 44, store to disk = 1, direct = 1, use uring = 1
 : 997.991546 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 46, store to disk = 0, direct = 0, use uring = 0
 : 419.765191 ms
Run with concurrency = 46, store to disk = 1, direct = 0, use uring = 0
 : 1275.787164 ms
Run with concurrency = 46, store to disk = 1, direct = 0, use uring = 1
 : 1303.099823 ms
Run with concurrency = 46, store to disk = 1, direct = 1, use uring = 0
 : 891.274178 ms
Run with concurrency = 46, store to disk = 1, direct = 1, use uring = 1
 : 956.722336 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 48, store to disk = 0, direct = 0, use uring = 0
 : 417.375889 ms
Run with concurrency = 48, store to disk = 1, direct = 0, use uring = 0
 : 1305.577293 ms
Run with concurrency = 48, store to disk = 1, direct = 0, use uring = 1
 : 1437.136640 ms
Run with concurrency = 48, store to disk = 1, direct = 1, use uring = 0
 : 896.589395 ms
Run with concurrency = 48, store to disk = 1, direct = 1, use uring = 1
 : 999.392985 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
isilence commented 7 months ago

What is concurrency for non io_uring? Are you comparing 1 thread for io_uring vs multiple threads for the other one?

boxerab commented 7 months ago

What is concurrency for non io_uring? Are you comparing 1 thread for io_uring vs multiple threads for the other one?

0 or 1 represents false or true.

For example

Run with concurrency = 48, store to disk = 1, direct = 1, use uring = 1
 : 999.392985 ms

represents 48 threads each working on small buffers and then queuing the buffers for storage with uring using O_DIRECT.

isilence commented 7 months ago

So, let me confirm, "concurrency = 2, use uring = 1" means there are 2 threads, each thread has an io_uring instance, and each thread keeps one request in flight, i.e. QD=1? Similar to this:

void thread_fn(ring) {
    while (1) {
        sqe = get_sqe();
        prep_write(sqe);
        io_uring_submit(ring);

        cqe = io_uring_wait();
        handle(cqe);
    }
}
boxerab commented 7 months ago

So, let me confirm, "concurrency = 2, use uring = 1" means there are 2 threads, each thread has an io_uring instance, and each thread keeps one request in flight, i.e. QD=1? Similar to this:

void thread_fn(ring) {
  while (1) {
      sqe = get_sqe();
      prep_write(sqe);
      io_uring_submit(ring);

      cqe = io_uring_wait();
      handle(cqe);
  }
}

All serializing threads share the same uring queue.

See code below: the first thread will create a queue, and all the others will share that queue.

bool FileIOUring::initQueue(uint32_t shared_ring_fd)
{
    if (shared_ring_fd) {
        // Attach to the ring identified by shared_ring_fd so this ring
        // reuses its kernel async worker pool (IORING_SETUP_ATTACH_WQ),
        // while still getting its own submission/completion queues.
        struct io_uring_params p;
        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_ATTACH_WQ;
        p.wq_fd = shared_ring_fd;
        int ret = io_uring_queue_init_params(QD, &ring, &p);
        if (ret < 0) {
            printf("io_uring_queue_init_params: %s\n", strerror(-ret));
            close();
            return false;
        }
    } else {
        // First ring: plain initialization; later rings attach to its
        // worker pool via IORING_SETUP_ATTACH_WQ.
        int ret = io_uring_queue_init(QD, &ring, 0);
        if (ret < 0) {
            printf("io_uring_queue_init: %s\n", strerror(-ret));
            close();
            return false;
        }
    }

    return true;
}
isilence commented 7 months ago

Do you have a reproducer? It's not clear what the tool is doing, and without understanding that, any performance reasoning would be futile.

When I said separate rings, it means there are separate struct io_uring instances, each separately initialised with io_uring_queue_init*(), regardless of whether IORING_SETUP_ATTACH_WQ is set or not. Each such io_uring instance has a separate submission/completion queue pair. IORING_SETUP_ATTACH_WQ does nothing about that and is a separate optimisation.
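
To illustrate the distinction, here is a minimal sketch (error checking omitted, and the queue depth of 64 chosen arbitrarily): two rings are still two fully separate SQ/CQ pairs even when the second one attaches to the first one's worker pool.

#include <liburing.h>
#include <string.h>

struct io_uring ring_a, ring_b;

void setup_two_rings(void)
{
    /* First ring: its own SQ/CQ pair and its own kernel async workers. */
    io_uring_queue_init(64, &ring_a, 0);

    /* Second ring: also its own SQ/CQ pair, but IORING_SETUP_ATTACH_WQ
     * makes it reuse ring_a's async worker pool instead of creating
     * another one. */
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_ATTACH_WQ;
    p.wq_fd = ring_a.ring_fd;
    io_uring_queue_init_params(64, &ring_b, &p);
}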

boxerab commented 7 months ago

Do you have a reproducer?

This is a moderately complex C++ project. If you're interested, I can share it with you.

axboe commented 7 months ago

Please do share it - even if it's complicated, just being able to run what you are running and trace it will often tell us a lot about how it's done, without needing to fully read and comprehend the sample program.

boxerab commented 7 months ago

Great, here is the project.

axboe commented 7 months ago

And please also include how you are running it. The goal is to make this as trivial as possible for someone to reproduce :-)

boxerab commented 7 months ago

I've added an INSTALL file here that details how to run with default settings.

YoSTEALTH commented 7 months ago

It would be nice if simple benchmarks were added to example/benchmark; that way other people could test how liburing runs on their systems! It would also help with comparing different language implementations of liburing.

isilence commented 7 months ago

@YoSTEALTH, not in liburing per se, but for storage there is fio/t/io_uring; someone may even adapt it to liburing and submit a patch.

isilence commented 7 months ago

@boxerab, hopefully we'll find time to look at it, but I suspect the comparison is not apples to apples, even without really looking at the numbers. When you compare synchronous with asynchronous, it's usually one of:

a) There are N threads in both cases, and both run QD1 per thread (i.e. one request executing in parallel per thread). In this case the asynchronous interface basically runs synchronously, which is not good.

b) The asynchronous interface runs just 1 thread but QD=N, i.e. executes all requests in parallel. In this case the async interface may well lose in throughput and/or latency, but the key is that it takes much less CPU.

And I don't understand which case is yours. There can be more options: a combination of the previous two, or, for instance, N threads generating IO requests and sending them to a single IO thread, which executes them via io_uring. But a lot depends on how it is actually implemented.
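
For case (b), the io_uring side would look roughly like the sketch below: one thread, one ring, all n writes submitted before waiting for any completion. This is only an illustration (fd, bufs, len and n are placeholders, error handling is omitted, and the ring is assumed to have been initialised with at least n entries):

#include <liburing.h>

void write_batch(struct io_uring *ring, int fd,
                 void **bufs, unsigned len, unsigned n)
{
    /* Queue all n writes first... */
    for (unsigned i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, fd, bufs[i], len, (__u64)i * len);
    }
    io_uring_submit(ring);

    /* ...then reap all n completions, so the writes execute in parallel
     * instead of one at a time. */
    for (unsigned i = 0; i < n; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(ring, &cqe);
        /* check cqe->res for a short write or an error here */
        io_uring_cqe_seen(ring, cqe);
    }
}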

boxerab commented 7 months ago

b) The asynchronous interface runs just 1 thread but QD=N, i.e. executes all requests in parallel. In this case the async interface may well lose in throughput and/or latency, but the key is that it takes much less CPU.

This is how the benchmark works, but QD=4 for all concurrency levels. Perhaps it should match the concurrency.

I will measure CPU usage for each configuration and see how that looks.

boxerab commented 7 months ago

Here are results comparing CPU usage for uring vs synchronous with O_DIRECT. Timing and CPU usage are essentially identical.

$ /usr/bin/time -v ./iobench  -c 48 -d -s
Run with concurrency = 48, store to disk = true, direct = true, use uring = false
 : 930.559021 ms
    Command being timed: "./iobench -c 48 -d -s"
    User time (seconds): 19.41
    System time (seconds): 0.99
    Percent of CPU this job got: 2104%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.97
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 170112
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 44275
    Voluntary context switches: 3386
    Involuntary context switches: 169
    Swaps: 0
    File system inputs: 8
    File system outputs: 5500888
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

$ /usr/bin/time -v ./iobench  -c 48 -d
Run with concurrency = 48, store to disk = true, direct = true, use uring = true
 : 939.254022 ms
    Command being timed: "./iobench -c 48 -d"
    User time (seconds): 17.72
    System time (seconds): 3.56
    Percent of CPU this job got: 2084%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.02
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1618560
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 405925
    Voluntary context switches: 22384
    Involuntary context switches: 6744
    Swaps: 0
    File system inputs: 8
    File system outputs: 5500888
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
zhengshuxin commented 1 month ago

I've found the same issue in https://github.com/axboe/liburing/issues/912, and it hasn't been resolved yet. I've just tested the same case with the newest liburing on Ubuntu with kernel 6.8.5, and got the same bad results as before.

boxerab commented 1 month ago

I've found the same issue in #912, and it hasn't been resolved yet. I've just tested the same case with the newest liburing on Ubuntu with kernel 6.8.5, and got the same bad results as before.

Glad to hear other people are testing this workflow. I hope this can be fixed eventually. In the current situation it doesn't make sense for me to use uring in my application.

zhengshuxin commented 1 month ago

@zhengshuxin you should re-open this issue if still broken.

They closed the issue, so I thought they might have solved it. But when I tested it again, I found the issue still exists. I wrote a very simple demo to test it at https://github.com/acl-dev/demo/tree/master/c/file; anyone can use it to test liburing's write performance for file IO.
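
A per-operation comparison like this usually boils down to one uring write submitted and waited on per iteration versus one plain write(2) per iteration. The sketch below shows that shape; it is an assumption about the demo's structure, not its actual code, and fd/buf/len are placeholders:

#include <liburing.h>
#include <unistd.h>

/* One submit + wait per write: every iteration pays the full
 * submission/completion round trip, so the effective QD is 1. */
void uring_write_once(struct io_uring *ring, int fd,
                      const void *buf, unsigned len, __u64 off)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    struct io_uring_cqe *cqe;

    io_uring_prep_write(sqe, fd, buf, len, off);
    io_uring_submit(ring);
    io_uring_wait_cqe(ring, &cqe);
    io_uring_cqe_seen(ring, cqe);
}

/* The synchronous counterpart is just the write(2) syscall itself. */
void sys_write_once(int fd, const void *buf, size_t len)
{
    write(fd, buf, len);
}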

zhengshuxin commented 1 month ago

I've added a file-read comparison between sys read and liburing in https://github.com/acl-dev/demo/tree/master/c/file/main.c, and found that their read efficiency is similar. The comparison of read and write results is below:

./file -n 100000
uring_write: open file.txt ok, fd=3
uring_write: write char=0
uring_write: write char=1
uring_write: write char=2
uring_write: write char=3
uring_write: write char=4
uring_write: write char=5
uring_write: write char=6
uring_write: write char=7
uring_write: write char=8
uring_write: write char=9
close file.txt ok, fd=3
uring write, total write=100000, cost=1541.28 ms, speed=64881.18
-------------------------------------------------------
sys_write: open file.txt ok, fd=3
sys_write: write char=0
sys_write: write char=1
sys_write: write char=2
sys_write: write char=3
sys_write: write char=4
sys_write: write char=5
sys_write: write char=6
sys_write: write char=7
sys_write: write char=8
sys_write: write char=9
close file.txt ok, fd=3
sys write, total write=100000, cost=80.58 ms, speed=1240925.73
========================================================
uring_read: read open file.txt ok, fd=3
uring_read: char[0]=0
uring_read: char[1]=1
uring_read: char[2]=2
uring_read: char[3]=3
uring_read: char[4]=4
uring_read: char[5]=5
uring_read: char[6]=6
uring_read: char[7]=7
uring_read: char[8]=8
uring_read: char[9]=9
close fd=3
uring read, total read=100000, cost=84.52 ms, speed=1183179.91
-------------------------------------------------------
sys_read: open file.txt ok, fd=3
sys_read: char[0]=0
sys_read: char[1]=1
sys_read: char[2]=2
sys_read: char[3]=3
sys_read: char[4]=4
sys_read: char[5]=5
sys_read: char[6]=6
sys_read: char[7]=7
sys_read: char[8]=8
sys_read: char[9]=9
sys read, total read=100000, cost=67.22 ms, speed=1487586.09
boxerab commented 1 month ago

I've added a file-read comparison between sys read and liburing in https://github.com/acl-dev/demo/tree/master/c/file/main.c, and found that their read efficiency is similar. The comparison of read and write results is below:

Great. It would be interesting to also run this test on xfs or btrfs for comparison.