Open boxerab opened 7 months ago
What is concurrency for non io_uring? Are you comparing 1 thread for io_uring vs multiple threads for the other one?
0 or 1 represents false or true.
For example:
Run with concurrency = 48, store to disk = 1, direct = 1, use uring = 1
: 999.392985 ms
represents 48 threads, each working on small buffers and then queuing the buffers for storage with uring using O_DIRECT.
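The benchmark's source isn't included here, but to make the model concrete, queuing one O_DIRECT write through io_uring might look roughly like this minimal sketch (fd, the 4096-byte block size, and the offset are placeholders; in a real program the buffer would be freed when the completion arrives):

/* Sketch only (not the benchmark's code): queue one write on a file opened
 * with O_DIRECT. O_DIRECT requires buffer address, length and file offset
 * to be aligned to the device's logical block size (4096 assumed here). */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <liburing.h>

static int queue_direct_write(struct io_uring *ring, int fd,
                              off_t offset, size_t len)
{
    void *buf;
    if (posix_memalign(&buf, 4096, len))   /* aligned buffer for O_DIRECT */
        return -1;
    memset(buf, 'x', len);

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe) {
        free(buf);
        return -1;
    }
    io_uring_prep_write(sqe, fd, buf, len, offset);
    io_uring_sqe_set_data(sqe, buf);       /* so the completion handler can free it */
    return io_uring_submit(ring);
}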
So, let me confirm: "concurrency = 2, use uring = 1" means there are 2 threads, each thread has an io_uring instance, and each thread keeps one request in flight, i.e. QD=1? Similar to this:
void thread_fn(struct io_uring *ring) {
    while (1) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, fd, buf, len, offset);
        io_uring_submit(ring);              /* submit the single request */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(ring, &cqe);      /* wait for its completion */
        handle(cqe);
        io_uring_cqe_seen(ring, cqe);
    }
}
All serializing threads share the same uring queue.
See the code below: the first thread will create a queue, and all the others will share that queue (a usage sketch follows the code).
bool FileIOUring::initQueue(uint32_t shared_ring_fd)
{
    if (shared_ring_fd) {
        // Attach to the ring identified by shared_ring_fd so that both
        // rings share the same kernel async worker pool (io-wq).
        struct io_uring_params p;
        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_ATTACH_WQ;
        p.wq_fd = shared_ring_fd;
        int ret = io_uring_queue_init_params(QD, &ring, &p);
        if (ret < 0) {
            printf("io_uring_queue_init_params: %s\n", strerror(-ret));
            close();
            return false;
        }
    } else {
        // First thread: create a brand-new ring (and worker pool).
        int ret = io_uring_queue_init(QD, &ring, 0);
        if (ret < 0) {
            printf("io_uring_queue_init: %s\n", strerror(-ret));
            close();
            return false;
        }
    }
    return true;
}
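For illustration, the intended calling pattern might look like the sketch below. The ringFd() accessor is hypothetical (it is not shown in the project); it would simply return the first ring's ring_fd.

// Usage sketch (hypothetical names, not actual project code): the first
// FileIOUring creates the ring, the others attach their work queue to it.
FileIOUring first;
first.initQueue(0);                    // creates a new ring + io-wq pool

FileIOUring worker;
worker.initQueue(first.ringFd());      // hypothetical accessor returning ring.ring_fd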
Do you have a reproducer? It's not clear what the tool is doing, and without understanding that, any performance reasoning would be futile.
When I said separate rings, it means there are separate struct io_uring, each separately initialised with io_uring_queue_init*(), regardless of whether it has IORING_SETUP_ATTACH_WQ set or not. Each such io_uring instance will have a separate submission/completion queue pair. IORING_SETUP_ATTACH_WQ does nothing about that and is a separate optimisation.
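In other words, something like the following minimal sketch creates two distinct rings, each with its own SQ/CQ pair, while the second merely attaches to the first ring's worker pool:

/* Two separately initialised rings: each gets its own SQ/CQ pair;
 * IORING_SETUP_ATTACH_WQ only shares ring A's kernel worker pool (io-wq). */
#include <string.h>
#include <liburing.h>

static int init_two_rings(struct io_uring *a, struct io_uring *b, unsigned qd)
{
    struct io_uring_params p;
    int ret = io_uring_queue_init(qd, a, 0);      /* ring A: own SQ/CQ + io-wq */
    if (ret < 0)
        return ret;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_ATTACH_WQ;
    p.wq_fd = a->ring_fd;                         /* attach to A's worker pool */
    return io_uring_queue_init_params(qd, b, &p); /* ring B: own SQ/CQ, shared io-wq */
}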
Do you have a reproducer?
This is a moderately complex C++ project. If you're interested, I can share it with you.
Please do share it - even if it's complicated, just being able to run what you are running and trace it will often tell us a lot about how it's done, without needing to fully read and comprehend the sample program.
Great, here is the project.
And please also include how you are running it. The goal is to make this as trivial as possible for someone to reproduce :-)
I've added an INSTALL file here that details how to run with default settings.
It would be nice if simple benchmarks were added to example/benchmark;
this way other people could test how liburing runs on their systems! It would also help with comparing different language implementations of liburing.
@YoSTEALTH, not in liburing per se, but for storage there is fio/t/io_uring; someone may even adapt it to liburing and submit a patch.
@boxerab, hopefully we'll find time to look at it, but I suspect the comparison is not apples to apples, even without really looking at the numbers. When you compare synchronous with asynchronous, it's usually either:
a) There are N threads in both cases, and both run QD=1 per thread (i.e. one request executes in parallel per thread). In this case the asynchronous interface basically runs synchronously, which is not good.
b) The asynchronous interface runs just 1 thread but QD=N, i.e. it executes all requests in parallel (sketched after this list). In this case the async interface may well lose in throughput and/or latency, but the key is that it takes much less CPU.
And I don't understand which case is yours. There can be more options, a combination of the previous two, or for instance N threads generate IO requests and send them to a single IO thread, which executes them via io_uring. But a lot would depend on how it is actually implemented.
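For reference, a minimal sketch of case (b): one thread keeps N writes in flight on a single ring (fd, buffers and offsets are placeholders; error handling omitted):

/* Case (b) sketch: one thread, one ring, N requests in flight at once. */
#include <liburing.h>

static void submit_qd_n(struct io_uring *ring, int fd,
                        char **bufs, size_t len, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, fd, bufs[i], len, (off_t)i * len);
    }
    io_uring_submit(ring);                 /* one syscall submits all N */

    for (unsigned i = 0; i < n; i++) {     /* reap the N completions */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(ring, &cqe);
        io_uring_cqe_seen(ring, cqe);
    }
}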
b) The asynchronous interface runs just 1 thread but QD=N, i.e. it executes all requests in parallel. In this case the async interface may well lose in throughput and/or latency, but the key is that it takes much less CPU.
This is how the benchmark works, but QD=4 for all concurrency levels. Perhaps it should match the concurrency.
I will measure CPU usage for each configuration and see how that looks.
Here are results comparing CPU usage for uring vs. synchronous I/O with O_DIRECT. Timing and CPU usage are essentially identical in both cases.
$ /usr/bin/time -v ./iobench -c 48 -d -s
Run with concurrency = 48, store to disk = true, direct = true, use uring = false
: 930.559021 ms
Command being timed: "./iobench -c 48 -d -s"
User time (seconds): 19.41
System time (seconds): 0.99
Percent of CPU this job got: 2104%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 170112
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 44275
Voluntary context switches: 3386
Involuntary context switches: 169
Swaps: 0
File system inputs: 8
File system outputs: 5500888
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
$ /usr/bin/time -v ./iobench -c 48 -d
Run with concurrency = 48, store to disk = true, direct = true, use uring = true
: 939.254022 ms
Command being timed: "./iobench -c 48 -d"
User time (seconds): 17.72
System time (seconds): 3.56
Percent of CPU this job got: 2084%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1618560
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 405925
Voluntary context switches: 22384
Involuntary context switches: 6744
Swaps: 0
File system inputs: 8
File system outputs: 5500888
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
I've found the same issue in https://github.com/axboe/liburing/issues/912, and it hasn't been resolved yet. I've just tested the same case with the newest liburing on Ubuntu with kernel 6.8.5, and got the same bad results as before.
Glad to hear other people are testing this workflow. I hope this can be fixed eventually. With the current situation it doesn't make sense for me to use uring in my application.
@zhengshuxin you should re-open this issue if still broken.
They closed the issue, and I thought they might have solved it. But when I tested it again, I found the issue was still there. I wrote a very simple demo to test it at https://github.com/acl-dev/demo/tree/master/c/file ; anyone can use it to test liburing's write performance for file IO.
I've added a read comparison between sys read and liburing in https://github.com/acl-dev/demo/tree/master/c/file/main.c, and found that their read efficiency is similar. The comparison of read and write is below:
./file -n 100000
uring_write: open file.txt ok, fd=3
uring_write: write char=0
uring_write: write char=1
uring_write: write char=2
uring_write: write char=3
uring_write: write char=4
uring_write: write char=5
uring_write: write char=6
uring_write: write char=7
uring_write: write char=8
uring_write: write char=9
close file.txt ok, fd=3
uring write, total write=100000, cost=1541.28 ms, speed=64881.18
-------------------------------------------------------
sys_write: open file.txt ok, fd=3
sys_write: write char=0
sys_write: write char=1
sys_write: write char=2
sys_write: write char=3
sys_write: write char=4
sys_write: write char=5
sys_write: write char=6
sys_write: write char=7
sys_write: write char=8
sys_write: write char=9
close file.txt ok, fd=3
sys write, total write=100000, cost=80.58 ms, speed=1240925.73
========================================================
uring_read: read open file.txt ok, fd=3
uring_read: char[0]=0
uring_read: char[1]=1
uring_read: char[2]=2
uring_read: char[3]=3
uring_read: char[4]=4
uring_read: char[5]=5
uring_read: char[6]=6
uring_read: char[7]=7
uring_read: char[8]=8
uring_read: char[9]=9
close fd=3
uring read, total read=100000, cost=84.52 ms, speed=1183179.91
-------------------------------------------------------
sys_read: open file.txt ok, fd=3
sys_read: char[0]=0
sys_read: char[1]=1
sys_read: char[2]=2
sys_read: char[3]=3
sys_read: char[4]=4
sys_read: char[5]=5
sys_read: char[6]=6
sys_read: char[7]=7
sys_read: char[8]=8
sys_read: char[9]=9
sys read, total read=100000, cost=67.22 ms, speed=1487586.09
Great. It would be interesting to compare with running this test on xfs or btrfs.
Hello, here are some benchmark results I have compiled for disk storage with/without liburing, testing both buffered and direct IO.
Kernel: 6.7, file system: xfs, disk: NVMe SSD, CPU: 48-thread AMD
The benchmark essentially takes a series of small buffers, does some work on each buffer, then stores the results to disk. fsync is called at the end (sketched below). Unfortunately, liburing is slightly slower than blocking I/O. The benchmark project itself can be found here.
Note: 0 or 1 below represents false or true.
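For context, the final sync can itself go through the ring. This is a minimal sketch of an fsync submitted via io_uring, not the project's actual code; fd is a placeholder:

/* Sketch: issue the final fsync through the ring and wait for its result. */
#include <liburing.h>

static int uring_fsync(struct io_uring *ring, int fd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -1;
    io_uring_prep_fsync(sqe, fd, 0);   /* 0 = full fsync (not IORING_FSYNC_DATASYNC) */
    io_uring_submit(ring);

    struct io_uring_cqe *cqe;
    int ret = io_uring_wait_cqe(ring, &cqe);
    if (ret < 0)
        return ret;
    ret = cqe->res;                    /* 0 on success, -errno on failure */
    io_uring_cqe_seen(ring, cqe);
    return ret;
}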