JackKelly opened 8 months ago

This is for the code in `main` at commit ef8c7b7d564ddf1dd9ef68240dc52ebef228d4a0.
The majority of time (the wide "mountain" in the middle of this flamegraph) is spent in `light_speed_io::io_uring_local::worker_thread_func`. In turn, the functions which make up most of the time in `worker_thread_func` are (in order, longest-running first):

- `io_cqring_wait` (this is the longest-running function by some margin)
- `light_speed_io::Operation::to_iouring_entry`
- `io_submit_sqes`
So, I think the priority is #49.
If we zoom into `light_speed_io::Operation::to_iouring_entry`, we can see the relative importance of these improvements:
Big breakthrough: Today, I figured out that I was doing something stupid! TL;DR: We're now getting throughput up to 960 MiB/s (up from about 220 MiB/s!) (i.e. better than a 4x speedup!).
LSIO now compares very favorably against `fio` and `object_store` (for reading 1,000 files, each file 256 kB, on my old Intel NUC box). `fio` gets, at best, about 900 MiB/s. `object_store::LocalFileSystem::get` gets about 250 MiB/s! :slightly_smiling_face:
What I had forgotten is that, in Rust, an `async` function isn't polled until we call `await` on the `Future` returned by the function. So we weren't actually submitting multiple reads concurrently! There was only ever one operation in flight in io_uring at any one time.

This was fixed by changing `async fn get` to `fn get`, and returning a `Box::pin(async {...})`.
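Here's a minimal sketch of the pattern (hypothetical names, not LSIO's real API): the work of handing the read to io_uring happens in the synchronous body of `get`, so calling `get` for many paths queues many reads before any of the returned futures are awaited.

```rust
use std::future::Future;
use std::pin::Pin;

struct Chunk;

// Before (lazy): `async fn get(path: String) -> Chunk`. The body doesn't run
// until the caller awaits the returned Future, so only one read is ever in
// flight at a time.
//
// After (eager): a plain `fn` whose synchronous body hands the read to the
// io_uring worker straight away, then returns a boxed Future that only waits
// for the completion.
fn get(path: String) -> Pin<Box<dyn Future<Output = Chunk>>> {
    let completion = submit_to_uring(path); // runs now, at call time
    Box::pin(async move { completion.await }) // only the wait is deferred
}

// Placeholder for "push an SQE / send the operation to the worker thread and
// return a handle that resolves when the CQE arrives".
fn submit_to_uring(path: String) -> impl Future<Output = Chunk> {
    println!("submitted read for {path}"); // side effect happens immediately
    std::future::ready(Chunk)
}
```

The caller can then call `get` for every path first, collect the returned futures, and await them all together (e.g. with something like `futures::future::join_all`), with every read already submitted.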
New flamegraph:
I just built an AMD Epyc workstation with two PCIe5 SSDs: one for the OS, one just for benchmarking.
Running `cargo bench` gives a nasty surprise!
```
     Running benches/get.rs (target/release/deps/get-766f6439cf0e228e)
get_1000_whole_files/uring_get
                        time:   [118.45 ms 124.76 ms 131.25 ms]
                        thrpt:  [1.8601 GiB/s 1.9568 GiB/s 2.0611 GiB/s]
                 change:
                        time:   [-9.4570% -3.5549% +2.7355%] (p = 0.27 > 0.05)
                        thrpt:  [-2.6627% +3.6859% +10.445%]
                        No change in performance detected.
Benchmarking get_1000_whole_files/local_file_system_get: Warming up for 2.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.2s or enable flat sampling.
get_1000_whole_files/local_file_system_get
                        time:   [31.853 ms 32.297 ms 33.342 ms]
                        thrpt:  [7.3223 GiB/s 7.5592 GiB/s 7.6647 GiB/s]
                 change:
                        time:   [-10.750% +0.5216% +13.785%] (p = 0.95 > 0.05)
                        thrpt:  [-12.115% -0.5189% +12.045%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
get_16384_bytes_from_1000_files/uring_get_range
                        time:   [22.219 ms 22.424 ms 22.736 ms]
                        thrpt:  [687.24 MiB/s 696.79 MiB/s 703.24 MiB/s]
                 change:
                        time:   [-3.7240% -0.8606% +1.7768%] (p = 0.59 > 0.05)
                        thrpt:  [-1.7457% +0.8681% +3.8681%]
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild
get_16384_bytes_from_1000_files/local_file_system_get_range
                        time:   [8.5492 ms 8.6767 ms 8.9215 ms]
                        thrpt:  [1.7103 GiB/s 1.7586 GiB/s 1.7848 GiB/s]
                 change:
                        time:   [-13.011% +1.2443% +18.663%] (p = 0.89 > 0.05)
                        thrpt:  [-15.728% -1.2291% +14.957%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
```
My io_uring code is quite a bit slower than the equivalent `object_store` code.

AFAICT, a problem with my io_uring code is that it fails to keep the OS IO queue topped up. Running `iostat -xm --pretty 1 -p /dev/nvme0n1` (and looking at the `aqu-sz` column) shows that, when the `get_1000_whole_files/uring_get` benchmark is running, the IO queue depth is only between 1 and 2. But when the `object_store` bench is running, the IO queue depth is more like 120!

I think the solution is to stop using fixed files in io_uring, which would then allow me to have more than 16 files in flight at any one time. And/or perhaps the solution is #75.
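For reference, here's a minimal sketch of submitting a read against a plain (unregistered) file descriptor with the tokio-rs `io_uring` crate (the path is a placeholder; treat this as illustrative rather than LSIO's actual code). With registered ("fixed") files you pass `types::Fixed(slot)` instead of `types::Fd`, and the number of registered slots caps how many distinct files can be in flight at once.

```rust
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, types, IoUring};

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(64)?;
    let file = std::fs::File::open("/path/to/some/file")?;
    let mut buf = vec![0u8; 256 * 1024];

    // Build a read SQE against the unregistered fd and push it onto the
    // submission queue.
    let read_e = opcode::Read::new(types::Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as _)
        .build()
        .user_data(42);
    unsafe { ring.submission().push(&read_e).expect("submission queue is full") };

    // Tell the kernel about the new SQE and wait for one completion.
    ring.submit_and_wait(1)?;
    let cqe = ring.completion().next().expect("completion queue is empty");
    assert!(cqe.result() >= 0, "read failed: {}", cqe.result());
    Ok(())
}
```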
That said, `fio` still achieves 5.3 MiB/s with an IO depth of 1.
`fio` experiments:

Base config: `nrfiles=1000, filesize=256Ki, iodepth=1, ioengine=io_uring, readwrite=read, direct=0, blocksize=256Ki`: 1.5 GiB/s

- `direct=0, iodepth=16`: 1.8 GiB/s (and `aqu-sz` stays around 0.4)
- `direct=0, iodepth=128`: 1.8 GiB/s (and `aqu-sz` stays around 0.4)
- `direct=0, iodepth=16, fixedbufs=0, registerfiles=1, sqthreadpoll=0`: 1.8 GiB/s (`aqu-sz` gets to 0.4)
- `direct=0, iodepth=16, fixedbufs=0, registerfiles=1, sqthreadpoll=1`: 2.4 GiB/s (`aqu-sz` gets to 1.2)
- `direct=0, iodepth=16, fixedbufs=1, registerfiles=1, sqthreadpoll=1`: 2.4 GiB/s (`aqu-sz` gets to 1.2)
- `direct=0, iodepth=16, numjobs=4`: 6.8 GiB/s
- `direct=1`: 4.1 GiB/s (`aqu-sz` 1.05)
- `direct=1, iodepth=16`: 9.0 GiB/s (`aqu-sz` 17)
- `direct=1, iodepth=16, numjobs=4`: 11.2 GiB/s (`aqu-sz` 116)
- `direct=1, iodepth=16, sqthread_poll=1`: 10.7 GiB/s
- `direct=1, iodepth=16, fixedbufs=1`: 10.1 GiB/s
- `direct=1, iodepth=16, fixedbufs=1, registerfiles=1`: 10.9 GiB/s
- `direct=1, iodepth=16, fixedbufs=1, registerfiles=1, sqthreadpoll=1`: 10.9 GiB/s
- `direct=1, iodepth=16, fixedbufs=1, registerfiles=1, sqthreadpoll=1, numjobs=4`: 11.2 GiB/s (12 GB/s)

Base config: `nrfiles=1000, filesize=256Ki, iodepth=1, ioengine=io_uring, readwrite=randread, direct=0, blocksize=4Ki`: 86 MiB/s

- `direct=1`: 89 MiB/s
- `direct=1, iodepth=16`: 758 MiB/s
- `direct=1, iodepth=128`: 769 MiB/s
- `direct=1, iodepth=128, fixedbufs=1`: 828 MiB/s
- `direct=1, iodepth=128, fixedbufs=1, registerfiles=1`: 847 MiB/s
- `direct=1, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1`: 1.3 GiB/s
- `direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=1`: 1.1 GiB/s
- `direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=0`: 781 MiB/s
- `direct=0, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1`: 693 MiB/s
- `direct=0, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=1`: 691 MiB/s
- `direct=1, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1, numjobs=6`: 5.4 GiB/s
- `direct=1, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1, numjobs=8`: 6.0 GiB/s
- `direct=0, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1, numjobs=8`: 4.0 GiB/s
- `direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=1, numjobs=8`: 5.2 GiB/s
- `direct=1, iodepth=128, fixedbufs=0, registerfiles=0, sqthreadpoll=0, numjobs=8`: 4.8 GiB/s
- `direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=0, numjobs=8`: 5.7 GiB/s
- `direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=0, numjobs=12`: 5.9 GiB/s
- `direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=1, numjobs=8`: 6.0 GiB/s
- `direct=1, iodepth=128, fixedbufs=0, registerfiles=0, sqthreadpoll=1, numjobs=8`: 6.0 GiB/s (but I think setting `sqthreadpoll=1` might enable `registerfiles`?)
- `direct=0, iodepth=128, fixedbufs=0, registerfiles=0, sqthreadpoll=1, numjobs=8`: 4.5 GiB/s

Default ioengine (which is what `object_store` uses):

Base config: `nrfiles=1000, filesize=256Ki, readwrite=read, direct=0, blocksize=256Ki`: 1.9 GiB/s

- `direct=1`: 4.0 GiB/s (`aqu-sz` hovers around 1)
- `direct=1, numjobs=8`: 8.6 GiB/s (`aqu-sz` hovers around 14)

Base config: `nrfiles=1000, filesize=256Ki, readwrite=randread, direct=0, blocksize=4Ki`: 87.8 MiB/s

- `direct=1`: 91.2 MiB/s
- `direct=1, numjobs=8`: 638 MiB/s

Conclusions from these `fio` experiments: io_uring can go faster than the default ioengine. But we have to use `direct=1`. And multiple workers help! We can achieve max performance (for both `read` and `randread`) by using `direct=1, sqthreadpoll=1, numjobs=8`.

For sequential reading, io_uring can max out the SSD's bandwidth and achieves 11.2 GiB/s (12 GB/s), versus 8.6 GiB/s for the default ioengine (a 1.3x improvement).

For random reading of 4 KiB chunks, io_uring achieves 6 GiB/s (1.5 million IOPS) versus 638 MiB/s for the default ioengine (a 9.4x improvement!).
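For reference, a command along these lines should reproduce the best-performing sequential-read config above (11.2 GiB/s); it's modelled on the `fio` invocation used later in this thread, with the directory as a placeholder (note that fio spells the polling option `--sqthread_poll`):

```
fio --name=seqread --ioengine=io_uring --rw=read --nrfiles=1000 \
    --filesize=262144 --bs=262144 --direct=1 --iodepth=16 \
    --fixedbufs=1 --registerfiles=1 --sqthread_poll=1 --numjobs=4 \
    --directory=/path/to/benchmark/files
```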
`object_store` is pretty fast at IO (about 7.5 GiB/s on my PCIe 5 SSD). True, it doesn't fully saturate the hardware, but it's still pretty fast. Perhaps I should shift focus to parallel decompression and an MVP Zarr implementation (in Rust). That would also have the big advantage that I can benchmark exactly what I most care about: speed at reading Zarrs.
So, I think my plan would be something like this:

- An `lsio-uring` crate (or similar name).
- `lsio-zarr` (an MVP Zarr front-end), and `lsio-codecs` (which provides async compression / decompression, and submits the computational work to rayon; see the sketch below). Use `object_store` as the storage backend.
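A minimal sketch of what the `lsio-codecs` idea could look like (an assumed design with hypothetical names; it also assumes the `rayon` and `tokio` crates): the CPU-bound codec runs on rayon's thread pool, and the async caller only waits on a oneshot channel for the result.

```rust
use tokio::sync::oneshot;

async fn decompress_async(compressed: Vec<u8>) -> Vec<u8> {
    let (tx, rx) = oneshot::channel();
    // Hand the CPU-bound work to rayon so it never blocks the async runtime.
    rayon::spawn(move || {
        let decompressed = decompress(&compressed); // placeholder codec call
        let _ = tx.send(decompressed); // ignore error if the receiver was dropped
    });
    rx.await.expect("rayon worker dropped the sender")
}

// Stand-in for a real Zarr chunk codec (e.g. blosc or zstd).
fn decompress(compressed: &[u8]) -> Vec<u8> {
    compressed.to_vec()
}
```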
uring performance is looking much better now I've implemented `O_DIRECT`! I'm optimistic that uring will substantially beat `object_store` once we implement #93 and #61.
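For context, here's a minimal sketch (not LSIO's actual code) of opening a file with `O_DIRECT` on Linux. Note that `O_DIRECT` reads also need buffers, offsets and lengths aligned to the device's logical block size (typically 512 bytes or 4 KiB).

```rust
use std::fs::{File, OpenOptions};
use std::os::unix::fs::OpenOptionsExt;

/// Open `path` for reading with the Linux page cache bypassed.
/// Requires the `libc` crate for the O_DIRECT flag.
fn open_direct(path: &str) -> std::io::Result<File> {
    OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open(path)
}
```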
Finally benchmarking again! In PR #136, running on my Intel NUC:

- `cargo run --release -- --filesize=41214400 --nrfiles=100 --blocksize=262144 --nr-worker-threads=1` gets 1,649 MiB/s! (Faster than `fio`!)
- `fio` gets 1,210 MiB/s (with a single worker thread): `fio --name=foo --rw=read --nrfiles=100 --filesize=4121440 --bs=262144 --direct=1 --iodepth=64 --ioengine=io_uring --directory=/tmp`
- More threads make it go SLOWER on my NUC! For example, 4 threads (with lsio_bench) gets 1,067 MiB/s (but I need to test on my workstation...). `fio` also goes a bit slower on my NUC with multiple tasks.
- `iostat -xm -t 1 -p nvme0n1` shows excellent utilisation and a long queue depth (`aqu-sz`).
Woo! Success! My new lsio code gets 10.755 GiB/s on my EPYC workstation (with a T700 PCIe5 SSD), at commit 1aa2f9150e182334e451b01e20d9d7b60a14de70.

That's faster than my old io_uring code. And faster than `object_store`! It's not quite as fast as the fastest `fio` config. But pretty close!

`jack@jack-epyc-workstation:~/dev/rust/light-speed-io/crates/lsio_bench$ cargo run --release -- --filesize=41214400 --nrfiles=100 --blocksize=262144 --nr-worker-threads=8 --directory=/mnt/t700-2tb/lsio_bench`

Ha! My lsio code actually gets 11.2 GiB/s when using 500 files! And those read speeds are confirmed by `iostat -xm --pretty 1 -p /dev/nvme0n1`!
Ultimate aim: perform at least as well as `fio` when reading from a local SSD :slightly_smiling_face:

Tools

- `cargo bench`. Then open the `index.html` in `light-speed-io/target/criterion/<GROUP>/<BENCH>/report/index.html`.
- `cargo install flamegraph`
- `sudo apt install linux-tools-common linux-tools-generic linux-tools-$(uname -r)`
- `echo "0" | sudo tee '/proc/sys/kernel/perf_event_paranoid' | sudo tee '/proc/sys/kernel/kptr_restrict'`
- `cargo flamegraph --bench io_uring_local`

Benchmark workload

- `load_1000_files`: Each file is 262,144 bytes. Each file was created by `fio`. We measure the total time to load all 1,000 files. The Linux page cache is flushed before each run (`vmtouch -e </path/to/files/>`).

Plan

- Benchmark with `criterion`.
- Match `fio`'s runtime!

I'll use milestone 2 to keep track of relevant issues, and to prioritise issues.

fio configuration