Performance tracking issue (reading from a local SSD)

Ultimate aim: perform at least as well as fio when reading from a local SSD :slightly_smiling_face:.

Tools

Criterion.rs: "a statistics-driven micro-benchmarking tool". Of relevance for light-speed-io, Criterion can be told to do some setup outside of the main benchmarking code (e.g. clearing the Linux page cache); and can record throughput in bytes per second.
- Usage:
  - cargo bench
  - Then open index.html in light-speed-io/target/criterion/<GROUP>/<BENCH>/report/index.html
[cargo-]flamegraph: "A Rust-powered flamegraph generator with additional support for Cargo projects!"
- Setup:
  - cargo install flamegraph
  - sudo apt install linux-tools-common linux-tools-generic linux-tools-uname -r
  - echo "0" | sudo tee '/proc/sys/kernel/perf_event_paranoid' | sudo tee '/proc/sys/kernel/kptr_restrict'
- Usage:
  - cargo flamegraph --bench io_uring_local

Benchmark workload

load_1000_files: Each file is 262,144 bytes. Each file was created by fio. We measure the total time to load all 1,000 files. The Linux page cache is flushed before each run (vmtouch -e </path/to/files/>).

Plan

Use the flamegraph to identify hotspots.
Attempt to optimise those hotspots.
Measure runtimes with criterion.
Repeat until the runtime is comparable to fio's runtime!

I'll use milestone 2 to keep track of relevant issues, and to prioritise issues.

`fio` configuration

[global]
nrfiles=1000
filesize=256k
direct=1
iodepth=16
ioengine=io_uring
bs=128k
numjobs=1

[reader1]
rw=read
directory=/home/jack/temp/fio

Performance of the un-optimised code

This is for the code in main at commit ef8c7b7d564ddf1dd9ef68240dc52ebef228d4a0.

flamegraph

Some conclusions:

The majority of time (the wide "mountain" in the middle of this flamegraph) is spent in light_speed_io::io_uring_local::worker_thread_func. In turn, the functions which make up most of the time in worker_thread_func are (in order, the longest-running first):

io_cqring_wait (this is the longest-running function by some margin)
light_speed_io::Operation::to_iouring_entry.
io_submit_sqes

So, I think the priority is #49.

If we zoom into light_speed_io::Operation::to_iouring_entry, we can see the relative importance of these improvements:

Big breakthrough: Today, I figured out that I was doing something stupid! TL;DR: We're now getting throughput up to 960 MiB/s (up from about 220 MiB/s!) (i.e. better than a 4x speedup!).

LSIO now compares very favorably against fio and object_store (for reading 1,000 files, each file is 256 kB, on my old Intel NUC box). fio gets, at best, about 900 MiB/s. object_store::LocalFileSystem::get gets about 250 MiB/s! :slightly_smiling_face:

What I had forgotten is that, in Rust, an async function isn't polled until we call await on the Future returned by the function. So we weren't actually submitting multiple reads concurrently! There was only ever one operation in flight in io_uring at any one time.

This was fixed by changing async fn get to fn get, and returning a Box::pin(async {...}).

New flamegraph:

flamegraph

First results running LSIO on my new AMD Epyc workstation

I just built an AMD Epyc workstation with two PCIe5 SSDs: one for the OS, one just for benchmarking.

Running cargo bench gives a nasty surprise!

     Running benches/get.rs (target/release/deps/get-766f6439cf0e228e)
get_1000_whole_files/uring_get
                        time:   [118.45 ms 124.76 ms 131.25 ms]
                        thrpt:  [1.8601 GiB/s 1.9568 GiB/s 2.0611 GiB/s]
                 change:
                        time:   [-9.4570% -3.5549% +2.7355%] (p = 0.27 > 0.05)
                        thrpt:  [-2.6627% +3.6859% +10.445%]
                        No change in performance detected.
Benchmarking get_1000_whole_files/local_file_system_get: Warming up for 2.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.2s or enable flat sampling.
get_1000_whole_files/local_file_system_get
                        time:   [31.853 ms 32.297 ms 33.342 ms]
                        thrpt:  [7.3223 GiB/s 7.5592 GiB/s 7.6647 GiB/s]
                 change:
                        time:   [-10.750% +0.5216% +13.785%] (p = 0.95 > 0.05)
                        thrpt:  [-12.115% -0.5189% +12.045%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

get_16384_bytes_from_1000_files/uring_get_range
                        time:   [22.219 ms 22.424 ms 22.736 ms]
                        thrpt:  [687.24 MiB/s 696.79 MiB/s 703.24 MiB/s]
                 change:
                        time:   [-3.7240% -0.8606% +1.7768%] (p = 0.59 > 0.05)
                        thrpt:  [-1.7457% +0.8681% +3.8681%]
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild
get_16384_bytes_from_1000_files/local_file_system_get_range
                        time:   [8.5492 ms 8.6767 ms 8.9215 ms]
                        thrpt:  [1.7103 GiB/s 1.7586 GiB/s 1.7848 GiB/s]
                 change:
                        time:   [-13.011% +1.2443% +18.663%] (p = 0.89 > 0.05)
                        thrpt:  [-15.728% -1.2291% +14.957%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

My io_uring code is quite a bit slower than the equivalent object_store code.

Why is my io_uring code slower? And how to speed up my io_uring code?

AFAICT, a problem with my io_uring code is that it fails to keep the OS IO queue topped up. Running iostat -xm --pretty 1 -p /dev/nvme0n1 (and looking at the aqu-sz column) shows that, when the benchmark get_1000_whole_files/uring_get is running, the IO queue is only between 1 and 2. But when the object_store bench is running, the IO queue is more like 120!

I think the solution is to stop using fixed files in io_uring, which then allows me to have more than 16 files in flight at any one time. And/or perhaps the solution is #75.

That said, fio still achieves 5.3 MiB/s with an IO depth of 1.

`fio` experiments:

io_uring

Sequentially reading 1,000 files

Base config: nrfiles=1000, filesize=256Ki, iodepth=1, ioengine=io_uring, readwrite=read, direct=0, blocksize=256Ki: 1.5 GiB/s

direct=0, iodepth=16: 1.8 GiB/s (and aqu-sz stays around 0.4)
direct=0, iodepth=128: 1.8 GiB/s (and aqu-sz stays around 0.4)
direct=0, iodepth=16, fixedbufs=0, registerfiles=1, sqthreadpoll=0: 1.8 GiB/s (aqu-sz gets to 0.4)
direct=0, iodepth=16, fixedbufs=0, registerfiles=1, sqthreadpoll=1: 2.4 GiB/s (aqu-sz gets to 1.2)
direct=0, iodepth=16, fixedbufs=1, registerfiles=1, sqthreadpoll=1: 2.4 GiB/s (aqu-sz gets to 1.2)
direct=0, iodepth=16, numjobs=4: 6.8 GiB/s
direct=1: 4.1 GiB/s (aqu-sz 1.05)
direct=1, iodepth=16: 9.0 GiB/s (aqu-sz 17)
direct=1, iodepth=16, numjobs=4: 11.2 GiB/s (aqu-sz 116)
direct=1, iodepth=16, sqthread_poll=1: 10.7 GiB/s
direct=1, iodepth=16, fixedbufs=1: 10.1 GiB/s
direct=1, iodepth=16, fixedbufs=1, registerfiles=1: 10.9 GiB/s
direct=1, iodepth=16, fixedbufs=1, registerfiles=1, sqthreadpoll=1: 10.9 GiB/s
direct=1, iodepth=16, fixedbufs=1, registerfiles=1, sqthreadpoll=1, numjobs=4: 11.2 GiB/s (12 GB/s)

Randread 4KiB chunks from 1,000 files

Base config: nrfiles=1000, filesize=256Ki, iodepth=1, ioengine=io_uring, readwrite=randread, direct=0, blocksize=4Ki: 86 MiB/s

direct=1: 89 MiB/s
direct=1, iodepth=16: 758 MiB/s
direct=1, iodepth=128: 769 MiB/s
direct=1, iodepth=128, fixedbufs=1: 828 MiB/s
direct=1, iodepth=128, fixedbufs=1, registerfiles=1: 847 MiB/s
direct=1, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1: 1.3 GiB/s
direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=1: 1.1 GiB/s
direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=0: 781 MiB/s
direct=0, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1: 693 MiB/s
direct=0, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=1: 691 MiB/s
direct=1, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1, numjobs=6: 5.4 GiB/s
direct=1, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1, numjobs=8: 6.0 GiB/s
direct=0, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1, numjobs=8: 4.0 GiB/s
direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=1, numjobs=8: 5.2 GiB/s
direct=1, iodepth=128, fixedbufs=0, registerfiles=0, sqthreadpoll=0, numjobs=8: 4.8 GiB/s
direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=0, numjobs=8: 5.7 GiB/s
direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=0, numjobs=12: 5.9 GiB/s
direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=1, numjobs=8: 6.0 GiB/s
direct=1, iodepth=128, fixedbufs=0, registerfiles=0, sqthreadpoll=1, numjobs=8: 6.0 GiB/s (but I think setting sqthreadpoll=1 might enable registerfiles?)
direct=0, iodepth=128, fixedbufs=0, registerfiles=0, sqthreadpoll=1, numjobs=8: 4.5 GiB/s

default ioengine (supposedly what `object_store` uses)

Sequential read 1,000 files

Base config: nrfiles=1000, filesize=256Ki, readwrite=read, direct=0, blocksize=256Ki:1.9 GiB/s

direct=1: 4.0 GiB/s (aqu-sz hovers around 1)
direct=1, numjobs=8: 8.6 GiB/s (aqu-sz hovers around 14)

Randread 4KiB chunks from 1,000 files

Base config: nrfiles=1000, filesize=256Ki, readwrite=randread, direct=0, blocksize=4Ki: 87.8 MiB/s

direct=1: 91.2 MiB/s
direct=1, numjobs=8: 638 MiB/s

Conclusions of `fio` experiments:

io_uring can go faster than the default ioengine. But we have to use direct=1. And multiple workers help! We can achieve max performance (for both read and randread) by using direct=1, sqthreadpoll=1, numjobs=8.

For sequential reading, io_uring can max-out the SSD's bandwidth and achieves 11.2 GiB/s (12 GB/s), versus 8.6 GiB/s for the default ioengine (a 1.3x improvement).

For random reading 4KiB chunks, io_uring achieves 6 GiB/s (1.5 million IOPs) versus 638 MiB/s for the default ioengine (a 9.4x improvement!).

Pause working on io_uring and, instead, focus on building a full Zarr implementation with parallel decompression?

object_store is pretty fast at IO (about 7.5 GiB/s on my PCIe 5 SSD). True, it doesn't fully saturate the hardware, but it's still pretty fast. Perhaps I should shift focus, and focus on parallel decompression and an MVP Zarr implementation (in Rust). That would also have the big advantage that I can benchmark exactly what I most care about: speed at reading Zarrs.

So, I think my plan would be something like this:

Pause work on io_uring
Make sure I correctly categorise & describe github issues relating to io_uring, so I can pick it up again later. io_uring definitely appears necessary to get full speed, especially for random reads.
- Create a "component" field for each item in the project, and set all these existing issues to the io_uring component.
94
Move my io_uring code into an lsio-uring crate (or similar name).
Plan two new crates (within the LSIO repo): lsio-zarr (and MVP Zarr front-end), and lsio-codecs (which provides async compression / decompression, and submits the computational work to rayon. Use object_store as the storage backend.
Start sketching out the interfaces between these crates. Think about use-cases like converting GRIB to Zarr.

uring performance is looking much better now I've implemented O_DIRECT! I'm optimistic that uring will substantially beat object_store once we implement #93 and #61

Finally benchmarking again!

In PR #136:

running on my Intel NUC:

cargo run --release -- --filesize=41214400 --nrfiles=100 --blocksize=262144 --nr-worker-threads=1

gets 1,649 MiB/sec! (faster than fio!)

fio gets 1,210 MiB/s (with a single worker thread): fio --name=foo --rw=read --nrfiles=100 --filesize=4121440 --bs=262144 --direct=1 --iodepth=64 --ioengine=io_uring --directory=/tmp

More threads makes it go SLOWER on my NUC! For example, 4 threads (with lsio_bench) gets 1,067 MiB/s (But I need to test on my workstation...). fio also goes a bit slower on my NUC with multiple tasks.

iostat -xm -t 1 -p nvme0n1 shows excellent utilisation and long queue depth (aqu-sz).

Woo! Success! My new lsio code gets 10.755 GiB/sec on my EPYC workstation (with a T700 PCIe5 SSD). Commit 1aa2f9150e182334e451b01e20d9d7b60a14de70

That's faster than my old io_uring code. And faster than object_store! It's not quite as fast as the fastest fio config. But pretty close!

jack@jack-epyc-workstation:~/dev/rust/light-speed-io/crates/lsio_bench$ cargo run --release -- --filesize=41214400 --nrfiles=100 --blocksize=262144 --nr-worker-threads=8 --directory=/mnt/t700-2tb/lsio_bench

Ha! My lsio code actually gets 11.2 GiB/s when using 500 files! And those read speeds are confirmed by iostat -xm --pretty 1 -p /dev/nvme0n1!

JackKelly / light-speed-io