JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Scaling IO with number of threads to achieve high throughput #48232

Open vustef opened 1 year ago

vustef commented 1 year ago

I benchmarked Julia network IO by issuing 1 MB writes to cloud blob storage (Azure Blob Storage, from an Azure VM), with up to N requests in flight concurrently, each request running in its own Julia task. N was mostly varied between 256 and 1024, but it wasn't the most important factor. The stack was based on HTTP.jl + CloudStore/CloudBase + OpenSSL.
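
For reference, the concurrency pattern was roughly the following (a minimal sketch, not the actual harness: the URL and auth are placeholders, and the real benchmark goes through CloudStore/CloudBase rather than raw HTTP.jl calls):

```julia
using HTTP

# Hedged sketch of the benchmark shape: upload `n` 1 MB blocks concurrently,
# one Julia task per request. `url_for` and the account name are placeholders.
function upload_blocks(n::Integer; url_for = i -> "https://<account>.blob.core.windows.net/bench/block-$i")
    payload = rand(UInt8, 1 << 20)   # 1 MB per request
    @sync for i in 1:n
        Threads.@spawn HTTP.put(url_for(i),
                                ["x-ms-blob-type" => "BlockBlob",
                                 "Content-Type" => "application/octet-stream"],
                                payload)
    end
end

upload_blocks(256)   # N was varied between 256 and 1024
```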

The throughput we got with a single thread was ~700 MB/s. With ~8 threads that went up to 1000-1100 MB/s. Increasing to 32 threads brought throughput back down to ~700 MB/s, but compared to the single-threaded run, CPU usage was almost 3200%, meaning all cores were busy. I also benchmarked Go on the same machine and was able to roughly saturate the VM's NIC (100k Mbps ~ 12500 MB/s).

Looking at the profile, it showed all threads busy-waiting on the iolock mutex. Just turning that into a blocking wait instead of a busy wait would reduce CPU usage, but it wouldn't increase throughput.

The obvious culprit is that libuv is not thread-safe, which is why the global iolock is required. While a single event loop sounds reasonable, with the throughput of modern storage (especially cloud storage, since the TCP stack is more CPU-demanding than disk IO), it's not possible to saturate it from a single CPU core even with async IO, hence the need to scale IO across multiple CPUs.
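
Concretely, every libuv operation in Base is bracketed by that single global lock, so only one thread can be inside libuv at a time no matter how many tasks are writing. A simplified sketch of the shape (not the actual Base source):

```julia
# Simplified sketch of the write path shape in Base (not the actual source):
# one global lock, shared by all streams and all threads, around every uv_* call.
function locked_write(stream, data)
    Base.iolock_begin()      # the global iolock everyone contends on
    try
        # ... queue a uv_write on `stream` and wait for its completion callback ...
    finally
        Base.iolock_end()
    end
end
```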

Looking at alternatives, I found this issue: https://github.com/libuv/libuv/issues/1595#issuecomment-340119038. That comment in particular was the starting point of a prototype I tried to hack up. With the prototype, which unlocks around the syscalls and re-locks afterwards (I had to keep the libuv stream locked, but was able to release the global iolock), throughput went up to 4000 MB/s with 8-16 cores. But adding more cores still didn't scale throughput further; for that, it seems even finer-grained locking is required.
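
In spirit, the prototype keeps a per-stream lock across the operation but releases the global iolock around the blocking part, so other threads can drive their own streams in parallel (hedged sketch, not the actual patch):

```julia
# Hedged sketch of the prototype idea (not the actual patch).
function finer_grained_write(stream, data)
    lock(stream.lock)            # keep the libuv stream itself locked
    try
        Base.iolock_begin()
        # ... prepare the uv_write request for `stream` ...
        Base.iolock_end()        # drop the global lock before the syscall runs
        # ... the write proceeds; the completion callback re-takes the iolock ...
    finally
        unlock(stream.lock)
    end
end
```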

While going deeper into finer-grained locking is one possible fix, we're thinking about other alternatives:

Please let me know what you think, and whether this has perhaps already been considered. Thanks!

vustef commented 1 year ago

Here's a repro: IOAzureBench.zip

adnan-alhomssi commented 1 year ago

The challenge with io_uring is how we distribute and access the IO rings and how we poll the completion queues. The prototype we started with uses one IO ring per Julia thread and spawns a polling task that continuously polls every thread's ring for completions. The polling task does not have to synchronize with other tasks: it uses io_uring_peek_batch_cqe in user space to check whether any IO has already completed (no syscall). The obvious disadvantage is that it keeps a CPU core busy the whole time; it would be better if we could integrate polling into Julia's scheduling.
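
To make that structure concrete, here is a Julia-level sketch; `ring_for_thread`, `peek_batch_cqe`, and `notify_completion` are hypothetical wrappers around liburing, not an existing package:

```julia
# Hypothetical sketch of the prototype's layout: one ring per thread plus a single
# polling task. The liburing calls are stand-ins; only the task/ring structure matters.
const RINGS = [ring_for_thread(tid) for tid in 1:Threads.nthreads()]

function poll_completions()
    while true
        for ring in RINGS
            for cqe in peek_batch_cqe(ring, 64)   # user-space peek, no syscall
                notify_completion(cqe)            # wake the task that submitted this IO
            end
        end
        yield()   # still burns a core; ideally this would be folded into the scheduler
    end
end

Threads.@spawn poll_completions()
```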

Tasks submit their IO without any user-space locking, but each submission has to make an io_uring_enter syscall, which is CPU-expensive. With io_uring we can ask the kernel to spawn a thread that polls all of our submission queues for us, which relieves us of the syscall (unless the kernel thread goes idle), but this can waste even more CPU.
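
For reference, that kernel-side mode is io_uring's `IORING_SETUP_SQPOLL` flag; in a hypothetical wrapper it would look something like this:

```julia
# Hypothetical wrapper: `setup_ring` is not a real API, it stands in for
# io_uring_queue_init_params with IORING_SETUP_SQPOLL set. The kernel's SQ poll
# thread drains submissions, so tasks avoid an io_uring_enter syscall per IO
# (unless the poll thread has gone idle and must be woken), at the cost of a
# kernel thread that can spin on an otherwise idle core.
ring = setup_ring(entries = 256, sqpoll = true, sq_thread_idle_ms = 2000)
```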

As already mentioned, we thought about surgically replacing the write/read syscall paths with an io_uring implementation. For disk/SSD it is pretty simple; for network it is more complicated because of the state involved. I guess replacing Base.unsafe_write and Base.start_reading with an io_uring-based implementation should do it, but I'm not sure and I'd like to hear your opinion on this.
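
One way to picture that replacement (hedged sketch: `URingStream` and `uring_write` are made-up names, and the real change would have to happen inside Base's existing stream types rather than a new wrapper type):

```julia
# Hypothetical sketch of the "surgical replacement": route a stream's write path
# through an io_uring submission instead of libuv's uv_write.
struct URingStream <: IO
    fd::RawFD
    ring::Any    # the per-thread ring this stream submits to
end

function Base.unsafe_write(s::URingStream, p::Ptr{UInt8}, n::UInt)
    uring_write(s.ring, s.fd, p, n)   # made-up helper: submit the SQE, suspend until the CQE
    return Int(n)
end
```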

Moelf commented 1 year ago

does this benefit us (eventually)?

vustef commented 1 year ago

I am not sure. The libuv bottleneck in Julia's usage is the global IO lock, not that the async IO mechanism isn't good enough. Although io_uring might improve latency and thus the time spent holding the IO lock, we would still be limited by it when scaling the submission of IO requests, unless the NIC bottleneck is hit first.