apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.33k stars 684 forks

object_store: Using `io_uring`? #4631

Open JackKelly opened 11 months ago

JackKelly commented 11 months ago

**Which part is this question about?** object_store's code.

**Describe your question** For Zarr, we may want to read on the order of 1 million file ranges per second (from a single machine). It's possible that the only way to achieve this performance will be to use io_uring to send many IO operations to the Linux kernel using just a single system call.

Would object_store ever consider implementing an async io_uring backend for get_ranges? (I may be able to write the PR, with some hand-holding!)

**Additional context** io_uring is a newish Linux kernel interface that allows many IO operations - including local file operations and network operations - to be requested with a single system call, without unnecessary memory copying. Some database folks seem pretty excited about io_uring. Some benchmarks show that io_uring can deliver almost 20x more IOPS for random reads than the kernel's previous asynchronous IO interfaces.

tustvold commented 11 months ago

I would probably want to see some numbers and go from there. It isn't immediately obvious to me that io_uring would be beneficial for reading immutable chunks of data from disk, especially if the workload is doing any non-trivial computation alongside. The major argument I've heard is for systems doing mixed IO, or with custom buffer pooling, neither of which applies to object_store.

JackKelly commented 11 months ago

OK, cool, that's good to know. Thank you for your quick reply. No worries at all if object_store isn't the right place for this functionality.

Just to make sure... please let me give a little more detail about what I'd ultimately like to do...

First, some context: Zarr has been around for a while. As you probably know, the main idea behind Zarr is very simple: We take a large multi-dimensional array and save it to disk as multi-dimensional, compressed chunks. The user can request an arbitrary slice of the overall array, and Zarr will load the appropriate chunks, decompress them, and merge them into a single ndarray. Zarr-Python, the main implementation of Zarr, is currently single-threaded.

We're now exploring ways to use multiple CPU cores in parallel to load, decompress, and copy each decompressed Zarr chunk into a "final" array, as fast as possible. (Many Zarr users would benefit if Zarr could max out the hardware.)

If we were to implement our own IO backend using io_uring, we might first submit our queue of, say, 1 million read operations to the kernel. Then we'd have a thread pool (or perhaps an async executor) with roughly as many threads as there are logical CPU cores. Each worker thread would run a loop which starts by grabbing data from the io_uring completion queue, immediately decompresses the chunk, and then - while the decompressed data is still in the CPU cache - writes the decompressed chunk into the final array in RAM. So we'd need the load, decompression, and copy steps to happen in very quick succession, ideally within a single thread per chunk (to make the code as "cache-friendly" as possible).

Would you say that object_store isn't the right place to implement this batched, parallel "load-decompress-copy" functionality? Even if object_store implemented an io_uring backend, my guess is that it wouldn't be appropriate to modify object_store so that processing can be done on chunk n-1 whilst chunk n is still being loaded. (If that makes sense?!) Instead, we'd first call object_store's get_ranges function and then await the Future it returns, which only resolves once all the chunks have been loaded. So we couldn't decompress chunk n-1 whilst simultaneously loading chunk n. Is that right?

tustvold commented 11 months ago

I would suggest first getting something simple working with tokio::spawn (or some other threadpool abstraction) and the existing APIs, and then going from there. I would recommend against reaching for solutions like io_uring until you have confirmed that simpler approaches are insufficient; from what I understand of your use case, I'm not sure io_uring would yield tangible benefits.

JackKelly commented 5 months ago

Just a quick update... I am hoping to provide some benchmarks within a few months. More details here: https://github.com/JackKelly/light-speed-io/issues/27