datafusion-contrib / arrow-zarr

Implementation of Zarr file format in Rust
Apache License 2.0
10 stars 1 forks source link

Try to optimize the file opener for local and remote reads #14

Closed tshauck closed 6 months ago

tshauck commented 7 months ago

It may be possible to improve performance by only using the async reader in the case of remote files. So, this ticket would be to do a spike for that in the context of the already implemented zarr readers.

For context, an object store get returns https://docs.rs/object_store/latest/object_store/enum.GetResultPayload.html, which is an enum with two variants depending on if the file is local or remote.

maximedion2 commented 7 months ago

So this here is to start using the object store for local files as well right?

maximedion2 commented 7 months ago

Ah, no I see, you mean that we can use the object store to check if a file is local or remote, in the FileOpener, and if it's local, we use the sync reader instead of the async reader? Yeah I think that can be easily done, I'll take a look.

Btw, do you have have any insights into why reading files asynchronously might not provide performance benefits? There's that guy, Jack Kelly, I mentioned him before, who's looking into more efficient async reads for local files, not just for Zarr (though I think that's the main focus), he mentioned that in general reading files asynchronously (without io-uring, the mechanism he's looking into) is not efficient, is that what we're seeing here?

tshauck commented 7 months ago

Apologies for the delay, been busy w/ $JOB... I haven't done any benchmarking here, but I think what Jack Kelly mentions generally right (he probably knows a lot more than me here 😄). E.g. consider seek operations, a remote file requires additional requests to "seek" to a position or maybe you can optimize the call to get a larger byte range, etc, but certainly more complex than local seeks.

maximedion2 commented 7 months ago

No worries, like I said in general I don't make the progress I'd want to make on this mostly due to work as well. In any case, yeah that makes sense, I think that's in line with what I saw when I spent a good amount of time on rewriting the async reader, I think I did it right in that it doesn't "block the thread", but even when I made the "compute" part of the process, decompressing the data, artificially long so that it would compare to how long it takes to read files, I didn't see any improvements when comparing the async reader to the sync reader, for local files. It's probably because reading local file async doesn't really leverage the async part in the first place.

Okay I'll get started on this, shouldn't take too long I think.