apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Ability to chunk download from object store #6106

Open trungda opened 1 month ago

trungda commented 1 month ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When downloading large objects (> 300 MB) using the object_store crate, I often hit timeouts with the default configuration (30-second connection timeout). Interestingly, when I increase the timeout, the download speed is actually lower (not sure whether it's the same for everyone).

Describe the solution you'd like

I am wondering whether it makes sense to split a file into smaller ranges (say, 100 MB each), download the ranges in parallel over separate connections, and reconcile them behind the same interface.
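As a rough illustration of the splitting step described above, here is a minimal sketch that needs only the standard library. The name `chunk_ranges` is hypothetical, and the choice of `usize` offsets is an assumption (some versions of object_store use `u64` byte ranges instead).

```rust
use std::ops::Range;

/// Split an object of `len` bytes into consecutive ranges of at most
/// `chunk_size` bytes each. The final range may be shorter.
/// (`chunk_ranges` is a hypothetical helper, not part of object_store.)
fn chunk_ranges(len: usize, chunk_size: usize) -> Vec<Range<usize>> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    (0..len)
        .step_by(chunk_size)
        // Clamp the end of the last range to the object length.
        .map(|start| start..start.saturating_add(chunk_size).min(len))
        .collect()
}
```

The total object length would have to be known up front, for example from the object's metadata, before the ranges can be computed.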

Describe alternatives you've considered

Not sure if such a capability can be composed using the existing interfaces.

Additional context

trungda commented 1 month ago

I originally submitted this issue in the datafusion repo, which I think was the wrong repo. Quoting the reply from @alamb:

Thank you @trungda

I think it would be very interesting to build a "parallel downloader" ObjectStore implementation, though I am not sure it necessarily belongs in the core object_store crate (though it could be added if there is enough interest)

There might also be some interesting ideas to explore around "racing reads" to avoid latency

There are many good ideas in this paper, BTW: https://dl.acm.org/doi/10.14778/3611479.3611486

I think you could compose this kind of smart client from the existing interfaces

tustvold commented 1 month ago

It should be relatively straightforward to achieve this using buffered (the order-preserving counterpart of buffer_unordered) from the futures crate; we may just need to document how to do this

alamb commented 1 month ago

Maybe it would make a good example

trungda commented 1 month ago

I can write an example. Using buffered is what we are doing to download multiple files concurrently. Something like this:

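use futures::stream::{self, StreamExt};

let parallelism = 10;
// One future per file; `download` (defined elsewhere) fetches the whole file.
let mut downloaders = Vec::new();
for path in paths.iter() {
    downloaders.push(download(path));
}
// Run at most `parallelism` downloads at a time, yielding results in input order.
let mut buffered = stream::iter(downloaders).buffered(parallelism);
while let Some(_) = buffered.next().await {}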

But it's not obvious to me how to use the stream interface with buffered, i.e., how can we reconcile different streams (from different parts of the file) into one stream? And is that something that is really needed?

alamb commented 1 month ago

how can we reconcile different streams (from different parts of the file) into one stream

I was imagining that it would look something like making multiple calls to ObjectStore::get_ranges for each file
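To make the reconciliation step concrete, here is a minimal sketch of fetching several ranges concurrently and stitching the parts back together in order. It deliberately swaps in standard-library threads for the async `buffered` approach discussed above, so that it runs without any extra crates. `fetch_in_parts` and its `fetch_range` argument are hypothetical names; `fetch_range` stands in for one ranged request, such as a call to `ObjectStore::get_range`.

```rust
use std::ops::Range;
use std::thread;

/// Fetch every range concurrently, then concatenate the parts in range order.
/// `fetch_range` is a hypothetical stand-in for a single ranged GET request.
fn fetch_in_parts<F>(ranges: &[Range<usize>], fetch_range: F) -> Vec<u8>
where
    F: Fn(Range<usize>) -> Vec<u8> + Sync,
{
    thread::scope(|scope| {
        // Spawn one worker per range; `handles` keeps the order of `ranges`.
        let handles: Vec<_> = ranges
            .iter()
            .cloned()
            .map(|range| {
                let fetch = &fetch_range;
                scope.spawn(move || fetch(range))
            })
            .collect();

        // Joining in spawn order restores the original byte order,
        // regardless of which request finishes first.
        let mut out = Vec::new();
        for handle in handles {
            out.extend(handle.join().expect("range worker panicked"));
        }
        out
    })
}
```

This sketch starts one thread per range, so it does not cap parallelism. In an async version, mapping each range to a ranged-read future and driving them with `buffered(n)` would both bound concurrency to `n` and yield the parts in input order, which is what makes the concatenation straightforward.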