apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++][FS][Azure] Expose parallel transfer config options available in the Azure SDK #40035

Open Tom-Newton opened 7 months ago

Tom-Newton commented 7 months ago

Describe the enhancement requested

Optimisation of https://github.com/apache/arrow/issues/37511. Child of https://github.com/apache/arrow/issues/18014.

When reading from Azure Blob Storage, the bandwidth we get per connection is very dependent on the latency to the filesystem. To achieve good bandwidth over a high-latency link, far greater concurrency is needed. For example, this is relevant when reading from blob storage in a different region from your compute.
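As a rough illustration with assumed numbers (not measurements from this issue): per-connection throughput is roughly bounded by the effective window size divided by the round-trip time, so a 1 MB window over a 50 ms cross-region round trip tops out near 1 MB / 0.05 s = 20 MB/s, while the same window over a 1 ms intra-region round trip could reach about 1 GB/s. Saturating the same link at higher latency therefore takes proportionally more concurrent requests.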

As an example, let's consider reading a Parquet file. There are two levels of parallelism that I'm aware of when using Arrow and the native AzureFileSystem:

  1. Arrow will make concurrent calls to ReadAt, one per column and row group combination. At most we can have one concurrent connection per column/row-group combination, so for small Parquet files this may be less concurrency than we would like.
  2. Within ReadAt, the AzureFileSystem calls BlobClient::DownloadTo, which implements some extra concurrency internally: https://github.com/Azure/azure-sdk-for-cpp/blob/ddd0f4bd075d6715ac3004136a690445c4cde5c2/sdk/storage/azure-storage-blobs/src/blob_client.cpp#L516. The purpose of this issue is to make the config options for this internal parallelism configurable by the user (see the sketch after this list).
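For reference, here is a minimal sketch of what those per-request transfer options look like on the Azure SDK side. The numeric values are just examples, and the Arrow plumbing around the call is omitted:

```cpp
#include <azure/storage/blobs.hpp>

#include <cstdint>
#include <vector>

// Sketch: download a byte range with explicit transfer options. The knobs in
// question are TransferOptions.InitialChunkSize, ChunkSize and Concurrency;
// the values below are examples only, not recommended defaults.
std::vector<uint8_t> DownloadRange(Azure::Storage::Blobs::BlobClient& client,
                                   int64_t offset, int64_t length) {
  Azure::Storage::Blobs::DownloadBlobToOptions options;
  options.Range = Azure::Core::Http::HttpRange{offset, length};
  options.TransferOptions.InitialChunkSize = 2 * 1024 * 1024;  // 2 MiB
  options.TransferOptions.ChunkSize = 2 * 1024 * 1024;         // 2 MiB
  options.TransferOptions.Concurrency = 10;  // parallel chunk downloads

  std::vector<uint8_t> buffer(static_cast<size_t>(length));
  client.DownloadTo(buffer.data(), buffer.size(), options);
  return buffer;
}
```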

Component(s)

C++

Tom-Newton commented 7 months ago

I'm going to run some performance tests for my use case.

Tom-Newton commented 7 months ago

I can measure a clear performance advantage from tuning these parameters a bit.

Defaults: median = 9.0 seconds, P95 = 12.5 seconds

Setting

        initial_chunk_size=2 * 1024 * 1024,
        chunk_size=2 * 1024 * 1024,
        concurrency=10

median = 5.6 seconds, P95 = 6.7 seconds

I did not try especially hard to tune this, but I think it is enough evidence to justify exposing these config options, and exposing them is easy.
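For illustration, exposing them could look roughly like the following hypothetical additions to AzureOptions (the field names and defaults are mine, not an agreed API):

```cpp
#include <cstdint>

// Hypothetical sketch only: illustrative field names, not the actual
// arrow::fs::AzureOptions API. Defaults mirror the values tested above.
struct AzureOptions {
  // ... existing options ...

  /// Size of the first range request issued per BlobClient::DownloadTo call.
  int64_t blob_download_initial_chunk_size = 2 * 1024 * 1024;
  /// Size of each subsequent range request.
  int64_t blob_download_chunk_size = 2 * 1024 * 1024;
  /// Maximum number of parallel range requests per DownloadTo call.
  int32_t blob_download_concurrency = 10;
};
```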

Tom-Newton commented 7 months ago

take

Tom-Newton commented 7 months ago

Given how small this change is, I think I will make one PR covering both C++ and Python, so I will wait for https://github.com/apache/arrow/pull/40021 to merge first.

felipecrv commented 7 months ago

I think exposing all these settings in AzureOptions may be premature. They are per-request settings, so a config in AzureOptions would force the internal implementation to stick to one set of values for every DownloadTo request.

My suggestion: we keep statistics about the ReadAt calls in a file handle and adjust the options as the calls come in. After this exercise we might expose a setting in AzureOptions that expresses which policy should be used (assuming we can't come up with a good adaptive policy). What the policies would be called depends on which workloads we can isolate in the benchmarks: latency vs. throughput, sequential vs. random.

An alternative to the named policies: we expose only read_file_max_concurrency [1] and keep a simple learned model that sets the best initial_chunk_size and chunk_size parameters adaptively. The goal becomes minimizing the latency of each individual ReadAt call; the user can set a low max concurrency factor for latency-optimized workloads and a high one for throughput-optimized workloads. A low concurrency factor would also be useful when the user of the FileSystem interface is managing multiple threads themselves.

[1] Set it to some multiple of the number of CPU cores by default.
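Purely as an illustration of that idea (the names and the chunk-size formula below are invented for the sketch, not a proposed design):

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative sketch of "expose only a max concurrency, derive chunk sizes
// adaptively". ReadAtStats, TransferPlan and the formula are made up here;
// they are not existing Arrow code or an agreed design.
struct ReadAtStats {
  int64_t calls = 0;
  int64_t bytes_requested = 0;
};

struct TransferPlan {
  int64_t chunk_size;
  int32_t concurrency;
};

TransferPlan PlanTransfer(const ReadAtStats& stats, int64_t request_length,
                          int32_t read_file_max_concurrency) {
  // Start from the average request size seen so far (or this request's size).
  const int64_t typical = stats.calls > 0
                              ? stats.bytes_requested / stats.calls
                              : request_length;
  // Aim for enough chunks to use the allowed concurrency, but keep chunks
  // between 1 MiB and 8 MiB so very small reads stay a single request.
  const int64_t target_chunk =
      std::clamp<int64_t>(typical / read_file_max_concurrency,
                          1 * 1024 * 1024, 8 * 1024 * 1024);
  const int32_t concurrency = static_cast<int32_t>(std::min<int64_t>(
      read_file_max_concurrency,
      std::max<int64_t>(1, request_length / target_chunk)));
  return TransferPlan{target_chunk, concurrency};
}
```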

Tom-Newton commented 7 months ago

These are all good suggestions, but they are a lot more complex. Personally, I would not be comfortable committing to implement something like that.

felipecrv commented 7 months ago

> These are all good suggestions, but they are a lot more complex. Personally, I would not be comfortable committing to implement something like that.

Start exposing read_file_max_concurrency with a generous value and hardcode values for initial_chunk_size and chunk_size inside azurefs.cc to your liking, then open a separate issue that I can work on.

My goal is to avoid prematurely exposing settings that (1) are difficult to tweak in a well-informed way, and (2) would prevent future optimizations on our side based on real-world workloads.

The SDK's default settings seem to be very conservative regarding parallelization because most SDK users are likely making full-blob downloads inside a system that already manages multiple I/O threads -- that explains the huge threshold for parallelization (>256 MB). We know Arrow workloads are probably small to medium ReadAt calls to the same file on a single thread (e.g. a Jupyter notebook driving the calls), so we benefit from the parallelism offered by the SDK: lower initial_chunk_size and chunk_size with a high concurrency.

Tom-Newton commented 6 months ago

I unassigned myself because I don't really know when I can work on this, but I might pick it up again at a later date.

Tom-Newton commented 4 months ago

I've been thinking a bit about a policy we could use to set these parameters automatically (as @felipecrv suggested). So far my only idea is to make each call to ReadAt one iteration of an optimisation algorithm, e.g. gradient descent optimising for ReadAt duration.

If we imagine varying chunk_size from very small to very large, the optimal performance is going to be somewhere in the middle, and I expect the optimum will depend on the latency and bandwidth between blob storage and the client.
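A very rough sketch of what one iteration of such a tuner could look like; this uses a simple hill climb on observed throughput standing in for gradient descent, and all names and constants are invented for the example:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative only: adjust chunk_size after each ReadAt based on whether the
// observed throughput improved. This is a multiplicative hill climb, not true
// gradient descent, and is not existing Arrow code.
struct ChunkSizeTuner {
  int64_t chunk_size = 4 * 1024 * 1024;  // current guess
  double last_throughput = 0.0;          // bytes per second
  double step = 1.25;                    // multiplicative step

  void Observe(int64_t bytes_read, double seconds) {
    if (seconds <= 0.0) return;
    const double throughput = bytes_read / seconds;
    if (throughput < last_throughput) {
      // The last adjustment made things worse: reverse direction.
      step = 1.0 / step;
    }
    last_throughput = throughput;
    chunk_size = std::clamp<int64_t>(
        static_cast<int64_t>(chunk_size * step),
        1 * 1024 * 1024, 64 * 1024 * 1024);
  }
};
```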

I've also taken a bit of a look at how azcopy implements this, since my experience is that azcopy is very fast in a wide variety of situations without needing any configuration. I thought it was interesting that it deliberately tries to read out of order: https://github.com/Azure/azure-storage-azcopy/blob/dae00e95050b5e3308106fd15313963694db18a8/cmd/copyEnumeratorHelper.go#L21. I think the downloading actually happens here: https://github.com/Azure/azure-storage-azcopy/blob/dae00e95050b5e3308106fd15313963694db18a8/ste/xfer-remoteToLocal-file.go#L307-L332. There is also mention of applying an auto-correct policy to the block size if the user did not set it.

I'm definitely going to dig a bit deeper into what azcopy does before I commit to a particular strategy.