Tom-Newton opened 7 months ago
I'm going to run some performance tests for my use case.
I can measure a clear performance advantage from tuning these parameters a bit.

Defaults: median=9.0 seconds, P95=12.5 seconds

Setting `initial_chunk_size=2*1024*1024`, `chunk_size=2*1024*1024`, `concurrency=10`: median=5.6 seconds, P95=6.7 seconds
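For intuition about why these settings help, here is a rough model (my own approximation, not the SDK's exact logic) of how a chunked download splits a read: one initial request of `initial_chunk_size` bytes, then the remainder divided into `chunk_size` pieces that up to `concurrency` connections can fetch in parallel.

```cpp
#include <cstdint>

// Rough model of a chunked download (an approximation, not the SDK's
// exact algorithm): one initial request of `initial_chunk_size` bytes,
// then the remainder split into `chunk_size`-byte range requests.
constexpr std::int64_t CeilDiv(std::int64_t a, std::int64_t b) {
  return (a + b - 1) / b;
}

constexpr std::int64_t NumRequests(std::int64_t total_bytes,
                                   std::int64_t initial_chunk_size,
                                   std::int64_t chunk_size) {
  if (total_bytes <= initial_chunk_size) return 1;
  return 1 + CeilDiv(total_bytes - initial_chunk_size, chunk_size);
}
```

With the tuned settings above, a 64 MiB read becomes 1 + 31 = 32 range requests that up to 10 connections can work on concurrently, instead of one long serial stream.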
I did not try especially hard to tune this, but I think this is enough evidence to justify exposing these config options. Exposing them is easy.
take
Given how small this change is I think I will make one PR for C++ and Python. Therefore I will wait for https://github.com/apache/arrow/pull/40021 to merge first.
I think exposing all these settings in `AzureOptions` can be premature. They are per-request settings, so allowing a config in `AzureOptions` would force the internal implementation to stick to one set of values for every `DownloadTo` request.
My suggestion: we keep statistics about the `ReadAt` calls in a file handle and adjust the options as the calls come in. After this exercise we might expose a setting in `AzureOptions` that expresses which policy should be used (assuming we can't come up with a good adaptive policy). What the policies would be called depends on which workloads we can isolate in the benchmarks: latency vs. throughput, sequential vs. random.
An alternative to the named policies could be: we expose only `read_file_max_concurrency` [1], and we keep learning a simple model that sets the best `initial_chunk_size` and `chunk_size` parameters adaptively. The goal becomes minimizing the latency of each individual `ReadAt` call; the user can then set a low max-concurrency factor for latency-optimized workloads and a high one for throughput-optimized workloads. A low concurrency factor would also be useful when the user of the `FileSystem` interface is managing multiple threads themselves.
[1] set it to some multiple of CPU cores by default
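If we went that route, the default could be derived from the hardware. A minimal sketch (the function name and multiplier are hypothetical, not a proposed API):

```cpp
#include <thread>

// Hypothetical default for a `read_file_max_concurrency` option: some
// multiple of the core count. hardware_concurrency() is only a hint
// and may return 0 ("unknown") per the standard, so apply a floor.
int DefaultReadFileMaxConcurrency(int multiplier = 4) {
  unsigned cores = std::thread::hardware_concurrency();
  if (cores == 0) cores = 1;  // 0 means the core count is unknown
  return static_cast<int>(cores) * multiplier;
}
```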
These are all good suggestions but they are a lot more complex. Personally I would not be comfortable committing to implement something like that.
Start by exposing `read_file_max_concurrency` with a generous value, hardcode values for `initial_chunk_size` and `chunk_size` inside `azurefs.cc` to your liking, then open a separate issue that I can work on.
My goal is to avoid prematurely exposing settings that (1) are difficult to tweak in a well-informed way, and (2) prevent future optimizations on our side based on real-world workloads.
The default settings of the SDK seem to be very conservative regarding parallelization because most SDK users are likely making full-blob downloads inside a system that already manages multiple I/O threads -- that explains their huge threshold for parallelization (>256 MB). We know Arrow workloads are probably small-to-medium `ReadAt` calls to the same file on a single thread (e.g. a Jupyter notebook driving the calls), so we benefit from the parallelism offered by the SDK: lower `initial_chunk_size` and `chunk_size` with a high `concurrency`.
I unassigned myself because I don't really know when I can work on this, but I might pick it up again at a later date.
I've been thinking a bit about a policy we can use to set these parameters automatically (as @felipecrv suggested). So far my only idea is to make each call to `ReadAt` one iteration of an optimisation algorithm, e.g. gradient descent optimising for `ReadAt` duration.
If we imagine varying `chunk_size` from very small to very large, the optimal performance is going to be somewhere in the middle, and I expect the optimum will depend on latency and bandwidth between blob storage and the client.
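To make the idea concrete, here is a toy hill-climbing variant of that optimiser (all names, constants, and the update rule are mine, purely illustrative): each completed `ReadAt` reports its throughput, and we keep moving `chunk_size` in the same direction while throughput improves, reversing when it gets worse.

```cpp
#include <algorithm>
#include <cstdint>

// Toy hill-climbing tuner (illustrative only): each observed ReadAt
// adjusts chunk_size by a multiplicative step, reversing direction
// whenever throughput regressed relative to the previous call.
class ChunkSizeTuner {
 public:
  std::int64_t chunk_size() const { return chunk_size_; }

  // Feed in the throughput (bytes/sec) of the ReadAt that just finished.
  void Observe(double throughput_bytes_per_sec) {
    if (throughput_bytes_per_sec < last_throughput_) {
      direction_ = -direction_;  // the last move hurt us: back off
    }
    last_throughput_ = throughput_bytes_per_sec;
    double factor = direction_ > 0 ? 2.0 : 0.5;
    chunk_size_ = std::clamp<std::int64_t>(
        static_cast<std::int64_t>(chunk_size_ * factor), kMinChunk, kMaxChunk);
  }

 private:
  static constexpr std::int64_t kMinChunk = 64 * 1024;         // 64 KiB
  static constexpr std::int64_t kMaxChunk = 64 * 1024 * 1024;  // 64 MiB
  std::int64_t chunk_size_ = 2 * 1024 * 1024;                  // 2 MiB start
  double last_throughput_ = 0.0;
  int direction_ = +1;  // +1: grow chunk_size, -1: shrink it
};
```

A real policy would also need to handle noise (per-call throughput is very jittery) and reads of wildly different sizes, which is where this toy version would fall over.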
I've also taken a bit of a look at how `azcopy` implements this, since my experience is that `azcopy` is very fast in a wide variety of situations without needing to provide any configuration.
I thought it was interesting that it deliberately tries to read out of order https://github.com/Azure/azure-storage-azcopy/blob/dae00e95050b5e3308106fd15313963694db18a8/cmd/copyEnumeratorHelper.go#L21
I think the downloading actually happens here https://github.com/Azure/azure-storage-azcopy/blob/dae00e95050b5e3308106fd15313963694db18a8/ste/xfer-remoteToLocal-file.go#L307-L332
There is mention of applying some auto-correct policy on the block-size if the user did not set it.
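The shape of such an auto-correct might look like the sketch below. This is my guess at the general idea, not azcopy's actual code; the 50,000 figure is Azure's documented per-blob block limit, but the doubling policy is an assumption.

```cpp
#include <cstdint>

// Sketch of a block-size auto-correct (my guess at the shape of such a
// policy, not azcopy's actual implementation): if the user didn't set
// a block size, grow the default until the file fits within Azure's
// 50,000-blocks-per-blob limit.
constexpr std::int64_t kMaxBlocksPerBlob = 50000;

std::int64_t AutoCorrectBlockSize(std::int64_t file_size,
                                  std::int64_t requested_block_size) {
  std::int64_t block_size = requested_block_size;
  while (kMaxBlocksPerBlob * block_size < file_size) {
    block_size *= 2;  // double until the block count fits the limit
  }
  return block_size;
}
```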
I'm definitely going to dig a bit deeper into what `azcopy` does before I commit to a particular strategy.
Describe the enhancement requested
Optimisation to https://github.com/apache/arrow/issues/37511. Child of https://github.com/apache/arrow/issues/18014.
When reading from Azure blob storage, the bandwidth we get per connection is very dependent on the latency to the filesystem. To achieve good bandwidth with high latency, far greater concurrency is needed. For example, this is relevant when reading from blob storage in a different region to your compute.
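As a back-of-the-envelope model of that effect (the numbers in the comments are illustrative, not benchmark results): a range request of `chunk` bytes costs roughly one round trip plus transfer time, so per-connection throughput collapses as RTT grows and parallel connections are needed to claw it back.

```cpp
// Back-of-the-envelope model (illustrative, not a benchmark): a range
// request costs one round trip plus transfer time, so per-connection
// throughput falls as RTT grows; parallel connections recover it
// until the link itself saturates.
double EffectiveThroughput(double chunk_bytes, double rtt_sec,
                           double link_bw_bytes_per_sec, int concurrency) {
  double per_request_sec = rtt_sec + chunk_bytes / link_bw_bytes_per_sec;
  double per_connection = chunk_bytes / per_request_sec;
  double total = per_connection * concurrency;
  return total < link_bw_bytes_per_sec ? total : link_bw_bytes_per_sec;
}

// E.g. 2 MiB chunks on a ~125 MB/s link: at ~2 ms RTT one connection
// is near link speed, but at ~80 ms RTT one connection manages only
// ~21 MB/s, so around 6 connections are needed to saturate the link.
```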
As an example, let's consider reading a parquet file. There are 2 levels of parallelism that I'm aware of when using Arrow and the native `AzureFileSystem`:

1. A `ReadAt` call for each column and row group combination. At most we can have one concurrent connection per column and row group combination, so for small parquet files this may be less than we would like.
2. Within `ReadAt` the `AzureFileSystem` calls `BlobClient::DownloadTo`, which implements some extra concurrency internally: https://github.com/Azure/azure-sdk-for-cpp/blob/ddd0f4bd075d6715ac3004136a690445c4cde5c2/sdk/storage/azure-storage-blobs/src/blob_client.cpp#L516

The purpose of this issue is to make the config options for this parallelism configurable by the user.

Component(s)
C++