apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] Add GCS connection pool size option #20314

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Multi-threaded read performance in Arrow's GCS file system implementation is currently relatively low. Given the high latency of cloud blob stores like GCS, a common strategy is to use many concurrent readers (if the system has enough memory to support that), e.g. 100 threads.

The GCS client library offers a ConnectionPoolSize option. If this option is set to a value that's too low, concurrency is throttled. At the moment, this option is not exposed in GcsOptions, which limits multi-threaded throughput.

Instead of exposing this option, an alternative implementation strategy could be to use the same value as set by arrow::io::SetIOThreadPoolCapacity.
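To make the argument above concrete, here is a minimal back-of-the-envelope sketch in C++. It is not Arrow or google-cloud-cpp code: the function name, its parameters, and the model itself are hypothetical. It assumes that reads are purely latency-bound and that the connection pool size acts as a hard cap on in-flight requests, which is the hypothesis of this report and has not been verified.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical model for illustration only; not part of Arrow or the GCS client.
// Estimates aggregate read throughput in MB/s when every request is
// latency-bound and fetches `request_mb` megabytes in `latency_s` seconds.
double EstimateThroughputMBps(std::size_t reader_threads,
                              std::size_t connection_cap,
                              double request_mb,
                              double latency_s) {
  // Assumption: no more than `connection_cap` requests are in flight at once,
  // no matter how many reader threads are issuing them.
  const std::size_t in_flight = std::min(reader_threads, connection_cap);
  // Each in-flight request delivers `request_mb` every `latency_s` seconds.
  return static_cast<double>(in_flight) * request_mb / latency_s;
}
```

Under these assumptions, 100 reader threads with a cap of 4 connections would behave like 4 readers, so raising the thread count alone would not help. If the two values were tied together, as with the arrow::io::SetIOThreadPoolCapacity alternative suggested above, `reader_threads` and `connection_cap` would be equal and the cap would never bind.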

Reporter: Leonhard Gruenschloss

Note: This issue was originally created as ARROW-17033. Please see the migration documentation for further details.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: cc @westonpace @wjones127

asfimport commented 2 years ago

Weston Pace / @westonpace: This probably deserves some testing and profiling. At first glance at the linked doc for ConnectionPoolSizeOption, however, I see:

The library may create more connections than this option configures, for example if your application requests many simultaneous downloads.

It seems like this option shouldn't prevent concurrency. Also, we should see if we can find some concrete guidance on the number of threads. For example, S3 recommends "Make one concurrent request for each 85–90 MB/s of desired network throughput".

If the ideal concurrency really is 100 threads, we should, for now, document this somewhere visible to users so they know to bump the I/O thread pool capacity. In the future we should find a way to adjust the I/O thread pool capacity automatically, but that is a considerably larger task.
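As a rough illustration of what such guidance could look like, the S3 rule quoted above can be turned into a simple sizing calculation. This is only a sketch: the helper name is hypothetical, the 85 MB/s default is the lower bound of the S3 figure, and nothing here establishes that the same number applies to GCS.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: number of concurrent requests needed to reach
// `target_mbps`, assuming each request sustains `per_request_mbps`.
// The default of 85 MB/s comes from the S3 guidance; it is an assumption
// for GCS, not a measured value.
std::size_t RequestsForThroughput(double target_mbps,
                                  double per_request_mbps = 85.0) {
  // Fall back to a single request for non-positive inputs.
  if (target_mbps <= 0.0 || per_request_mbps <= 0.0) return 1;
  // Round up so the target is met rather than approached.
  return static_cast<std::size_t>(std::ceil(target_mbps / per_request_mbps));
}
```

For a target of 1000 MB/s this gives 12 concurrent requests, well below 100. A user would then raise the I/O thread pool capacity to at least that number, and profiling would need to confirm whether the per-request figure holds for GCS.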

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: cc @benibus