Open asfimport opened 2 years ago
Antoine Pitrou / @pitrou: cc @westonpace @wjones127
Weston Pace / @westonpace:
This probably deserves some testing and profiling. At first glance, however, the linked doc for ConnectionPoolSizeOption says:
"The library may create more connections than this option configures, for example if your application requests many simultaneous downloads."
It seems like this option shouldn't prevent concurrency. Also, we should see if we can find some concrete guidance on the number of threads. For example, S3 recommends: "Make one concurrent request for each 85–90 MB/s of desired network throughput."
If the ideal concurrency really is 100 threads, we should, for now, document this somewhere visible to users so they know to bump the I/O thread pool capacity. In the future we should find a way to adjust the I/O thread pool capacity automatically, but that is a more substantial task.
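As a rough illustration of the S3 guidance quoted above, the arithmetic can be sketched as follows. `EstimateConcurrentRequests` is a hypothetical helper, not part of Arrow or any cloud SDK, and the 85–90 MB/s per-request figure is S3's number, so whether it carries over to GCS is an assumption that would need profiling.

```cpp
#include <cmath>
#include <stdexcept>

// Hypothetical helper (not an Arrow or cloud SDK API): estimates how many
// concurrent requests are needed to reach a target aggregate throughput,
// given an assumed throughput per request. Both values are in MB/s.
int EstimateConcurrentRequests(double target_mb_per_s,
                               double per_request_mb_per_s) {
  if (target_mb_per_s <= 0.0 || per_request_mb_per_s <= 0.0) {
    throw std::invalid_argument("throughputs must be positive");
  }
  // Round up so the estimate never falls short of the target.
  return static_cast<int>(std::ceil(target_mb_per_s / per_request_mb_per_s));
}
```

Under these assumptions, 100 concurrent requests would correspond to a target of about 8500 MB/s at 85 MB/s per request. An estimate like this could then be used as the value passed to `arrow::io::SetIOThreadPoolCapacity`.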
Antoine Pitrou / @pitrou: cc @benibus
Multi-threaded read performance in Arrow's GCS file system implementation is currently relatively low. Given the high latency of cloud blob systems like GCS, a common strategy is to use many concurrent readers (if the system has enough memory to support that), e.g. using 100 threads.
The GCS client library offers a ConnectionPoolSize option. If this option is set to a value that's too low, concurrency is throttled. At the moment, this is not exposed in GcsOptions, consequently limiting multi-threaded throughput.
Instead of exposing this option, an alternative implementation strategy could be to use the same value as set by arrow::io::SetIOThreadPoolCapacity.
Reporter: Leonhard Gruenschloss
Note: This issue was originally created as ARROW-17033. Please see the migration documentation for further details.