@devinrsmith pointed out an issue with a very large number of threads being spawned when reading partitioned parquet data from S3. This was happening because the codebase was creating a new instance of `S3AsyncClient` for each partition file discovered, and each instance internally creates a large number of threads by default.
As part of this PR:
- We now share a single `S3AsyncClient` across all partition files for the same table.
- We now share the underlying threads across `S3AsyncClient` instances instead of starting new threads for each instance.
- We deleted a number of internal public methods from Parquet-related classes that accepted Java `File` objects, in favor of methods accepting `URI`s.
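The sharing pattern above can be sketched with plain `java.util.concurrent` primitives. This is a minimal illustration, not Deephaven's actual internals: the `Client` class and pool names are hypothetical, and the pool sizes mirror the defaults of the two config parameters described below. The key point is that every client instance borrows from one shared pool rather than constructing its own.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SharedClientThreads {
    // One shared pool for completing futures, sized like the default of
    // S3.numFutureCompletionThreads (number of processors).
    static final ExecutorService FUTURE_COMPLETION_POOL =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // One shared scheduler for retry/timeout tasks, sized like the default of
    // S3.numScheduledExecutorThreads (5).
    static final ScheduledExecutorService SCHEDULER =
            Executors.newScheduledThreadPool(5);

    // Illustrative "client": each instance reuses the shared pools instead of
    // spawning its own threads, so N clients do not mean N thread pools.
    static final class Client {
        final ExecutorService completion;
        final ScheduledExecutorService scheduler;

        Client() {
            this.completion = FUTURE_COMPLETION_POOL;
            this.scheduler = SCHEDULER;
        }
    }

    public static void main(String[] args) throws Exception {
        Client a = new Client();
        Client b = new Client();
        // Both clients share the same executors.
        System.out.println(a.completion == b.completion); // true
        System.out.println(a.scheduler == b.scheduler);   // true
        FUTURE_COMPLETION_POOL.shutdown();
        SCHEDULER.shutdown();
        FUTURE_COMPLETION_POOL.awaitTermination(5, TimeUnit.SECONDS);
        SCHEDULER.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

For reference, the AWS SDK v2 itself allows injecting a shared executor for future completion via the builder's async configuration, which is one way such sharing can be wired into real `S3AsyncClient` instances.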
Documentation update: added two new config parameters:
- `S3.numFutureCompletionThreads`: The number of threads used to complete the futures returned by the async AWS S3 client. Defaults to the number of processors on the system.
- `S3.numScheduledExecutorThreads`: The number of threads used for scheduling tasks such as async retry attempts and timeout tasks in the AWS S3 client. Defaults to 5.
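Assuming these parameters are exposed through Deephaven's standard property mechanism, they could be overridden in a properties file along these lines (the values shown are illustrative; omitting both lines keeps the defaults described above):

```properties
# Sketch: tuning the new S3 threading knobs (illustrative values).
S3.numFutureCompletionThreads=8
S3.numScheduledExecutorThreads=5
```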
This issue was auto-generated
PR: https://github.com/deephaven/deephaven-core/pull/5451 Author: malhotrashivam