tsdev opened 4 months ago
Hello @tsdev,
The s3-connector-for-pytorch uses the CRT client under the hood, and we expose CRT's throughput_target_gbps and part_size configurations. For small files, increasing the part size may not help, but you can try raising the throughput_target_gbps parameter beyond the default 10 Gbps. This hints to the CRT client that you require higher throughput, which can improve the loading speed for your small image files. Please refer to the documentation for more information on these parameters and their usage.
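For illustration, something along these lines should work; the keyword names below follow the connector's S3ClientConfig, but please verify them against the version you have installed:

```python
# A minimal sketch of passing the CRT tuning knobs to the connector.
# Assumes the s3torchconnector package's S3ClientConfig; check the exact
# keyword names against your installed version's documentation.
from s3torchconnector import S3MapDataset, S3ClientConfig

config = S3ClientConfig(
    throughput_target_gbps=25.0,   # hint to the CRT client; the default is 10 Gbps
    part_size=8 * 1024 * 1024,     # part size in bytes; mostly matters for large objects
)

# Hypothetical bucket, prefix, and region for illustration only.
dataset = S3MapDataset.from_prefix(
    "s3://my-bucket/train/",
    region="us-east-1",
    s3client_config=config,
)
```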
Additionally, if possible, you can preprocess your dataset to combine the small files into larger shards. The section "Training data formats for Amazon S3" in the documentation provides guidance on accessing sharded data, which can improve performance when working with numerous small files. By combining small files into larger ones, you can potentially achieve higher throughput during training.
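As a rough sketch of that pre-sharding step (the paths, shard size, and helper below are only illustrative and not part of the connector):

```python
# Pack many small image files into ~100 MB tar shards before uploading to S3.
import os
import tarfile

def make_shards(image_dir: str, out_dir: str, shard_size_bytes: int = 100 * 1024 * 1024) -> None:
    """Group files from image_dir into sequentially numbered tar shards in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    shard_idx, current_size, tar = 0, 0, None
    for name in sorted(os.listdir(image_dir)):
        path = os.path.join(image_dir, name)
        # Start a new shard when none is open or the current one has reached the size limit.
        if tar is None or current_size >= shard_size_bytes:
            if tar is not None:
                tar.close()
            tar = tarfile.open(os.path.join(out_dir, f"shard-{shard_idx:05d}.tar"), "w")
            shard_idx, current_size = shard_idx + 1, 0
        tar.add(path, arcname=name)
        current_size += os.path.getsize(path)
    if tar is not None:
        tar.close()

make_shards("./images", "./shards")
```

Each shard is then a single S3 object, so a training-time reader fetches one larger object per request instead of thousands of tiny ones.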
I am trying to load a large number of small image files from S3 (each around 100 kB to 1 MB in size) for model training. Currently I achieve about 100 images/sec loading from a single-AZ bucket using 32 dataloader workers. The core issue is that I cannot control the internal parallelism of s3-connector-for-pytorch. So my questions are: