tsdev opened 4 months ago
Hello @tsdev,
The s3-connector-for-pytorch uses the CRT client under the hood, and we expose CRT's throughput_target_gbps and part_size configurations. For small files, increasing the part size may not help, but you can try raising the throughput_target_gbps parameter beyond the default 10 Gbps. This hints to the CRT client that you require higher throughput, which can improve the loading speed for your small image files. Please refer to the documentation for more information on these parameters and their usage.
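For illustration, something along these lines should work; the keyword names below follow the connector's S3ClientConfig, but please verify them against the version you have installed:

```python
# A minimal sketch of passing the CRT tuning knobs to the connector.
# Assumes the s3torchconnector package's S3ClientConfig; check the exact
# keyword names against your installed version's documentation.
from s3torchconnector import S3MapDataset, S3ClientConfig

config = S3ClientConfig(
    throughput_target_gbps=25.0,   # hint to the CRT client; the default is 10 Gbps
    part_size=8 * 1024 * 1024,     # part size in bytes; mostly matters for large objects
)

# Hypothetical bucket, prefix, and region for illustration only.
dataset = S3MapDataset.from_prefix(
    "s3://my-bucket/train/",
    region="us-east-1",
    s3client_config=config,
)
```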
Additionally, if possible, you can preprocess your dataset to combine the small files into larger shards. The section "Training data formats for Amazon S3" in the documentation provides guidance on accessing sharded data, which can improve performance when working with numerous small files. By combining small files into larger ones, you can potentially achieve higher throughput during training.
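As a rough sketch of that pre-sharding step (the paths, shard size, and helper below are only illustrative and not part of the connector):

```python
# Pack many small image files into ~100 MB tar shards before uploading to S3.
import os
import tarfile

def make_shards(image_dir: str, out_dir: str, shard_size_bytes: int = 100 * 1024 * 1024) -> None:
    """Group files from image_dir into sequentially numbered tar shards in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    shard_idx, current_size, tar = 0, 0, None
    for name in sorted(os.listdir(image_dir)):
        path = os.path.join(image_dir, name)
        # Start a new shard when none is open or the current one has reached the size limit.
        if tar is None or current_size >= shard_size_bytes:
            if tar is not None:
                tar.close()
            tar = tarfile.open(os.path.join(out_dir, f"shard-{shard_idx:05d}.tar"), "w")
            shard_idx, current_size = shard_idx + 1, 0
        tar.add(path, arcname=name)
        current_size += os.path.getsize(path)
    if tar is not None:
        tar.close()

make_shards("./images", "./shards")
```

Each shard is then a single S3 object, so a training-time reader fetches one larger object per request instead of thousands of tiny ones.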
I am trying to load a large number of small image files from S3 (each around 100 kB to 1 MB in size) for model training. Currently I achieve about 100 images/sec loading from a single-AZ bucket using 32 dataloader workers. The core issue is that I cannot control the internal parallelism of s3-connector-for-pytorch. So my questions are: