aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.

How do I increase I/O throughput for my training and tuning jobs? #200

Closed cfregly closed 4 months ago

cfregly commented 4 months ago

How do I increase I/O throughput for my training and tuning jobs?

cfregly commented 4 months ago

Use the instance-local NVMe storage (/opt/dlami/nvme/) as much as possible; otherwise, try S3 alone or FSx for Lustre backed by S3 (FSx+S3).
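
A minimal staging sketch, assuming the dataset lives on an FSx for Lustre mount (the /fsx path and dataset layout below are hypothetical): copy it onto local NVMe once, then point the data loader at the local copy.

```python
import shutil
import os

# Hypothetical paths: dataset on an FSx for Lustre mount, staged onto the
# instance-local NVMe volume that the DLAMI exposes at /opt/dlami/nvme/.
FSX_DATASET = "/fsx/datasets/imagenet/train"
LOCAL_DATASET = "/opt/dlami/nvme/imagenet/train"

if not os.path.isdir(LOCAL_DATASET):
    # One-time copy; every epoch after this reads from local NVMe
    # instead of going over the network.
    shutil.copytree(FSX_DATASET, LOCAL_DATASET)

# Build your Dataset / DataLoader against LOCAL_DATASET from here on.
```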

Try out the latest Amazon S3 Connector for PyTorch for data loading: https://github.com/awslabs/s3-connector-for-pytorch
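
A minimal sketch of loading training data with the connector, assuming the `s3torchconnector` package is installed and using a hypothetical bucket/prefix; the transform shown just returns raw bytes and would normally decode each sample into a tensor.

```python
from torch.utils.data import DataLoader
from s3torchconnector import S3MapDataset

# Hypothetical dataset location; replace with your bucket, prefix, and region.
DATASET_URI = "s3://my-training-bucket/imagenet/train/"
REGION = "us-east-1"

def load_sample(s3_object):
    # Each item is an S3 object; in real training code, decode the bytes
    # (e.g. image decode + transforms) into a tensor here.
    return s3_object.key, s3_object.read()

dataset = S3MapDataset.from_prefix(DATASET_URI, region=REGION, transform=load_sample)
loader = DataLoader(dataset, batch_size=32, num_workers=8)

for keys, payloads in loader:
    pass  # run the training step on the decoded batch
```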

Try S3 Express One Zone: https://aws.amazon.com/s3/storage-classes/express-one-zone/
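
A small sketch of reading from an S3 Express One Zone directory bucket with boto3. The bucket name below is hypothetical; directory buckets follow the `<name>--<az-id>--x-s3` naming convention, and recent boto3 versions handle the Express session authentication automatically.

```python
import boto3

# Hypothetical directory bucket, ideally in the same Availability Zone as
# the training instances; names use the "<name>--<az-id>--x-s3" format.
BUCKET = "my-training-data--use1-az4--x-s3"

s3 = boto3.client("s3", region_name="us-east-1")

# Standard S3 APIs work against directory buckets.
obj = s3.get_object(Bucket=BUCKET, Key="shards/shard-00000.tar")
payload = obj["Body"].read()
print(f"read {len(payload)} bytes from {BUCKET}")
```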

Make sure you are using the native AWS Common Runtime (CRT) when using boto3 / the AWS SDK for Python: https://boto3.amazonaws.com/v1/documentation/api/1.20.41/guide/quickstart.html#using-the-aws-common-runtime-crt. Install it with `pip install boto3[crt]`.
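
A quick sketch to confirm the CRT extra is installed and to stage an object onto local NVMe with boto3's managed transfer (the bucket, key, and paths below are hypothetical; boto3 uses the CRT-based S3 transfer manager automatically when it is available).

```python
import boto3

try:
    import awscrt  # noqa: F401 -- its presence means the CRT extra is installed
    print("AWS CRT found; CRT-accelerated S3 transfers are available.")
except ImportError:
    print("AWS CRT not found; install it with: pip install 'boto3[crt]'")

s3 = boto3.client("s3", region_name="us-east-1")

# Hypothetical bucket/key; stage large objects onto local NVMe for fast reads.
s3.download_file(
    "my-training-bucket",
    "datasets/train-00000.tar",
    "/opt/dlami/nvme/train-00000.tar",
)
```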

Try s5cmd for fast, parallel S3 transfers: https://github.com/peak/s5cmd

Compare the performance of these options using this script (check the main branch for the latest version): https://github.com/shimomut/sagemaker-solutions/blob/main/io_speed_test/io_speed_test.py
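
If you just want a rough number without the full script, here is a minimal sketch (not the linked io_speed_test.py) that times sequential reads of a directory tree so you can compare, say, local NVMe against an FSx mount; note that the Linux page cache can skew repeated runs.

```python
import os
import time

def read_throughput_gbps(path, block_size=8 * 1024 * 1024):
    """Walk `path`, read every file sequentially, and return GB/s."""
    total_bytes = 0
    start = time.perf_counter()
    for root, _, files in os.walk(path):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                while chunk := f.read(block_size):
                    total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e9

# Hypothetical paths; run against the same dataset on each backend.
for path in ("/opt/dlami/nvme/imagenet/train", "/fsx/datasets/imagenet/train"):
    print(f"{path}: {read_throughput_gbps(path):.2f} GB/s")
```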

There are more performance improvements coming for SageMaker HyperPod.