Closed by cfregly 4 months ago
Use local NVMe (/opt/dlami/nvme/) as much as possible; otherwise, try S3 alone or FSx+S3.
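One way to apply the local-NVMe advice is to choose a scratch directory at job startup and fall back when the mount is missing. This is a minimal sketch: only the `/opt/dlami/nvme` path comes from the note above, while the helper name `pick_scratch_dir` and the fallback to the system temp directory are illustrative assumptions.

```python
import os
import tempfile

NVME_ROOT = "/opt/dlami/nvme"  # path taken from the note above


def pick_scratch_dir(nvme_root: str = NVME_ROOT) -> str:
    """Return nvme_root if it is a writable directory, else the system temp dir.

    The fallback is an assumption for illustration; a real job might prefer
    an FSx mount or fail fast instead.
    """
    if os.path.isdir(nvme_root) and os.access(nvme_root, os.W_OK | os.X_OK):
        return nvme_root
    return tempfile.gettempdir()


if __name__ == "__main__":
    print(pick_scratch_dir())
```

Datasets and checkpoints staged under the returned directory then hit local disk rather than a network file system.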
Try out the latest PyTorch S3 Connector for data loading: https://github.com/awslabs/s3-connector-for-pytorch
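A rough sketch of what reading through the connector can look like, assuming the `s3torchconnector` package is installed and AWS credentials are configured. The helper name `iter_s3_objects` is hypothetical, and the bucket, prefix, and region are yours to fill in; check the repository above for the current API before relying on this.

```python
def iter_s3_objects(prefix_uri: str, region: str):
    """Yield (key, bytes) for each object under prefix_uri, streamed from S3.

    prefix_uri looks like "s3://<bucket>/<prefix>"; both parts are supplied
    by the caller.
    """
    # Imported lazily so this module loads even without the package installed.
    from s3torchconnector import S3IterableDataset

    dataset = S3IterableDataset.from_prefix(prefix_uri, region=region)
    for obj in dataset:
        # Each item is a reader object exposing the object key and its contents.
        yield obj.key, obj.read()
```

In a training job the dataset object would normally be handed to a `torch.utils.data.DataLoader` instead of being iterated directly.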
Try S3 Express OneZone: https://aws.amazon.com/s3/storage-classes/express-one-zone/
Make sure you are using the native AWS Common Runtime (CRT) with boto3, the AWS SDK for Python (pip install boto3[crt]): https://boto3.amazonaws.com/v1/documentation/api/1.20.41/guide/quickstart.html#using-the-aws-common-runtime-crt
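A quick sanity check that the CRT bindings are present in the environment, as a sketch. It only tests whether the `awscrt` package (pulled in by `boto3[crt]`) is importable; whether boto3 actually uses the CRT for a given transfer depends on the boto3 version and configuration, which this does not verify. The function name `crt_available` is hypothetical.

```python
import importlib.util


def crt_available() -> bool:
    """Return True if the awscrt package can be found in this environment."""
    return importlib.util.find_spec("awscrt") is not None


if __name__ == "__main__":
    if crt_available():
        print("awscrt is installed")
    else:
        print("awscrt not found; try: pip install 'boto3[crt]'")
```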
Try s5cmd: https://github.com/peak/s5cmd
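For staging data from S3 with s5cmd inside a Python launcher, something like the following could work. It is a sketch that assumes the `s5cmd` binary is on `PATH` and credentials are configured; the helper names are hypothetical and the worker count is just a starting point to tune.

```python
import shutil
import subprocess


def s5cmd_copy_command(src: str, dst: str, workers: int = 256) -> list[str]:
    """Build an s5cmd invocation that copies src to dst with parallel workers."""
    return ["s5cmd", "--numworkers", str(workers), "cp", src, dst]


def run_copy(src: str, dst: str, workers: int = 256) -> int:
    """Run the copy and return the exit code; raise if s5cmd is missing or fails."""
    if shutil.which("s5cmd") is None:
        raise FileNotFoundError("s5cmd not found on PATH")
    # No shell is involved, so a wildcard in src reaches s5cmd unexpanded.
    return subprocess.run(s5cmd_copy_command(src, dst, workers), check=True).returncode
```

For example, `run_copy("s3://<bucket>/<prefix>/*", "/opt/dlami/nvme/data/")` would pull a prefix onto local NVMe, with the bucket and prefix being your own.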
Compare performance using this script (check for the latest version on the main branch): https://github.com/shimomut/sagemaker-solutions/blob/main/io_speed_test/io_speed_test.py
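If you just want a rough number before running the linked script, a small stand-in like this can compare directories (local NVMe, an FSx mount, and so on). It is not the linked script, only a sketch; the function name and default sizes are assumptions, and the read figure may be inflated by the OS page cache because the file was just written.

```python
import os
import tempfile
import time


def measure_throughput(directory: str, size_mb: int = 64, block_mb: int = 4):
    """Write then read about size_mb of data in directory.

    Returns (write_MB_per_s, read_MB_per_s) for sequential I/O.
    """
    block = os.urandom(block_mb * 1024 * 1024)
    blocks = size_mb // block_mb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # include the time to reach the device
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(len(block)):
                pass
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)

    total_mb = blocks * block_mb
    return total_mb / write_s, total_mb / read_s


if __name__ == "__main__":
    w, r = measure_throughput(tempfile.gettempdir())
    print(f"write: {w:.0f} MB/s, read: {r:.0f} MB/s")
```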
There are more performance improvements coming for HyperPod.
How do I increase I/O throughput for my training and tuning jobs?