awslabs / s3-connector-for-pytorch

The Amazon S3 Connector for PyTorch delivers high throughput for PyTorch training jobs that access and store data in Amazon S3.
BSD 3-Clause "New" or "Revised" License

CRT raises AWS_ERROR_S3_REQUEST_HAS_COMPLETED when uploading Llama 70b checkpoints. #219

Open crazyguitar opened 3 months ago

crazyguitar commented 3 months ago

s3torchconnector version

s3torchconnector-1.2.3

s3torchconnectorclient version

s3torchconnectorclient-1.2.3

AWS Region

us-west-2

Describe the running environment

EC2 instance p4d.24xlarge
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
Amazon Linux release 2 (Karoo)

What happened?

Hi team,

I was running the Llama v2 70b model on 32 nodes using the Slurm job scheduler, based on the SageMaker Model Parallelism Library v2 (using a specific Docker image). To improve checkpoint performance, I tried using the s3connector. However, when writing checkpoints to S3, the writer stream encountered an error: "Unknown CRT error: CRT error 14366: aws-c-s3: AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already complete". Note that the Llama v2 7b and 13b models did not encounter this issue.

This error seemed to indicate that I had hit the S3 rate limit. The issue was resolved when I increased the part_size from 8 MB to 32 MB. However, the error code is ambiguous, and I am unsure whether the problem was really related to the rate limit. Could the team help explain this specific error? Thank you.

Relevant log output

4: [rank34]:   File "/opt/conda/lib/python3.11/site-packages/s3torchconnector/s3writer.py", line 40, in write
4: [rank34]:     self.stream.write(data)
4: [rank34]: s3torchconnectorclient._mountpoint_s3_client.S3Exception: Client error: Unknown CRT error: CRT error 14366: aws-c-s3: AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already complete


IsaevIlya commented 3 months ago

Hi @crazyguitar,

Thank you for sharing your experience with the s3-connector-for-pytorch. The error message "AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already completed" is indeed ambiguous and can be confusing.

It seems that the issue you encountered was related to the object size limit for the given part_size. Increasing the part_size from 8 MB to 32 MB was the right decision, as it allowed you to upload larger objects without hitting the object size limit.

The s3-connector-for-pytorch uses the AWS Common Runtime (CRT) under the hood, which breaks large requests into smaller part-sized requests and executes them in parallel. A single upload can have at most 10,000 parts, so with a part_size of 8 MB, the maximum upload size is around 80 GB. If your model checkpoint was larger than this limit, increasing the part_size was the appropriate solution. Was your model larger than 80 GB?
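To make the arithmetic above concrete, here is a minimal Python sketch. It assumes only the 10,000-part limit mentioned above and treats part_size as a byte count in binary units (MiB); the helper names max_upload_bytes and min_part_size are hypothetical and are not part of s3torchconnector.

```python
# Illustrative helpers only; these names are hypothetical, not library API.
MAX_PARTS = 10_000   # maximum number of parts in one multipart upload
MiB = 1024 ** 2
GiB = 1024 ** 3


def max_upload_bytes(part_size: int, max_parts: int = MAX_PARTS) -> int:
    """Largest object that fits when every part is at most part_size bytes."""
    return part_size * max_parts


def min_part_size(object_bytes: int, max_parts: int = MAX_PARTS,
                  granularity: int = MiB) -> int:
    """Smallest part size (rounded up to a multiple of granularity) that
    keeps an object of object_bytes within max_parts parts."""
    raw = -(-object_bytes // max_parts)           # integer ceiling division
    return -(-raw // granularity) * granularity   # round up to granularity


if __name__ == "__main__":
    for mib in (8, 32):
        limit = max_upload_bytes(mib * MiB)
        print(f"part_size = {mib} MiB -> max object of about {limit / GiB:.1f} GiB")
    # Hypothetical 100 GiB checkpoint file, chosen only for illustration.
    needed = min_part_size(100 * GiB)
    print(f"100 GiB object needs part_size >= {needed // MiB} MiB")
```

With these assumptions, an 8 MiB part size caps a single object at roughly 78 GiB (about 84 GB in decimal units), while 32 MiB raises the cap to roughly 312 GiB, which is consistent with the larger part size resolving the failure if a single checkpoint object exceeded the smaller cap.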

I will reach out to the CRT team to discuss if it is possible to provide a more meaningful error message in situations where the object size for upload exceeds the current part_size limit. I will also take a look at our documentation to make it more helpful regarding the usage of part_size.

Thank you for your feedback and for sharing your experience. It will help us improve the user experience and documentation for the s3-connector-for-pytorch.