crazyguitar opened this issue 3 months ago
Hi @crazyguitar,
Thank you for sharing your experience with the s3-connector-for-pytorch. The error message "AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already completed" is indeed ambiguous and can be confusing.
It seems that the issue you encountered was related to the object size limit implied by the given part_size. Increasing the part_size from 8MB to 32MB was the right decision, as it allowed you to upload larger objects without hitting that limit.
The s3-connector-for-pytorch uses the AWS Common Runtime (CRT) under the hood, which breaks large requests into smaller part-sized requests and executes them in parallel. An S3 multipart upload can consist of at most 10,000 parts, so with a part_size of 8MB the maximum object size is around 80GB. If your model checkpoint was larger than this limit, increasing the part_size was the appropriate solution. Was your model larger than 80GB?
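For reference, here is the back-of-the-envelope arithmetic behind that limit. The 10,000-part cap is an S3 multipart-upload constraint; the helper below is purely illustrative and not part of the connector's API:

```python
# S3 multipart uploads are capped at 10,000 parts, so the largest
# object a given part_size can support is part_size * 10,000.
MAX_PARTS = 10_000
MiB = 1024 * 1024

def max_object_size(part_size_bytes: int) -> int:
    """Largest object uploadable with the given part size."""
    return part_size_bytes * MAX_PARTS

print(max_object_size(8 * MiB) / (1024 * MiB))   # ~78 GiB with 8MB parts
print(max_object_size(32 * MiB) / (1024 * MiB))  # ~312 GiB with 32MB parts
```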
I will reach out to the CRT team to discuss whether it is possible to provide a more meaningful error message when the object size of an upload exceeds the limit implied by the current part_size. I will also take a look at our documentation to make it more helpful regarding the usage of part_size.
Thank you for your feedback and for sharing your experience. It will help us improve the user experience and documentation for the s3-connector-for-pytorch.
s3torchconnector version
s3torchconnector-1.2.3
s3torchconnectorclient version
s3torchconnectorclient-1.2.3
AWS Region
us-west-2
Describe the running environment
EC2 instance p4d.24xlarge
NAME="Amazon Linux" VERSION="2" ID="amzn" ID_LIKE="centos rhel fedora" VERSION_ID="2" PRETTY_NAME="Amazon Linux 2" ANSI_COLOR="0;33" CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2" HOME_URL="https://amazonlinux.com/" SUPPORT_END="2025-06-30"
Amazon Linux release 2 (Karoo)
What happened?
Hi team,
I was running the Llama v2 70b model on 32 nodes using the Slurm job scheduler, based on the SageMaker Model Parallelism Library v2 (using a specific Docker image). To improve checkpoint performance, I tried using the s3connector. However, when writing checkpoints to S3, the writer stream encountered an error: "Unknown CRT error: CRT error 14366: aws-c-s3: AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already completed". Note that Llama v2 7b and 13b did not encounter this issue.
This error seemed to indicate that I had hit the S3 rate limit. The issue was resolved when I increased the part_size from 8MB to 32MB. However, the error code was ambiguous, and I was unsure whether the problem was indeed related to the rate limit. Could the team help explain this specific error? Thank you.
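For anyone hitting the same error, this is roughly how the larger part size can be configured; a minimal sketch assuming the S3ClientConfig option exposed by recent s3torchconnector releases, with the bucket name and checkpoint key as hypothetical placeholders:

```python
import torch
from s3torchconnector import S3Checkpoint, S3ClientConfig

# Raise part_size from the 8MB default to 32MB so a 10,000-part
# multipart upload can cover checkpoints up to ~312GiB instead of ~78GiB.
config = S3ClientConfig(part_size=32 * 1024 * 1024)
checkpoint = S3Checkpoint(region="us-west-2", s3client_config=config)

model = torch.nn.Linear(8, 8)  # stand-in for the real model

# "my-bucket" and the key below are placeholders, not from the issue.
with checkpoint.writer("s3://my-bucket/llama-v2-70b/ckpt.pt") as writer:
    torch.save(model.state_dict(), writer)
```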