Closed philmod-h closed 2 weeks ago
Hi @philmod-h, 2.5.0 was skipped due to an issue with AL2023; we will start working on 2.5.1 directly this week. In the meantime, you can extend the 2.4 DLC, uninstall its PyTorch binary, and install 2.5.1 or a nightly build.
Here is an example: https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/distributed-training/pytorch/pytorchjob/fsdp/Dockerfile.llama2-efa-dlc#L31
See also: https://aws.amazon.com/blogs/machine-learning/scale-llms-with-pytorch-2-0-fsdp-on-amazon-eks-part-2/
FYI, we don't recommend using PyTorch 2.5.0; see https://github.com/pytorch/pytorch/issues/138324
To be notified of new releases, see https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notifications.html, or check our list of available images for updates: https://github.com/aws/deep-learning-containers/blob/master/available_images.md
Closing the issue; please reopen it if you still run into problems.
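The workaround suggested in the reply above (extend the 2.4 DLC, uninstall its PyTorch binary, install 2.5.1) could look roughly like the following. This is a minimal sketch, not an official recipe: the helper name `write_dockerfile` and the default file name `Dockerfile.pt251` are hypothetical, the base image tag is deliberately left to the caller (look it up in available_images.md), and the CUDA 12.4 wheel index and the torchvision/torchaudio pins are assumptions that need to match whatever the base image actually ships.

```shell
#!/bin/sh
# Hypothetical helper: write a Dockerfile that swaps the PyTorch build
# inside an existing 2.4 DLC image for PyTorch 2.5.1.
# Usage: write_dockerfile BASE_IMAGE [OUTPUT_FILE]
write_dockerfile() {
    base_image="$1"
    out="${2:-Dockerfile.pt251}"

    # Refuse to guess the base image; the caller must supply the 2.4 tag.
    if [ -z "$base_image" ]; then
        echo "usage: write_dockerfile BASE_IMAGE [OUTPUT_FILE]" >&2
        return 1
    fi

    cat > "$out" <<EOF
# Start from the 2.4 DLC passed in by the caller.
FROM ${base_image}

# Remove the PyTorch binaries that ship with the image.
RUN pip uninstall -y torch torchvision torchaudio

# Install 2.5.1. ASSUMPTION: the CUDA 12.4 wheels match the base image.
RUN pip install --no-cache-dir torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
EOF
}
```

After sourcing the helper, `write_dockerfile <2.4-inference-image-uri>` followed by `docker build -f Dockerfile.pt251 -t pytorch-inference:2.5.1-custom .` would build the image; the `-t` name here is only an illustration.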
@sallyseok Why was this issue closed? The image still doesn't exist:
$ docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.0-gpu-py311
Error response from daemon: manifest for 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.0-gpu-py311 not found: manifest unknown: Requested image not found
$ docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.1-gpu-py311
Error response from daemon: manifest for 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.1-gpu-py311 not found: manifest unknown: Requested image not found
Opened a new issue as I cannot reopen this one: https://github.com/aws/deep-learning-containers/issues/4404
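The failed pulls above can be turned into a quick availability check that does not download any layers. The sketch below assumes Docker is installed and already logged in to the ECR registry (for example via `aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com`); the function names `dlc_image_uri` and `dlc_image_exists` are hypothetical, and the default account and region are simply the ones used in the commands above.

```shell
#!/bin/sh
# Build a DLC image URI from its parts.
# $1 = repository, $2 = tag, $3 = region (optional), $4 = registry account (optional)
dlc_image_uri() {
    printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' \
        "${4:-763104351884}" "${3:-us-west-2}" "$1" "$2"
}

# Return 0 if the registry serves a manifest for the image, non-zero otherwise.
# `docker manifest inspect` queries the registry without pulling the image.
# Note: an authentication failure also yields a non-zero status.
dlc_image_exists() {
    docker manifest inspect "$1" >/dev/null 2>&1
}

# Example usage (not executed here):
#   for tag in 2.5.0-gpu-py311 2.5.1-gpu-py311; do
#       uri="$(dlc_image_uri pytorch-inference "$tag")"
#       if dlc_image_exists "$uri"; then echo "found:   $uri"; else echo "missing: $uri"; fi
#   done
```

Because a missing manifest and a failed login both produce a non-zero exit status, a "missing" result is only meaningful once the login step has succeeded.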
Concise Description:
We are upgrading our training infra to pytorch 2.5, so we also need the inference image with the same version:
Describe the solution you'd like
I tried unsuccessfully to build it myself following the instructions in this repo.