aws / deep-learning-containers

AWS Deep Learning Containers are pre-built Docker images that make it easier to run popular deep learning frameworks and tools on AWS.
https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html

[feature-request] Inference image for pytorch 2.5 #4398

Closed philmod-h closed 2 weeks ago

philmod-h commented 3 weeks ago

Concise Description:

We are upgrading our training infra to pytorch 2.5, so we also need the inference image with the same version:

$ docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.0-gpu-py311
Error response from daemon: manifest for 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.0-gpu-py311 not found: manifest unknown: Requested image not found

Describe the solution you'd like

I tried unsuccessfully to build it myself following the instructions in this repo.

roywei commented 2 weeks ago

Hi @philmod-h, 2.5.0 is being skipped because of an issue with AL2023; we will start working on 2.5.1 directly this week. In the meantime, you can extend the 2.4 DLC: uninstall its PyTorch binary, then install 2.5.1 or a nightly build.

Here is an example:
https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/distributed-training/pytorch/pytorchjob/fsdp/Dockerfile.llama2-efa-dlc#L31
https://aws.amazon.com/blogs/machine-learning/scale-llms-with-pytorch-2-0-fsdp-on-amazon-eks-part-2/

FYI, we don't recommend using PyTorch 2.5.0: https://github.com/pytorch/pytorch/issues/138324

To find out about new releases, see the release notifications guide: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notifications.html or check the list of available images for updates: https://github.com/aws/deep-learning-containers/blob/master/available_images.md
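
The workaround described above (extend the 2.4 DLC, uninstall its PyTorch binary, install 2.5.1) could look roughly like the sketch below. It is not an official recipe. The function name `render_dockerfile`, the torchvision/torchaudio versions, and the `cu124` wheel index are assumptions made for illustration. The base image tag is deliberately left as a parameter, since it has to be looked up in available_images.md, and the wheel index has to match the CUDA version of that base image.

```shell
#!/usr/bin/env bash
# Sketch only: prints a Dockerfile that swaps the PyTorch build inside an existing DLC image.
# render_dockerfile and all default values are illustrative assumptions, not part of the DLC repo.

render_dockerfile() {
  local base_image="$1"                  # an existing 2.4 inference DLC tag (see available_images.md)
  local torch_version="${2:-2.5.1}"      # version suggested in the comment above
  local vision_version="${3:-0.20.1}"    # assumed companion release for torch 2.5.1
  local audio_version="${4:-2.5.1}"      # assumed companion release for torch 2.5.1
  local index_url="${5:-https://download.pytorch.org/whl/cu124}"  # assumed CUDA 12.4 wheels

  cat <<EOF
FROM ${base_image}

# Remove the PyTorch packages that ship with the base image.
RUN pip uninstall -y torch torchvision torchaudio

# Install the replacement build from the PyTorch wheel index.
RUN pip install --no-cache-dir \\
    torch==${torch_version} \\
    torchvision==${vision_version} \\
    torchaudio==${audio_version} \\
    --index-url ${index_url}
EOF
}

# Usage (requires Docker and an ECR login; the image name is a placeholder):
#   render_dockerfile "<2.4-inference-dlc-image>" > Dockerfile
#   docker build -t pytorch-inference-custom:2.5.1 .
```

Other packages in the base image that were compiled against PyTorch 2.4 may need to be rebuilt or reinstalled after the swap, so test the resulting image before relying on it.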

sallyseok commented 2 weeks ago

Closing the issue. Please reopen it if there are still problems.

philmod-h commented 2 weeks ago

@sallyseok Why was this issue closed? The image still doesn't exist:

$ docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.0-gpu-py311
Error response from daemon: manifest for 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.0-gpu-py311 not found: manifest unknown: Requested image not found

$ docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.1-gpu-py311
Error response from daemon: manifest for 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.1-gpu-py311 not found: manifest unknown: Requested image not found

philmod-h commented 2 weeks ago

Opened a new issue as I cannot reopen this one: https://github.com/aws/deep-learning-containers/issues/4404