aws / deep-learning-containers

AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet.
https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html

[feature-request] Programmatic way to get image URLs #2732


kiukchung commented 1 year ago


Concise Description:

I'm aware that there is a pattern to the DLC image URLs that point to the ECR registry (e.g. 763104351884.dkr.ecr.{REGION}.amazonaws.com/{IMAGE_NAME}). But it would be nice to have a way to programmatically get the URLs for use-cases where we are generating Dockerfiles programmatically or using the Docker Python client (docker.client) instead of the docker CLI.

For instance:

import boto3
import aws.deeplearning_containers as dl_container

print(f"Current AWS region: {boto3.Session().region_name}")

print(dl_container.ecr_url)
print(dl_container.torch(version="1.13.1").arch("gpu").os("ubuntu20.04").training)
print(dl_container.torch(version="1.13.1").arch("gpu").os("ubuntu20.04").inference)

Would print something like:

Current AWS region: us-west-2
763104351884.dkr.ecr.us-west-2.amazonaws.com
763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.13.1-gpu-py39-ubuntu20.04-ec2
763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.13.1-gpu-py39-ubuntu20.04-ec2

NOTE: More or fewer options could be added (e.g. if only Ubuntu is supported, there is no need for the os option).

NOTE: The implementation could be as boring as simple string templates, but it would be nice to hook this into the DLC image build CI so that the CI auto-generates certain parts of the code based on where the CI job uploaded the Docker images and which images it built.

Is your feature request related to a problem? Please describe.

It would make programmatic usage easier.

Describe the solution you'd like

See the example UX above.

Describe alternatives you've considered

  1. Parse https://github.com/aws/deep-learning-containers/blob/master/available_images.md and write my own library
  2. Hard-code the base ECR URL and offer a few util methods that format a string template (see the sketch below)

Neither of these solutions is great, since the URL pattern could change.
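For reference, a minimal sketch of alternative 2, assuming the URL pattern currently listed in available_images.md holds; the account ID, repository naming, and tag layout below are observations, not a stable contract:

import boto3

# Observed DLC registry account; could change without notice, which is exactly the problem.
DLC_ACCOUNT = "763104351884"

def dlc_image_uri(framework, job_type, tag, region=None):
    # e.g. dlc_image_uri("pytorch", "training", "1.13.1-gpu-py39-ubuntu20.04-ec2")
    region = region or boto3.Session().region_name
    return f"{DLC_ACCOUNT}.dkr.ecr.{region}.amazonaws.com/{framework}-{job_type}:{tag}"

print(dlc_image_uri("pytorch", "training", "1.13.1-gpu-py39-ubuntu20.04-ec2", region="us-west-2"))
# 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.13.1-gpu-py39-ubuntu20.04-ec2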

Additional context

N/A

kace commented 1 year ago

I have a few clarifying questions for you @kiukchung.

  1. Do you specifically want this level of control? Would something like using the Docker :latest tag for CPU and GPU images fit your use case?
  2. From your perspective, is installing and importing a library preferable over making a call to a public API?
kiukchung commented 1 year ago

hey @kace

  1. Do you specifically want this level of control? Would something like using the Docker :latest tag for CPU and GPU images fit your use case?

Yeah, having a pointer to "latest" would be awesome, but it's not strictly required if it's hard to implement. Today we manually update to newer versions of the software (hence the image tags), since we have to validate that our jobs are compatible with (say) newer versions of torch.

That said, what would be super convenient is the ability to have the library tell us which DLC image we "should" be using given infrastructure inputs like host = "p4d.24xlarge" and AMI = "$DL_AMI".

  2. From your perspective, is installing and importing a library preferable over making a call to a public API?

Either way works (as long as there is a good way to mock out the service call if I were writing unit tests). The only use-case I can think of where a library would be preferable to a service call is when I'm running on a sandboxed host (e.g. CI/CD) with no internet egress. But that also implies that I can't pull the DLC, so it's a moot point to run in sandboxed environments unless I explicitly mock out docker pull $DL_CONTAINER as well (e.g. in unit tests).
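To illustrate the testing concern, a minimal sketch of mocking such a service call with unittest.mock; get_available_containers and pick_training_image here are hypothetical stand-ins for the proposed API, not anything real:

from unittest import mock

# Hypothetical wrapper around the proposed service call / library.
def get_available_containers(instance_type, framework):
    raise NotImplementedError("would hit the (proposed) public API")

def pick_training_image(instance_type, framework="pytorch"):
    containers = get_available_containers(instance_type, framework)
    return next(c["url"] for c in containers if c["job_type"] == "training")

# In a unit test, patch the wrapper so no network call (or docker pull) is needed.
with mock.patch(f"{__name__}.get_available_containers") as fake:
    fake.return_value = [{"job_type": "training",
                          "url": "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.13.1-gpu-py39-ubuntu20.04-ec2"}]
    assert "pytorch-training" in pick_training_image("p4d.24xlarge")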

kace commented 1 year ago

Thanks for your response @kiukchung!

  1. I can't imagine that it would be that difficult to add this tag. Would something like using :latest-gpu for the GPU processor image with the most up-to-date framework version make sense to you?

That said, what would be super convenient is the ability to have the library tell us which DLC image we "should" be using given infrastructure inputs like host = "p4d.24xlarge" and AMI = "$DL_AMI".

Not sure I understand what you mean by this. You want to know the recommended container for a specific combination of instance_type and AMI?

  2. The reason I suggested an API is that it would be simpler for us to maintain and integrate into our CD. I'd prefer not to have to maintain this data in a dedicated lib.
kiukchung commented 1 year ago
  1. I can't imagine that it would be that difficult to add this tag. Would something like using :latest-gpu for the GPU processor image with the most up-to-date framework version make sense to you?

Yep, that makes sense. Were you thinking of keeping the hardware type as part of the tag (rather than the image name)? I realize that the hw-type is currently part of the tag (and not the image name). Either way is fine; it would just be a different template string to point to the "latest" tag. For instance: pytorch-training-$VER-$HW_TYPE:latest versus pytorch-training-$VER:latest-$HW_TYPE.
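In other words, the two layouts would just be different templates (both are hypothetical; neither is the current tagging scheme, and both assume a "latest" tag exists):

from string import Template

name_style = Template("pytorch-training-$ver-$hw:latest")  # hw-type in the image name
tag_style = Template("pytorch-training-$ver:latest-$hw")   # hw-type in the tag

print(name_style.substitute(ver="1.13.1", hw="gpu"))  # pytorch-training-1.13.1-gpu:latest
print(tag_style.substitute(ver="1.13.1", hw="gpu"))   # pytorch-training-1.13.1:latest-gpu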

Not sure I understand what you mean by this. You want to know the recommended container for a specific combination of instance_type and AMI?

Yeah, this is more about DLCs built for specific device types (and hence instance types). For instance, if you are using Habana, Graviton, or Trainium instances, you'd want to choose from a DLC built for those device types. Currently this choice is manual. It would be nice to have an API call where, given infrastructure-level parameters (e.g. instance type), I can get a list of DLCs available to use on those instances:

>>> import aws.deeplearning_containers as dl_container

>>> dl_container.get_available_containers(instance_type="dl1.24xlarge", framework="pytorch")
[
  DLContainer(framework="pytorch", job_type="training", device_type="HPU", python="3.8", url="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training-habana:1.10.2-hpu-py38-synapseai1.4.1-ubuntu20.04"),
  DLContainer(framework="pytorch", job_type="training", device_type="HPU", python="3.8", url="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training-habana:1.11.0-hpu-py38-synapseai1.5.0-ubuntu20.04"),
]

The use-case is to select the base Docker image for auto-building a user's local workspace (a locally checked-out git repo with changes) based on the instance type that the user wants to launch the job onto.
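A sketch of that auto-build step, assuming the hypothetical get_available_containers API and DLContainer objects from the snippet above:

import aws.deeplearning_containers as dl_container  # hypothetical module from the examples above

def render_dockerfile(instance_type, framework="pytorch"):
    # Pick the first DLC matching the instance type and layer the user's
    # checked-out workspace on top of it.
    base = dl_container.get_available_containers(
        instance_type=instance_type, framework=framework)[0]
    return f"FROM {base.url}\nCOPY . /workspace\nWORKDIR /workspace\n"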

  2. The reason I suggested an API is that it would be simpler for us to maintain and integrate into our CD. I'd prefer not to have to maintain this data in a dedicated lib.

Makes sense

Alex-Wenner-FHR commented 6 months ago

This would be helpful: a way to programmatically get all supported training/inference containers.