Open kiukchung opened 1 year ago
I have a few clarifying questions for you @kiukchung.
hey @kace
- Do you specifically want this level of control? Would something like using the Docker :latest tag for CPU and GPU images fit your use case?
Yeah having a pointer to "latest" would be awesome, but not strictly required if it's hard to implement. Today we manually update to newer versions of the software (hence the image tags) since we have to run validations that our jobs are compatible with (say) newer versions of torch.
That said, what would be super convenient is the ability to have the library tell us which DL image we "should" be using given infrastructure inputs like: host = "p4d.24xlarge" and AMI = "$DL_AMI".
- From your perspective, is installing and importing a library preferable over making a call to a public API?
Either way works (as long as there is a good way to mock out the service call if I were writing unittests). The only use-case I can think of where a library would be preferable to a service call is when I'm running on a sandboxed host (e.g. CI/CD) with no internet egress. But that also implies that I can't pull the DL container, so it's a moot point to run on sandboxed environments unless I explicitly mock out docker pull $DL_CONTAINER as well (e.g. unittests).
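To illustrate the mocking concern above, here is a minimal sketch of how a service-backed lookup could be stubbed out in a unit test. The names fetch_latest_image and build_pull_command, and the example.invalid image URL, are hypothetical placeholders and not part of any existing AWS library. The command is only assembled, never executed.

```python
from unittest import mock


def fetch_latest_image(framework: str, device_type: str) -> str:
    """Hypothetical service call; a real version would query a public API."""
    raise RuntimeError("no network access in this environment")


def build_pull_command(framework: str, device_type: str, fetch=fetch_latest_image) -> list:
    """Return the docker CLI arguments for pulling the resolved image.

    The lookup is injected through `fetch` so tests can replace it.
    """
    return ["docker", "pull", fetch(framework, device_type)]


# In a test, swap the service call for a mock that returns a fixed image URL.
fake_fetch = mock.Mock(return_value="example.invalid/pytorch-training:latest-gpu")
cmd = build_pull_command("pytorch", "gpu", fetch=fake_fetch)
```

Passing the lookup in as a parameter keeps the test independent of both the network and the docker daemon, which matches the sandboxed CI/CD case described above.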
Thanks for your response @kiukchung!
I can't imagine that it would be that difficult to add this tag. Would something like using :latest-gpu for the GPU processor image with the most up-to-date framework version make sense to you?
That said, what would be super convenient is the ability to have the library tell us which DL image we "should" be using given infrastructure inputs like: host = "p4d.24xlarge" and AMI = "$DL_AMI".
Not sure I understand what you mean by this. You want to know the recommended container for a specific combination of instance_type and AMI?
- I can't imagine that it would be that difficult to add this tag. Would something like using :latest-gpu for the GPU processor image with the most up-to-date framework version make sense to you?
Yep that makes sense. Were you thinking of making the hardware-type part of the tag (and not part of the image name)? I realize that hw-type is currently part of the tag (and not the image name). Either way is fine, it would just be a different template string to point to the "latest" tag. For instance: pytorch-training-$VER-$HW_TYPE:latest versus pytorch-training-$VER:latest-$HW_TYPE.
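As a rough illustration of the two naming options, the sketch below renders both template strings with Python's string.Template. The version "1.11.0" and hardware type "gpu" are example values only, and neither resulting tag is claimed to exist in the registry.

```python
from string import Template

# The two candidate layouts from the comment above.
HW_IN_NAME = Template("pytorch-training-$VER-$HW_TYPE:latest")
HW_IN_TAG = Template("pytorch-training-$VER:latest-$HW_TYPE")


def render(template: Template, ver: str, hw_type: str) -> str:
    """Fill in the version and hardware type for one naming layout."""
    return template.substitute(VER=ver, HW_TYPE=hw_type)


name_style = render(HW_IN_NAME, "1.11.0", "gpu")
tag_style = render(HW_IN_TAG, "1.11.0", "gpu")
print(name_style)  # pytorch-training-1.11.0-gpu:latest
print(tag_style)   # pytorch-training-1.11.0:latest-gpu
```

Either layout takes the same inputs, so switching between them on the client side would only mean swapping the template.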
Not sure I understand what you mean by this. You want to know the recommended container for a specific combination of instance_type and AMI?
Yea this is more for DL-containers for specific device types (hence instance types). For instance if you are using Habana, Graviton, Trainium instances you'd want to choose from a DL container built for those device types. Currently this choice is manual. Would be nice for there to be an API call where given infrastructure-level parameters (e.g. instance type) I can get a list of DL containers available to use on those instances:
>>> import aws.deeplearning_containers as dl_container
>>> dl_container.get_available_containers(instance_type="dl1.24xlarge", framework="pytorch")
[
DLContainer(framework="pytorch", job_type="training", device_type="HPU", python="3.8", url="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training-habana:1.10.2-hpu-py38-synapseai1.4.1-ubuntu20.04"),
DLContainer(framework="pytorch", job_type="training", device_type="HPU", python="3.8", url="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training-habana:1.11.0-hpu-py38-synapseai1.5.0-ubuntu20.04"),
]
The use-case is to select the base docker image to auto-build a user's local-workspace (locally checked out git repo with changes) based on the instance type that the user wants to launch the job onto.
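One way the requested lookup could be prototyped on the client side is sketched below. This is not an existing API. The catalog is hand-written from the two URLs in the example above, and the instance-family to device-type mapping (_DEVICE_BY_FAMILY) is an illustrative assumption that a real service would have to own and keep current.

```python
from dataclasses import dataclass

REGISTRY = "763104351884.dkr.ecr.us-west-2.amazonaws.com"


@dataclass(frozen=True)
class DLContainer:
    framework: str
    job_type: str
    device_type: str
    python: str
    url: str


# Hand-maintained catalog holding only the two entries shown in the example above.
_CATALOG = [
    DLContainer("pytorch", "training", "HPU", "3.8",
                f"{REGISTRY}/pytorch-training-habana:1.10.2-hpu-py38-synapseai1.4.1-ubuntu20.04"),
    DLContainer("pytorch", "training", "HPU", "3.8",
                f"{REGISTRY}/pytorch-training-habana:1.11.0-hpu-py38-synapseai1.5.0-ubuntu20.04"),
]

# Assumed mapping from instance family to device type (illustrative, incomplete).
_DEVICE_BY_FAMILY = {"dl1": "HPU", "p4d": "GPU"}


def get_available_containers(instance_type: str, framework: str) -> list:
    """Return catalog entries matching the framework and the instance's device type."""
    family = instance_type.split(".", 1)[0]
    device = _DEVICE_BY_FAMILY.get(family)
    return [c for c in _CATALOG
            if c.framework == framework and c.device_type == device]


containers = get_available_containers("dl1.24xlarge", "pytorch")
```

A build tool could then pick one entry, for example the last one, as the base image for the user's workspace. Unknown instance families simply yield an empty list in this sketch.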
- The reason I suggested an API is that it would be simpler for us to maintain and integrate into our CD. I'd prefer not to have to maintain this data in a dedicated lib.
Makes sense
This would be helpful - a way to programmatically get all supported training/inference containers.
Checklist
Concise Description:
I'm aware that there is a pattern to the DL image URLs that point to the ECR registry (e.g. 763104351884.dkr.ecr.{REGION}.amazonaws.com/{IMAGE_NAME}). But it would be nice to have a way to programmatically get the URLs for use-cases where we are programmatically generating Dockerfile or using docker.client (from python) instead of the docker CLI.
For instance:
Would print something like:
Is your feature request related to a problem? Please describe.
Makes programmatic usages easier
Describe the solution you'd like
See above for example UX
Describe alternatives you've considered
None of these solutions are great since the URL pattern could change
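For reference, formatting the URL by hand from the pattern quoted in the description might look like the sketch below. The helper name dl_image_url is hypothetical, and the account ID and image name are copied from the examples in this thread. The code hard-codes the pattern, which is exactly the fragility noted above.

```python
ACCOUNT_ID = "763104351884"  # taken from the example URL in the description


def dl_image_url(region: str, image_name: str) -> str:
    """Format an image URL from the pattern {ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com/{IMAGE_NAME}."""
    return f"{ACCOUNT_ID}.dkr.ecr.{region}.amazonaws.com/{image_name}"


url = dl_image_url(
    "us-west-2",
    "pytorch-training-habana:1.11.0-hpu-py38-synapseai1.5.0-ubuntu20.04",
)

# Use the URL as the base image of a programmatically generated Dockerfile.
dockerfile = f"FROM {url}\n"
print(dockerfile)
```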
Additional context N/A