Closed VictorJouault closed 7 months ago
I am facing the same issue with 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
Experiencing similar behaviour on 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.9.1-gpu-py3.8-cu111-ubuntu20.04.
Seems like the shell is not correctly configured in this image.
We're using the above image as a base image for a custom docker container.
In the Dockerfile we specify the default command of the final container via CMD some_command param1 param2 (shell form).
Due to the misconfigured shell in the AWS image, running docker run [args] $CONTAINER leads to erroneous execution of the configured command.
Using CMD ["some_command", "param1", "param2"] in the Dockerfile (exec form) avoids the erroneous execution, but the bash error messages are still printed.
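To illustrate why the two CMD forms behave differently, here is a minimal Python sketch of the argument vector Docker ends up executing in each case. It assumes the image keeps Docker's default Linux SHELL of ["/bin/sh", "-c"]; the helper names shell_form_argv and exec_form_argv are hypothetical and only serve the illustration.

```python
# Sketch of how Docker turns a CMD instruction into the process argv.
# Assumption: the image uses Docker's default Linux SHELL, ["/bin/sh", "-c"].
DEFAULT_SHELL = ["/bin/sh", "-c"]


def shell_form_argv(cmd_line, shell=DEFAULT_SHELL):
    """Shell form: the whole command line is handed to the shell as one string."""
    return list(shell) + [cmd_line]


def exec_form_argv(args):
    """Exec form: the JSON array is executed directly, with no shell in between."""
    return list(args)


print(shell_form_argv("some_command param1 param2"))
print(exec_form_argv(["some_command", "param1", "param2"]))
```

In shell form the command passes through a shell before it runs, so it is exposed to whatever is wrong with the shell setup in the image; in exec form the command and its parameters are passed through unchanged.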
Hi, we no longer support PyTorch 1.10 DLCs. We recommend upgrading to later PyTorch DLCs, see available_images.md for more information.
Feel free to reopen the ticket if the issue is still observed.
Concise Description: When using a PyTorch container (see below), I see strange behavior that seems to cause ArgParser issues later on.
DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
Current behavior:
When using a PyTorch container (see below), I see strange behavior that seems to cause ArgParser issues later on. At the very beginning of the job, the message below is printed (linked to this issue).
This seems to trigger a bug when I use the ArgParser to read my model's hyperparameters. The ArgParser works with other images, but with PyTorch images it produces the following bug:
Expected behavior:
Additional context: I am mainly looking for help on how to run my model on SageMaker. Currently, my script requires both MXNet and PyTorch because I am using GluonTS. When using a PyTorch image, I run into this bug. When using an MXNet image, I run into a Horovod error even though the image I use (763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.8.0-gpu-py37-cu110-ubuntu16.04) should be Horovod compatible. Any suggestions appreciated, thanks!
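On the ArgParser side, one possible way to make hyperparameter parsing more tolerant is argparse's parse_known_args, which returns unrecognized arguments instead of exiting. The sketch below assumes the hyperparameters reach the training script as command-line flags; the flag names --epochs and --learning-rate and the helper parse_hyperparameters are hypothetical, and this may not address the underlying shell problem in the image.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse known hyperparameters and collect anything unrecognized.

    The flag names below are placeholders for illustration only.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=1e-3)
    # parse_known_args does not abort on unexpected arguments;
    # it returns them as a list alongside the parsed namespace.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


# Example with an explicit argument list that includes an unexpected flag.
args, unknown = parse_hyperparameters(["--epochs", "3", "--unexpected", "x"])
print(args.epochs, args.learning_rate, unknown)
```

Logging the unknown list at the start of the job can also help show which stray arguments, if any, the container passes to the script.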