aws / deep-learning-containers

AWS Deep Learning Containers are pre-built Docker images that make it easier to run popular deep learning frameworks and tools on AWS.
https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html

[bug] Using a Pytorch image seems to be causing an ArgParser bug -- "bash: cannot set terminal process group (-1): Inappropriate ioctl for device" #1617

Closed VictorJouault closed 7 months ago

VictorJouault commented 2 years ago

Concise Description: When using a Pytorch container (see below), I see a strange behavior, which seems to be causing ArgParser issues later on.

DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04

Current behavior:

At the very beginning of the job, the message below is printed (linked to this issue).

bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

This seems to break my script when it uses the ArgParser to read the hyperparameters for my model. The same parsing works with other images, but with PyTorch images it fails with the following error:

Traceback (most recent call last):
  File "experiment.py", line 358, in <module>
    args.hyper_params = json.loads(args.hyper_params)
  File "/opt/conda/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
2022-01-03 21:58:00,347 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2022-01-03 21:58:00,347 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/opt/conda/bin/python3.8 experiment.py --data-bucket sagemaker-us-east-1-XXX --data-prefix sample_dataset --estimator CustomEstimator --hyper-params {"prediction_length": 168, "context_length": 672, "trainer_kwargs": {"max_epochs": 200}} --job-config {}"
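One possible reading of the traceback, not confirmed anywhere in this thread, is that the JSON value reaches the script after shell-style word splitting, so --hyper-params receives only the first fragment with its double quotes stripped. The sketch below is a minimal illustration under that assumption; the command string is a shortened, hypothetical version of the one in the log, and shlex.split stands in for the shell.

```python
import json
import shlex

# Hypothetical, shortened version of the argument string from the log above.
command = '--hyper-params {"prediction_length": 168, "context_length": 672}'

# POSIX-style word splitting, standing in for what a shell does to an
# unquoted argument: split on whitespace and strip the double quotes.
tokens = shlex.split(command)

# The value following --hyper-params is now only the first JSON fragment.
hyper_params_arg = tokens[1]

try:
    json.loads(hyper_params_arg)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = str(exc)

print(tokens)
print(error_message)
```

If the assumption holds, the fragment starts with `{` followed by a bare word, which is the situation the "Expecting property name enclosed in double quotes" message describes.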

Expected behavior:

Additional context: I am mainly looking for help on how to run my model on SageMaker. Currently, my script requires both MXNet and PyTorch because I am using GluonTS. When using a PyTorch image, I run into this bug. When using an MXNet image, I run into a Horovod error, even though the image I use (763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.8.0-gpu-py37-cu110-ubuntu16.04) should be Horovod compatible.

Any suggestion appreciated, thanks!

dustin-liu-bgl commented 2 years ago

I am facing the same issue with 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04

lckr commented 2 years ago

Experiencing a similar behaviour on 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.9.1-gpu-py3.8-cu111-ubuntu20.04. Seems like the shell is not correctly configured in this image.

We're using the above image as a base image for a custom Docker container. In the Dockerfile we specify the default command of the final container via CMD some_command param1 param2 (shell form). Due to the misconfigured shell in the AWS image, running docker run [args] $CONTAINER leads to erroneous execution of the configured command. Using CMD ["some_command", "param1", "param2"] in the Dockerfile (exec form) avoids the erroneous execution, but the bash error messages are still printed.
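The shell form versus exec form distinction has a close analogue in Python's subprocess module, which may help when reproducing this outside Docker. The sketch below is only an illustration on a POSIX system: the child command and the hyperparameter values are hypothetical, and it does not reproduce the image's shell configuration. An argument list (exec-form analogue) reaches the child unchanged, while a single command string (shell-form analogue) is re-parsed by the shell, so every piece has to be quoted to survive.

```python
import json
import shlex
import subprocess
import sys

# Hypothetical hyperparameters, serialized the way they appear in the log.
hyper_params = json.dumps({"prediction_length": 168, "context_length": 672})

# Child process that prints its first argument exactly as it received it.
echo_arg = "import sys; print(sys.argv[1])"
argv = [sys.executable, "-c", echo_arg, hyper_params]

# Exec-form analogue: an argument list, no shell involved.
exec_form = subprocess.run(
    argv, capture_output=True, text=True, check=True
).stdout.strip()

# Shell-form analogue: one string parsed by /bin/sh, so each part is quoted.
shell_command = " ".join(shlex.quote(part) for part in argv)
shell_form = subprocess.run(
    shell_command, shell=True, capture_output=True, text=True, check=True
).stdout.strip()

print(exec_form)
print(shell_form)
```

With the quoting in place, both forms hand the same JSON string to the child; dropping shlex.quote from the shell-form string is what exposes the value to word splitting.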

sirutBuasai commented 7 months ago

Hi, we no longer support PyTorch 1.10 DLCs. We recommend upgrading to later PyTorch DLCs, see available_images.md for more information.

Feel free to reopen the ticket if the issue is still observed.