Toolkit for running PyTorch training scripts on SageMaker. Dockerfiles used for building SageMaker Pytorch Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0
Environment variables set for NCCL and Distributed training are not passed onto the sagemaker-training entrypoint #230
Describe the bug
At https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/88ca48a831bf4f099d4c57f3c18e0ff92fa2b48c/src/sagemaker_pytorch_container/training.py#L48 and https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/88ca48a831bf4f099d4c57f3c18e0ff92fa2b48c/src/sagemaker_pytorch_container/training.py#L50, some environment variables are set in os.environ for NCCL and distributed training. However, os.environ is not included when the entrypoint is called at https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/88ca48a831bf4f099d4c57f3c18e0ff92fa2b48c/src/sagemaker_pytorch_container/training.py#L71. Only training_environment.to_env_vars() is set as the env_vars for the entrypoint, essentially discarding the os.environ vars set in the above two lines for NCCL and distributed training.

Expected behavior
The env vars passed at https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/88ca48a831bf4f099d4c57f3c18e0ff92fa2b48c/src/sagemaker_pytorch_container/training.py#L71 should include the environment variables set for NCCL and distributed training.
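
One possible shape for the expected behavior is sketched below. It is only an illustration, not the toolkit's actual code: merge_env_vars is a hypothetical helper, the two dictionaries are stand-ins for os.environ and for the result of training_environment.to_env_vars(), and the variable names (NCCL_DEBUG, MASTER_ADDR, SM_NUM_GPUS, SM_CURRENT_HOST) are illustrative values rather than ones quoted in this issue.

```python
import os


def merge_env_vars(toolkit_env_vars, process_env=None):
    """Hypothetical helper: overlay the toolkit's env vars on the process environment.

    process_env defaults to os.environ, so variables set there earlier
    (for example for NCCL and distributed training) are kept.
    """
    merged = dict(os.environ if process_env is None else process_env)
    # Environment values must be strings; toolkit values win on a name collision.
    merged.update({key: str(value) for key, value in toolkit_env_vars.items()})
    return merged


# Stand-in for variables written to os.environ before the entrypoint is called.
fake_process_env = {"NCCL_DEBUG": "WARN", "MASTER_ADDR": "algo-1"}
# Stand-in for the dictionary returned by training_environment.to_env_vars().
fake_toolkit_env = {"SM_NUM_GPUS": 8, "SM_CURRENT_HOST": "algo-1"}

merged = merge_env_vars(fake_toolkit_env, fake_process_env)
```

The merged dictionary would then be what is passed as env_vars at the entrypoint call. Letting the toolkit values take precedence on a collision is a design choice made for this sketch; the opposite order may be preferable if os.environ is meant to override.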