Environment variables set for NCCL and Distributed training are not passed onto the sagemaker-training entrypoint - Githubissues

aws / sagemaker-pytorch-training-toolkit

Toolkit for running PyTorch training scripts on SageMaker. Dockerfiles used for building SageMaker Pytorch Containers are at https://github.com/aws/deep-learning-containers.

Apache License 2.0

197 stars 87 forks source link

Environment variables set for NCCL and Distributed training are not passed onto the sagemaker-training entrypoint #230

Closed thecooltechguy closed 3 years ago

thecooltechguy commented 3 years ago

Describe the bug At https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/88ca48a831bf4f099d4c57f3c18e0ff92fa2b48c/src/sagemaker_pytorch_container/training.py#L48 and https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/88ca48a831bf4f099d4c57f3c18e0ff92fa2b48c/src/sagemaker_pytorch_container/training.py#L50 some environment variables are set in os.environ for NCCL and distributed training.

However, os.environ is not included when the entrypoint is called at https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/88ca48a831bf4f099d4c57f3c18e0ff92fa2b48c/src/sagemaker_pytorch_container/training.py#L71. Only training_environment.to_env_vars() is set as the env_vars for the entrypoint, essentially discarding the os.environ vars set in the above 2 lines for NCCL and distributed training.

Expected behavior The env vars passed at https://github.com/aws/sagemaker-pytorch-training-toolkit/blob/88ca48a831bf4f099d4c57f3c18e0ff92fa2b48c/src/sagemaker_pytorch_container/training.py#L71 should include the environment variables set for NCCL and distributed training.

thecooltechguy commented 3 years ago

Actually, nevermind, this does set up the environment vars correctly, there was just a bug on my end that was overriding these variables.