This issue also affects the latest PyTorch 2.2 image, `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker-v1.12`.

This issue does not affect the latest PyTorch 2.1 image, `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker-v1.7`.
Concise Description: The new `pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker-v1.0` GPU image fails to run distributed applications with NCCL due to a version conflict between NCCL and PyTorch.

DLC image/dockerfile: `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker-v1.0`
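For reference, one way to surface the mismatch inside the container is to print the NCCL version PyTorch is linked against and compare it with the NCCL libraries shipped in the image (a minimal sketch; the exact versions reported by the affected image are not part of this report):

```python
# Sketch: report the NCCL version PyTorch was built against inside the container.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
# Returns a version tuple such as (2, 20, 5); compare with the NCCL install in the image.
print("NCCL version seen by torch:", torch.cuda.nccl.version())
```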
Current behavior: Running our training script within the container produces an error report very similar to the following (with `NCCL_DEBUG=INFO`). The code fails at the first collective operation across the cluster. The training script is launched using `torchrun` with the `nccl` backend, 1 node, 1 process per node. The code passes this point when using the `gloo` backend.

Both our training script (not available publicly) and a minimal test script fail in the same way, both directly on the base DLC and on our custom container built on top of the DLC.

Crash log from the minimal script (copied below) follows:
Minimal script producing this traceback:
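A sketch of such a minimal reproduction (assumed, not the original script; a single-rank `all_reduce` launched with `torchrun --nproc_per_node=1 repro.py`, where `repro.py` is a hypothetical filename) would look like this:

```python
# Assumed minimal reproduction (sketch, not the original script).
# Launch inside the container: torchrun --nproc_per_node=1 repro.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun supplies RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # First collective across the cluster; this is where the affected image fails.
    t = torch.zeros(1, device="cuda")
    dist.all_reduce(t)

    # Expected output on success: 0
    print(int(t.item()))

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```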
Expected behavior: The minimal code above runs as expected and prints `0`.
Additional context: I reproduced this exact bug and traceback by accident in a `conda` environment outside a container, where the problem was that the `pytorch` package and friends were being installed from `conda-forge` rather than the `pytorch` channel. In that case it was solved by installing `triton` via `pip` instead of `conda`, which allowed the `pytorch` package to install from the `pytorch` channel correctly.

The runtime environment is Windows 11 23H2 with Docker Desktop running on WSL2 and a single CUDA GPU.