aws / deep-learning-containers

AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet.
https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html

[bug] PyTorch 2.3 GPU SageMaker Training container includes an incompatible NCCL version #3964

Open ntw-au opened 4 weeks ago

ntw-au commented 4 weeks ago


Concise Description: The new pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker-v1.0 GPU image fails to run distributed applications with NCCL due to a version conflict between NCCL and PyTorch.

DLC image/dockerfile: 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker-v1.0

Current behavior: Running our training script inside the container produces an error report very similar to the one below (captured with NCCL_DEBUG=INFO). The code fails at the first collective operation across the cluster. The training script is launched with torchrun using the nccl backend, 1 node, and 1 process per node. The same code passes this point when using the gloo backend.

Both our training script (not publicly available) and a minimal test script fail in the same way, both on the base DLC directly and on our custom container built on top of it.

Crash log from the minimal script (shown further below):

```
[rank0]: Traceback (most recent call last):
[rank0]:   File "//test_distributed.py", line 7, in <module>
[rank0]:     dist.broadcast(value, 0)
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank0]:     work = default_pg.broadcast([tensor], opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1714328519311/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 999 'unknown error'
5ae07cd6429f:47:63 [0] include/alloc.h:125 NCCL WARN Cuda failure 500 'named symbol not found'
5ae07cd6429f:47:63 [0] NCCL INFO include/alloc.h:246 -> 1
5ae07cd6429f:47:63 [0] NCCL INFO comm 0x562adb38c000 rank 0 nranks 1 cudaDev 0 busId 21000 - Abort COMPLETE
E0529 02:53:45.931000 139875015935808 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 47) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.0', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
test_distributed.py FAILED
```

Minimal script producing this traceback:

```
import torch
import torch.distributed as dist

dist.init_process_group(backend='nccl')

value = torch.tensor([0], dtype=torch.int64, device='cuda:0')
dist.broadcast(value, 0)

print(value)

# torchrun --nnodes 1 --nproc_per_node 1 test_distributed.py
```
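For reference, the gloo variant mentioned above is simply the same script with the backend switched. This is only a sketch of the comparison case (the file name test_distributed_gloo.py is just a placeholder); switching backends is not a real fix for the NCCL issue:

```
import torch
import torch.distributed as dist

# Same minimal test, but with the gloo backend instead of nccl.
# This path does not exercise NCCL, and the broadcast completes.
dist.init_process_group(backend='gloo')

value = torch.tensor([0], dtype=torch.int64, device='cuda:0')
dist.broadcast(value, 0)  # gloo supports broadcast on CUDA tensors

print(value)

# torchrun --nnodes 1 --nproc_per_node 1 test_distributed_gloo.py
```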

Expected behavior: The minimal script above runs to completion and prints tensor([0], device='cuda:0').

Additional context: I accidentally reproduced this exact bug and traceback in a conda environment outside a container, where the cause was that pytorch and its companion packages were being installed from conda-forge rather than from the pytorch channel. There it was fixed by installing triton via pip instead of conda, which allowed pytorch to resolve from the pytorch channel correctly.

The runtime environment is Windows 11 23H2 with Docker Desktop running on WSL2 and a single CUDA GPU.
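To make the version mismatch easier to inspect, the versions PyTorch itself was built with can be printed inside the container using standard torch APIs (a diagnostic sketch only; the values are indicative):

```
import torch

# Report the versions the PyTorch build inside the image was compiled against.
print("torch version  :", torch.__version__)
print("CUDA (torch)   :", torch.version.cuda)
print("NCCL (torch)   :", torch.cuda.nccl.version())  # e.g. (2, 21, 5), matching the error above
print("CUDA available :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU            :", torch.cuda.get_device_name(0))
```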

ntw-au commented 3 weeks ago

This issue also affects the latest PyTorch 2.2 image, 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker-v1.12

```
root@9dc3329bb310:/# torchrun --nnodes 1 --nproc_per_node 1 test_distributed.py
3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0]
2.2.0
9dc3329bb310:48:48 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0
9dc3329bb310:48:48 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
9dc3329bb310:48:48 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
9dc3329bb310:48:48 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
9dc3329bb310:48:48 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.21.5+cuda12.1
9dc3329bb310:48:60 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
9dc3329bb310:48:60 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
9dc3329bb310:48:60 [0] NCCL INFO NET/OFI Using Libfabric version 1.20
9dc3329bb310:48:60 [0] NCCL INFO NET/OFI Using CUDA driver version 12020
9dc3329bb310:48:60 [0] NCCL INFO NET/OFI Configuring AWS-specific options
9dc3329bb310:48:60 [0] get_platform_type:110 NCCL WARN NET/OFI Error opening file: /sys/devices/virtual/dmi/id/product_name
9dc3329bb310:48:60 [0] NCCL INFO NET/OFI Setting provider_filter to efa
9dc3329bb310:48:60 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
9dc3329bb310:48:60 [0] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
9dc3329bb310:48:60 [0] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
9dc3329bb310:48:60 [0] NCCL INFO NET/OFI Internode latency set at 150.0 us
9dc3329bb310:48:60 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
9dc3329bb310:48:60 [0] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
9dc3329bb310:48:60 [0] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
9dc3329bb310:48:60 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0
9dc3329bb310:48:60 [0] NCCL INFO NET/IB : No device found.
9dc3329bb310:48:60 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ^docker0
9dc3329bb310:48:60 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
9dc3329bb310:48:60 [0] NCCL INFO Using non-device net plugin version 0
9dc3329bb310:48:60 [0] NCCL INFO Using network Socket
9dc3329bb310:48:60 [0] NCCL INFO ncclCommInitRank comm 0x55cd529909c0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 21000 commId 0xd993fd258c6d73c0 - Init START
9dc3329bb310:48:60 [0] NCCL INFO comm 0x55cd529909c0 rank 0 nRanks 1 nNodes 1 localRanks 1 localRank 0 MNNVL 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 00/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 01/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 02/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 03/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 04/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 05/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 06/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 07/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 08/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 09/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 10/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 11/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 12/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 13/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 14/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 15/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 16/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 17/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 18/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 19/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 20/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 21/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 22/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 23/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 24/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 25/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 26/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 27/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 28/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 29/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 30/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Channel 31/32 : 0
9dc3329bb310:48:60 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
9dc3329bb310:48:60 [0] NCCL INFO P2P Chunksize set to 131072
9dc3329bb310:48:60 [0] include/alloc.h:112 NCCL WARN Cuda failure 999 'unknown error'
9dc3329bb310:48:60 [0] NCCL INFO include/alloc.h:199 -> 1
9dc3329bb310:48:60 [0] NCCL INFO channel.cc:41 -> 1
9dc3329bb310:48:60 [0] NCCL INFO init.cc:526 -> 1
9dc3329bb310:48:60 [0] NCCL INFO init.cc:1259 -> 1
9dc3329bb310:48:60 [0] NCCL INFO init.cc:1548 -> 1
9dc3329bb310:48:60 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
9dc3329bb310:48:48 [0] NCCL INFO group.cc:418 -> 1
9dc3329bb310:48:48 [0] NCCL INFO group.cc:95 -> 1
Traceback (most recent call last):
  File "//test_distributed.py", line 11, in <module>
    dist.broadcast(value, 0)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1706743807255/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 999 'unknown error'
9dc3329bb310:48:48 [0] include/alloc.h:125 NCCL WARN Cuda failure 500 'named symbol not found'
9dc3329bb310:48:48 [0] NCCL INFO include/alloc.h:246 -> 1
9dc3329bb310:48:48 [0] NCCL INFO comm 0x55cd529909c0 rank 0 nranks 1 cudaDev 0 busId 21000 - Abort COMPLETE
[2024-05-30 23:12:45,728] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 48) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.0', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
test_distributed.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-05-30_23:12:45
  host : 9dc3329bb310
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 48)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
ntw-au commented 3 weeks ago

This issue does not affect the latest PyTorch 2.1 image, 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker-v1.7

```
root@0c8bbd57afdf:/# NCCL_DEBUG=warning torchrun --nnodes 1 --nproc_per_node 1 test_distributed.py
3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
2.1.0
tensor([0], device='cuda:0')
root@0c8bbd57afdf:/#
```