NCCL WARN NET/OFI Request completed with error. RC: 21. Error: unknown error

Hello aws_ofi_nccl maintainers,

Please let me know if this is not the right place to post this, and I will close the issue.

I am unable to figure out why the process hangs after the error message is shown.

My training setup: two ml.g4dn.12xlarge instances on AWS SageMaker running distributed training with the PyTorch base image 763104351884.dkr.ecr.us-west-1.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker. Both instances run inside a private subnet with a NAT gateway attached to the subnet.

All outputs are from host-1.

Output of lspci -i efa:

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:06.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:07.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
00:08.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:09.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:1a.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
00:1b.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1c.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1d.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller

Output of cat /opt/amazon/efa_installed_packages:

# EFA installer version: 1.15.1
# Debug packages installed: no
# Packages installed:
efa-config_1.9_all efa-profile_1.5_all libfabric-aws-bin_1.14.0amzn1.0_amd64 libfabric-aws-dev_1.14.0amzn1.0_amd64 libfabric1-aws_1.14.0amzn1.0_amd64 openmpi40-aws_4.1.2-1_amd64 ibacm_39.0-1_amd64 ibverbs-providers_39.0-1_amd64 ibverbs-utils_39.0-1_amd64 infiniband-diags_39.0-1_amd64 libibmad-dev_39.0-1_amd64 libibmad5_39.0-1_amd64 libibnetdisc-dev_39.0-1_amd64 libibnetdisc5_39.0-1_amd64 libibumad-dev_39.0-1_amd64 libibumad3_39.0-1_amd64 libibverbs-dev_39.0-1_amd64 libibverbs1_39.0-1_amd64 librdmacm-dev_39.0-1_amd64 librdmacm1_39.0-1_amd64 rdma-core_39.0-1_amd64 rdmacm-utils_39.0-1_amd64

Output of /opt/amazon/efa/bin/fi_info -p efa:

provider: efa
    fabric: EFA-fe80::424:a9ff:fed5:b935
    domain: efa_0-rdm
    version: 114.10
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::424:a9ff:fed5:b935
    domain: efa_0-dgrm
    version: 114.10
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
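
To compare the two hosts, the fi_info output above can be reduced to the endpoint types the EFA provider reports. This is a minimal Python sketch, assuming the output has been captured as text; the helper name parse_fi_info is hypothetical and not part of libfabric.

```python
def parse_fi_info(text):
    """Parse `fi_info` text output into a list of dicts, one per provider entry."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only: fabric names contain IPv6 colons.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "provider":
            entries.append({"provider": value})
        elif entries:
            entries[-1][key] = value
    return entries


# Sample copied from the output above.
SAMPLE = """provider: efa
    fabric: EFA-fe80::424:a9ff:fed5:b935
    domain: efa_0-rdm
    version: 114.10
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::424:a9ff:fed5:b935
    domain: efa_0-dgrm
    version: 114.10
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
"""

entries = parse_fi_info(SAMPLE)
print([entry["type"] for entry in entries])
```

Running the same helper on the output from the second host makes it easy to check that both report the same domains and endpoint types.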

Output of the training job: distributed training is initialized with the NCCL backend in PyTorch using the mmaction2 training library. I set FI_EFA_USE_DEVICE_RDMA=0 because the T4 GPUs do not support RDMA. Also, the command is run via os.system() in the entrypoint passed to SageMaker. cmd=

NCCL_SOCKET_IFNAME=eth0 FI_PROVIDER="efa" FI_EFA_USE_DEVICE_RDMA=0 NCCL_DEBUG=INFO FI_LOG_LEVEL=warn FI_LOG_PROV=efa PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages:/opt/conda/lib/python3.8/site-packages/smdebug-1.0.22b20221214-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument-3.4.2-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument_cext-0.2.4-py3.8-linux-x86_64.egg:/opt/conda/lib/python3.8/site-packages/flash_attn-0.1-py3.8-linux-x86_64.egg:/opt/conda/lib/python3.8/site-packages/einops-0.6.0-py3.8.egg python -m torch.distributed.launch --nnodes=2 --node_rank=0  --master_addr=algo-1  --nproc_per_node=4  --master_port=7777  <train script> <config.py>
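
For readability, the same launch can be sketched in Python with the environment and arguments kept separate. This is only a sketch, not the actual entrypoint: it swaps os.system() for subprocess.run, omits the long PYTHONPATH value, and the names build_launch and launch, as well as the script and config arguments, are hypothetical placeholders.

```python
import os
import subprocess
import sys


def build_launch(node_rank, train_script, config, master_addr="algo-1"):
    """Return (env, argv) mirroring the command line shown above."""
    env = dict(os.environ)
    env.update({
        "NCCL_SOCKET_IFNAME": "eth0",
        "FI_PROVIDER": "efa",
        "FI_EFA_USE_DEVICE_RDMA": "0",  # T4 GPUs do not support RDMA
        "NCCL_DEBUG": "INFO",
        "FI_LOG_LEVEL": "warn",
        "FI_LOG_PROV": "efa",
    })
    argv = [
        sys.executable, "-m", "torch.distributed.launch",
        "--nnodes=2",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        "--nproc_per_node=4",
        "--master_port=7777",
        train_script,
        config,
    ]
    return env, argv


def launch(node_rank, train_script, config):
    """Run the launcher and raise CalledProcessError on a non-zero exit."""
    env, argv = build_launch(node_rank, train_script, config)
    subprocess.run(argv, env=env, check=True)
```

On the second host the only difference is node_rank=1; everything else stays the same.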

algo-1:462:462 [0] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:462:462 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:462:462 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:462:462 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
libfabric:462:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
algo-1:462:462 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:462:462 [0] NCCL INFO NET/OFI Selected Provider is efa
algo-1:462:462 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.3
algo-1:463:463 [1] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:464:464 [2] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:465:465 [3] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:465:465 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:465:465 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:464:464 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:463:463 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:464:464 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:463:463 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:465:465 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:464:464 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:463:463 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
libfabric:465:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
algo-1:465:465 [3] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:465:465 [3] NCCL INFO NET/OFI Selected Provider is efa
algo-1:465:465 [3] NCCL INFO Using network AWS Libfabric
libfabric:463:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
libfabric:464:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
algo-1:463:463 [1] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:463:463 [1] NCCL INFO NET/OFI Selected Provider is efa
algo-1:463:463 [1] NCCL INFO Using network AWS Libfabric
algo-1:464:464 [2] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:464:464 [2] NCCL INFO NET/OFI Selected Provider is efa
algo-1:464:464 [2] NCCL INFO Using network AWS Libfabric
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:463:557 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
algo-1:464:558 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
algo-1:465:556 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
algo-1:462:555 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7
algo-1:462:555 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7
algo-1:462:555 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/-1/-1->0->4
algo-1:462:555 [0] NCCL INFO Channel 00 : 7[1e0] -> 0[1b0] [receive] via NET/AWS Libfabric/0
algo-1:464:558 [2] NCCL INFO Channel 00 : 2[1d0] -> 3[1e0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 00 : 1[1c0] -> 2[1d0] via direct shared memory
algo-1:464:558 [2] NCCL INFO Channel 01 : 2[1d0] -> 3[1e0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 01 : 1[1c0] -> 2[1d0] via direct shared memory
algo-1:465:556 [3] NCCL INFO Channel 00 : 3[1e0] -> 4[1b0] [send] via NET/AWS Libfabric/0
algo-1:465:556 [3] NCCL INFO Channel 01 : 3[1e0] -> 4[1b0] [send] via NET/AWS Libfabric/0
algo-1:464:558 [2] NCCL INFO Connected all rings
algo-1:462:555 [0] NCCL INFO Channel 01 : 7[1e0] -> 0[1b0] [receive] via NET/AWS Libfabric/0
algo-1:462:555 [0] NCCL INFO Channel 00 : 0[1b0] -> 1[1c0] via direct shared memory
algo-1:462:555 [0] NCCL INFO Channel 01 : 0[1b0] -> 1[1c0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Connected all rings
algo-1:464:558 [2] NCCL INFO Channel 00 : 2[1d0] -> 1[1c0] via direct shared memory
algo-1:464:558 [2] NCCL INFO Channel 01 : 2[1d0] -> 1[1c0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 00 : 1[1c0] -> 0[1b0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 01 : 1[1c0] -> 0[1b0] via direct shared memory
libfabric:465:1673041637:efa:cq:rxr_cq_write_tx_error():243<warn> rxr_cq_write_tx_error: err: 21, prov_err: Unknown error -21 (21)
algo-1:465:556 [3] ofi_process_cq:1033 NCCL WARN NET/OFI Request 0x7f6390394d18 completed with error. RC: 21. Error: unknown error. Completed length: 0, Request: { buffer_index: 255, dev: 0, size: 0, state: CREATED, direction: SEND }

I see the same error on the algo-2 instance as well.
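
To compare the failing requests across both instances, the NCCL WARN line can be pulled apart into its fields. This is a minimal Python sketch, assuming the log line format shown above; the regular expression and the helper name parse_ofi_warn are hypothetical and not part of aws-ofi-nccl.

```python
import re

# Pattern written against the single WARN line in the log above.
WARN_RE = re.compile(
    r"NCCL WARN NET/OFI Request (?P<request>0x[0-9a-fA-F]+) completed with error\. "
    r"RC: (?P<rc>-?\d+)\. Error: (?P<error>.*?)\. "
    r"Completed length: (?P<length>\d+), Request: \{ (?P<fields>.*?) \}"
)


def parse_ofi_warn(line):
    """Return the fields of an NET/OFI error line as a dict, or None if no match."""
    match = WARN_RE.search(line)
    if match is None:
        return None
    info = {
        "request": match["request"],
        "rc": int(match["rc"]),
        "error": match["error"],
        "completed_length": int(match["length"]),
    }
    # The braces hold "key: value" pairs separated by ", ".
    for pair in match["fields"].split(", "):
        key, _, value = pair.partition(": ")
        info[key] = value
    return info


LINE = (
    "algo-1:465:556 [3] ofi_process_cq:1033 NCCL WARN NET/OFI Request "
    "0x7f6390394d18 completed with error. RC: 21. Error: unknown error. "
    "Completed length: 0, Request: { buffer_index: 255, dev: 0, size: 0, "
    "state: CREATED, direction: SEND }"
)

info = parse_ofi_warn(LINE)
print(info["rc"], info["state"], info["direction"])
```

For the line above, the parsed record has rc 21, state CREATED, and direction SEND, with a completed length of 0.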

PyTorch version and environment info printed by mmaction2:

2023-01-06 21:47:12,614 - mmaction - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
CUDA available: True
GPU 0,1,2,3: Tesla T4
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.12.1+cu113
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.13.1+cu113
OpenCV: 4.6.0
MMCV: 1.7.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.3
MMAction2: 0.24.1+
------------------------------------------------------------
2023-01-06 21:47:12,614 - mmaction - INFO - Distributed training: True
