NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Azure MLOps Pipelining -- NCCL WARN [Rem Allocator] Allocation failed & include/alloc.h:48 #844

Closed monajalal closed 1 year ago

monajalal commented 1 year ago

I am training DOPE (Deep Object Pose Estimation) from NVIDIA on an Azure cluster. I made it distributed myself using PyTorch DistributedDataParallel, starting from scripts/train.py and going from there. Here's the gist for that, and I am not sure if it is fully correct (i.e. whether I have applied DDP correctly).

Original train.py code: https://github.com/NVlabs/Deep_Object_Pose/blob/master/scripts/train.py
DDP version of train.py modified by me: https://gist.github.com/monajalal/5b85c84db7c0f7a0f9250113da994d17
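For context, the wiring I am aiming for is roughly the following (a simplified stand-in sketch, not the actual DOPE code; the gist above has the full version):

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# AzureML's pytorch distribution sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# stand-in model/dataset; the real script uses the DOPE network and the FAT dataset
model = DDP(torch.nn.Linear(64, 64).cuda(local_rank), device_ids=[local_rank])
dataset = TensorDataset(torch.randn(128, 64))
sampler = DistributedSampler(dataset)  # shards the data across all 16 ranks
loader = DataLoader(dataset, batch_size=1, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for (x,) in loader:
        model(x.cuda(local_rank)).sum().backward()  # DDP all-reduces gradients here

dist.destroy_process_group()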

That said, I am using Azure MLOps Pipeline Templates for this task, on an Azure GPU cluster with 4 nodes, each with 4 K80 GPUs. Checking host-tools.log, I see that the GPUs still have plenty of free memory when the NCCL mem alloc failure appears, and even while I get that message, training keeps moving forward. Oddly, it happens even when I set the batch size per GPU to 1.
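For reference, the same free/total numbers can also be printed per rank from inside the training script; a minimal sketch:

import torch

# free/total device memory as reported by the driver (wraps cudaMemGetInfo)
free, total = torch.cuda.mem_get_info(torch.cuda.current_device())
print(f"GPU memory: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")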

I am using the entire FAT dataset from NVIDIA for training here, and the object of interest is the cracker box.

I cancelled the job, but here's the log from before I did so:

2023/05/16 19:24:26 WARNING mlflow.tracking.fluent: Exception raised while enabling autologging for sklearn: No module named 'sklearn.utils.testing'
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.4<0>
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO P2P plugin IBext
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO NET/IB : No device found.
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.4<0>
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_width, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_width, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_speed, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_width, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_speed, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_width, ignoring
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/000d3a02-42ea-000d-3a02-42ea000d3a02 is not a PCI device (vmbus). Attaching to first CPU
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Attribute coll of node net not found
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO === System : maxWidth 5.0 totalWidth 12.0 ===
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO CPU/0 (1/1/1)
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO + PCI[5000.0] - NIC/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO                 + NET[5.0] - NET/0 (0/0/5.000000)
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO + PCI[12.0] - GPU/100000 (0)
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO + PCI[12.0] - GPU/200000 (1)
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO + PCI[12.0] - GPU/300000 (2)
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO + PCI[12.0] - GPU/400000 (3)
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO ==========================================
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO GPU/100000 :GPU/100000 (0/5000.000000/LOC) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO GPU/200000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (0/5000.000000/LOC) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO GPU/300000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (0/5000.000000/LOC) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO GPU/400000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO NET/0 :GPU/100000 (3/5.000000/PHB) GPU/200000 (3/5.000000/PHB) GPU/300000 (3/5.000000/PHB) GPU/400000 (3/5.000000/PHB) CPU/0 (2/5.000000/PHB) NET/0 (0/5000.000000/LOC) 
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 5.000000/5.000000, type PHB/PHB, sameChannels 1
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO  0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 6.000000/5.000000, type PHB/PHB, sameChannels 1
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO  0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/8/-1
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Tree 1 : 4 -> 0 -> 1/-1/-1
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Ring 00 : 15 -> 0 -> 1
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Ring 01 : 15 -> 0 -> 1
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->4
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 00 : 15[400000] -> 0[100000] [receive] via NET/Socket/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 01 : 15[400000] -> 0[100000] [receive] via NET/Socket/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 00 : 0[100000] -> 1[200000] via direct shared memory
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 01 : 0[100000] -> 1[200000] via direct shared memory
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Connected all rings
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 01 : 0[100000] -> 4[100000] [send] via NET/Socket/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 00 : 8[100000] -> 0[100000] [receive] via NET/Socket/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 00 : 0[100000] -> 8[100000] [send] via NET/Socket/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Channel 01 : 4[100000] -> 0[100000] [receive] via NET/Socket/0
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO Connected all trees
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
696b7901f9044f23b786795c0ef7257b000000:39:293 [0] NCCL INFO comm 0x14fc50001240 rank 0 nranks 16 cudaDev 0 busId 100000 - Init COMPLETE
696b7901f9044f23b786795c0ef7257b000000:39:39 [0] NCCL INFO Launch mode Parallel
Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /root/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth
CPython
3.8.13
uname_result(system='Linux', node='696b7901f9044f23b786795c0ef7257b000000', release='5.15.0-1029-azure', version='#36~20.04.1-Ubuntu SMP Tue Dec 6 17:00:26 UTC 2022', machine='x86_64', processor='x86_64')
training script path:  /mnt/azureml/cr/j/29c1643998874ef7b6bfd57858b8c0ea/exe/wd
start: 19:24:26.818894
manual seed set to 4646
opt.checkpoints = /mnt/azureml/cr/j/29c1643998874ef7b6bfd57858b8c0ea/cap/data-capability/wd/checkpoints
world size is:  16
global rank is 0 and local_rank is 0
is_distributed is True and batch_size is 1
os.getpid() is 39 and initializing process group with {'MASTER_ADDR': '10.0.0.4', 'MASTER_PORT': '6105', 'LOCAL_RANK': '0', 'RANK': '0', 'WORLD_SIZE': '16'}
device is cuda:0
MLflow version: 1.25.1
Tracking URI: azureml:URI
Artifact URI: azureml:URI
load data
train data size:  246000
training data len:  246000
batch size is:  1
training data: 15375 batches
load models
torch.cuda.device_count():  4
type opt.gpuids: <class 'list'>
gpuids are: [0, 1, 2, 3]
Training network pretrained on imagenet.

  0%|          | 0.00/548M [00:00<?, ?B/s]

100%|██████████| 548M/548M [00:02<00:00, 259MB/s]

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.0.0.4<52638>
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO include/socket.h:445 -> 2
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO include/socket.h:457 -> 2
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:229 -> 2

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)
Train Epoch: 1 [0/246000 (0%)]  Loss: 0.036680415272713
epoch is: 1 and train loss is 0.03668041527271271
...
epoch is: 1 and train loss is 3.1753609164297814e-06
Train Epoch: 1 [2700/246000 (18%)]  Loss: 0.000002425569619
epoch is: 1 and train loss is 2.4255696189356968e-06
...
epoch is: 3 and train loss is 1.1219314899335586e-07
Train Epoch: 3 [5800/246000 (38%)]  Loss: 0.006392419338226

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.0.0.4<44028>
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO include/socket.h:445 -> 2
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO include/socket.h:457 -> 2
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:229 -> 2

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] include/alloc.h:48 NCCL WARN Cuda failure 'out of memory'
696b7901f9044f23b786795c0ef7257b000000:39:298 [0] NCCL INFO bootstrap.cc:231 -> 1

696b7901f9044f23b786795c0ef7257b000000:39:298 [0] bootstrap.cc:279 NCCL WARN [Rem Allocator] Allocation failed (segment 0, fd 130)
epoch is: 3 and train loss is 0.006392419338226318
Train Epoch: 3 [5900/246000 (38%)]  Loss: 0.000000082420129
...
epoch is: 3 and train loss is 1.4848103546682978e-07
Train Epoch: 3 [13100/246000 (85%)] Loss: 0.000000773773365

Here's my Dockerfile:

# check release notes https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
FROM nvcr.io/nvidia/pytorch:22.04-py3

##############################################################################
# NCCL TESTS
##############################################################################
ENV NCCL_TESTS_TAG=v2.11.0

# NOTE: adding gencodes to support K80, M60, V100, A100
RUN mkdir /tmp/nccltests && \
    cd /tmp/nccltests && \
    git clone -b ${NCCL_TESTS_TAG} https://github.com/NVIDIA/nccl-tests.git && \
    cd nccl-tests && \
    make \
    MPI=1 MPI_HOME=/opt/hpcx/ompi \
    NVCC_GENCODE="-gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
    CUDA_HOME=/usr/local/cuda && \
    cp ./build/* /usr/local/bin && \
    rm -rf /tmp/nccltests

# Install dependencies missing in this container
# NOTE: container already has matplotlib==3.5.1 tqdm==4.62.0
COPY requirements.txt ./
RUN pip install -r requirements.txt

# RUN python -m pip install   azureml-defaults==1.41.0 \
#     mlflow==1.25.1 \
#     azureml-mlflow==1.41.0 \
#     transformers==4.18.0 \
#     psutil==5.9.0

# add ndv4-topo.xml
RUN mkdir /opt/microsoft/
ADD ./ndv4-topo.xml /opt/microsoft

# to use on A100, enable env var below in your job
# ENV NCCL_TOPO_FILE="/opt/microsoft/ndv4-topo.xml"

# adjusts the level of info from NCCL tests
ENV NCCL_DEBUG="INFO"
ENV NCCL_DEBUG_SUBSYS="GRAPH,INIT,ENV"

# Relaxed Ordering can greatly help the performance of Infiniband networks in virtualized environments.
ENV NCCL_IB_PCI_RELAXED_ORDERING="1"
ENV CUDA_DEVICE_ORDER="PCI_BUS_ID"
ENV NCCL_SOCKET_IFNAME="eth0"

And here's the requirements.txt I am using for the Azure MLOps Pipeline Templates:

albumentations==1.3.0
ConfigParser==5.3.0
horovod==0.27.0
matplotlib==3.7.0
numpy==1.24.2
nvisii==1.1.72
Pillow==9.4.0
profiling==0.1.3
psutil==5.9.0
pyquaternion==0.9.9
pyrealsense2==2.53.1.4623
pyrender==0.1.45
pyrr==0.10.3
PyYAML==6.0
scipy==1.10.1
seaborn==0.12.2
simplejson==3.18.4
tensorboardX==2.6
torchvision==0.12.0
torch==1.11.0
tqdm==4.64.1
opencv-python-headless==4.1.2.30
transformers==4.18.0
mlflow==1.25.1
azureml-mlflow==1.41.0
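For completeness, the nccl-tests binaries baked into the image can be run as a standalone sanity check before training; a minimal single-node sketch, assuming the binaries are on PATH as the Dockerfile sets up:

import subprocess

# all_reduce_perf is built from nccl-tests and copied to /usr/local/bin in the Dockerfile;
# -b/-e sweep the message size, -f is the step factor, -g 4 uses all four local GPUs
subprocess.run(["all_reduce_perf", "-b", "8", "-e", "64M", "-f", "2", "-g", "4"], check=True)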
sjeaugey commented 1 year ago

These may come from services or network scanners connecting to your job and sending random data. Note that recent versions of NCCL have changed that code a lot, so they may avoid this kind of issue; I'd still make sure the interface used for NCCL bootstrap (through NCCL_SOCKET_IFNAME) is set to a private network address which is not accessible from the outside.

monajalal commented 1 year ago

@sjeaugey Hi Sylvain, thanks a lot for your response.

I am the only one running experiments in this workspace, though my company has an enterprise contract with Microsoft and there may be other people using deep learning, just not the clusters I made. So I am not sure what you meant exactly by the first part of your answer, or how I would diagnose that.

Also, is setting it like this wrong? ENV NCCL_SOCKET_IFNAME="eth0"

If not, I am not sure what I should set it to. Could you please help me with that?

Regarding newer versions of NCCL, here is my mlops --> azureml --> train --> train-env.yaml file:

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: nvidia_pytorch
build:
  path: ../../../data-science/environment/
tags:
  os: ubuntu
  os_version: 20.04
  hpcx: 2.10
  mpi: openmpi
  mpi_version: 4.1.2rc4
  ucx: 1.12.0
  cuda: 11.6.2
  cudnn: 8.4.0.27
  nccl: 2.12.10
  rdma_core: 36.0
  nsight_compute: 2022.1.1.2
  nsight_systems: "2022.2.1.31-5fe97ab"
  nccl_test: 2.11.0
  azureml-defaults: 1.41.0
  mlflow: 1.25.1
  transformers: 4.18.0

What should I change the nccl version to? Is there anything else I should change in this file?

sjeaugey commented 1 year ago

I'm not sure what other versions are available, but we're currently at NCCL 2.18. It is fairly recent, so I'm not sure it would be available, but 2.17.1 or 2.16.5 could be good versions to try.

I don't know how the workspace is set up and whether eth0 is accessible from the outside. But if it is, it should not be used for NCCL, as NCCL opens ports for inter-rank communication which are not secured. Most of the time when we see this kind of error, it's due to services trying to open ports on every server they can access, finding that open NCCL port, and failing there. It should not break the application though, just generate WARNs.

Now if you think this is likely not the case, it could simply be that those calls are made by NCCL for a good reason, and you're just running out of memory. You may try setting NCCL_P2P_PXN_LEVEL=0 to avoid those specific memory allocations; performance of small size alltoall operations might be affected negatively though (if that matters to your use case).
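In Python that means setting it before the first NCCL communicator is created; a minimal sketch:

import os

# NCCL reads this when the communicator is created, so set it before init_process_group
os.environ["NCCL_P2P_PXN_LEVEL"] = "0"

import torch.distributed as dist
dist.init_process_group(backend="nccl", init_method="env://")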

If you're still getting an out-of-memory issue after disabling PXN for P2P, then it could be that your application is not leaving enough memory for NCCL to operate. There might be a way to tell your application to use less memory or give more space to libraries. Recent versions of NCCL may also allocate all that memory during init (instead of during the first send/receive operation), which may work better with the framework. That's another reason to try a newer version.

monajalal commented 1 year ago

@sjeaugey thanks a lot for your follow-up response.

Do I only need to change the NCCL version in the train-env.yaml below, or do I need to change other things as well? For example, do I need to change the python, pytorch, rdma_core, cuda, nsight_compute, nsight_systems, or nccl_test entries below too? Is there a compatibility matrix I could follow? Let's say I want to use nccl 2.17.1.

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: nvidia_pytorch
build:
  path: ../../../data-science/environment/
tags:
  os: ubuntu
  os_version: 20.04
  hpcx: 2.10
  mpi: openmpi
  mpi_version: 4.1.2rc4
  ucx: 1.12.0
  cuda: 11.6.2
  cudnn: 8.4.0.27
  nccl: 2.12.10
  rdma_core: 36.0
  nsight_compute: 2022.1.1.2
  nsight_systems: "2022.2.1.31-5fe97ab"
  nccl_test: 2.11.0
  # azureml-defaults: 1.41.0
  # mlflow: 1.25.1
  azureml-defaults: 1.50.0
  mlflow: 2.3.2
  transformers: 4.18.0
monajalal commented 1 year ago

@sjeaugey Also, is there a suggested version of torch, torchvision, and cuda to use with nccl==2.17.1?

I have these now:

 # for local testing (cpu)
torchvision==0.12.0
torch==1.11.0
transformers==4.18.0

# for metrics reporting/plotting
# mlflow==1.25.1
# azureml-mlflow==1.41.0
mlflow==2.3.2
azureml-mlflow==1.50.0
matplotlib==3.5.2
tqdm==4.64.0
psutil==5.9.0

# for unit testing
pytest==7.1.2
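For what it's worth, a quick way to see which versions the job actually loads at runtime, independent of the environment tags:

import torch

print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("NCCL bundled with torch:", torch.cuda.nccl.version())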
sjeaugey commented 1 year ago

I haven't tried any of that myself. All I can say is that NCCL is backwards compatible, so if you can replace NCCL with a newer version it should work.

If that's problematic, you can at least try running with NCCL_P2P_PXN_LEVEL=0. Maybe it will solve the problem.

monajalal commented 1 year ago

@sjeaugey Hi Sylvain, you are correct. I just changed the NCCL version to 2.17.1 and it worked. However, I am now getting a network error. Is this also related to NCCL? I am using env:// for dist_url.

5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO Bootstrap : Using eth0:10.0.0.5<0>
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO P2P plugin IBext
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO NET/IB : No device found.
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO Using network Socket
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_width, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_width, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_speed, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_width, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_speed, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_width, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/6045bd7e-4468-6045-bd7e-44686045bd7e is not a PCI device (vmbus). Attaching to first CPU
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Attribute coll of node net not found
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO === System : maxWidth 5.0 totalWidth 12.0 ===
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO CPU/0 (1/1/1)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO + PCI[5000.0] - NIC/0
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO                 + NET[5.0] - NET/0 (0/0/5.000000)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO + PCI[12.0] - GPU/100000 (4)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO + PCI[12.0] - GPU/200000 (5)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO + PCI[12.0] - GPU/300000 (6)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO + PCI[12.0] - GPU/400000 (7)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO ==========================================
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO GPU/100000 :GPU/100000 (0/5000.000000/LOC) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO GPU/200000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (0/5000.000000/LOC) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO GPU/300000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (0/5000.000000/LOC) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO GPU/400000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO NET/0 :GPU/100000 (3/5.000000/PHB) GPU/200000 (3/5.000000/PHB) GPU/300000 (3/5.000000/PHB) GPU/400000 (3/5.000000/PHB) CPU/0 (2/5.000000/PHB) NET/0 (0/5000.000000/LOC)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 5.000000/5.000000, type PHB/PHB, sameChannels 1
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO  0 : NET/0 GPU/4 GPU/5 GPU/6 GPU/7 NET/0
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 6.000000/5.000000, type PHB/PHB, sameChannels 1
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO  0 : NET/0 GPU/4 GPU/5 GPU/6 GPU/7 NET/0
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Ring 00 : 6 -> 7 -> 8
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Ring 01 : 6 -> 7 -> 8
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Setting affinity for GPU 3 to 0fff
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Channel 00 : 7[400000] -> 8[100000] [send] via NET/Socket/0
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Channel 01 : 7[400000] -> 8[100000] [send] via NET/Socket/0
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Connected all rings
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Channel 00 : 7[400000] -> 6[300000] via direct shared memory
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Channel 01 : 7[400000] -> 6[300000] via direct shared memory
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Connected all trees
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO comm 0x14ec70001240 rank 7 nranks 16 cudaDev 3 busId 400000 - Init COMPLETE
NCCL version is:  (2, 10, 3)
MLflow version: 2.3.2
Tracking URI: azureml:URI
Artifact URI: azureml:URI
World size: 16
local rank is 3 and world rank is 7
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data_7/cifar-10-python.tar.gz
Failed download. Trying https -> http instead. Downloading http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data_7/cifar-10-python.tar.gz

  0%|          | 0/170498071 [00:00<?, ?it/s]
  0%|          | 509952/170498071 [00:00<00:34, 4973408.07it/s]
...
170499072it [00:03, 51834695.76it/s]                              
Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
Extracting ./data_7/cifar-10-python.tar.gz to ./data_7
Files already downloaded and verified
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/opt/conda/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/opt/conda/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/opt/conda/lib/python3.8/http/client.py", line 1418, in connect
    super().connect()
  File "/opt/conda/lib/python3.8/http/client.py", line 922, in connect
    self.sock = self._create_connection(
  File "/opt/conda/lib/python3.8/socket.py", line 787, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/opt/conda/lib/python3.8/socket.py", line 918, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 93, in <module>
    model = torchvision.models.resnet50(pretrained=True)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 331, in resnet50
    return _resnet("resnet50", Bottleneck, [3, 4, 6, 3], pretrained, progress, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 296, in _resnet
    state_dict = load_state_dict_from_url(model_urls[arch], progress=progress)
  File "/opt/conda/lib/python3.8/site-packages/torch/hub.py", line 591, in load_state_dict_from_url
    download_url_to_file(url, cached_file, hash_prefix, progress=progress)
  File "/opt/conda/lib/python3.8/site-packages/torch/hub.py", line 457, in download_url_to_file
    u = urlopen(req)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/opt/conda/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 1397, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/opt/conda/lib/python3.8/urllib/request.py", line 1357, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -3] Temporary failure in name resolution>
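Since the compute nodes apparently cannot resolve download.pytorch.org, one workaround would be to stage the checkpoint into the image or a mounted datastore ahead of time and load it locally; a sketch, with a hypothetical path:

import torch
import torchvision

# build the architecture without triggering any download
model = torchvision.models.resnet50(pretrained=False)
# /opt/weights/resnet50-0676ba61.pth is a hypothetical pre-staged copy of the checkpoint
state_dict = torch.load("/opt/weights/resnet50-0676ba61.pth", map_location="cpu")
model.load_state_dict(state_dict)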
sjeaugey commented 1 year ago

That doesn't seem related to NCCL; it looks like a network issue?

monajalal commented 1 year ago

A friend told me it could be because I set dist_url for init_process_group to env://, but I used that based on the PyTorch docs for DDP. He said I may have to use a file-based URI instead. I am also using eth0; I am not sure.

This is my Dockerfile. Is eth0 wrong for NCCL_SOCKET_IFNAME? I am not sure how I should figure out the right value for my cluster.

# check release notes https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
FROM nvcr.io/nvidia/pytorch:22.04-py3

##############################################################################
# NCCL TESTS
##############################################################################
ENV NCCL_TESTS_TAG=v2.11.0

# NOTE: adding gencodes to support K80, M60, V100, A100
RUN mkdir /tmp/nccltests && \
    cd /tmp/nccltests && \
    git clone -b ${NCCL_TESTS_TAG} https://github.com/NVIDIA/nccl-tests.git && \
    cd nccl-tests && \
    make \
    MPI=1 MPI_HOME=/opt/hpcx/ompi \
    NVCC_GENCODE="-gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
    CUDA_HOME=/usr/local/cuda && \
    cp ./build/* /usr/local/bin && \
    rm -rf /tmp/nccltests

# Install dependencies missing in this container
# NOTE: container already has matplotlib==3.5.1 tqdm==4.62.0
COPY requirements.txt ./
RUN pip install -r requirements.txt

# add ndv4-topo.xml
RUN mkdir /opt/microsoft/
ADD ./ndv4-topo.xml /opt/microsoft

# to use on A100, enable env var below in your job
# ENV NCCL_TOPO_FILE="/opt/microsoft/ndv4-topo.xml"

# adjusts the level of info from NCCL tests
ENV NCCL_DEBUG="INFO"
ENV NCCL_DEBUG_SUBSYS="GRAPH,INIT,ENV"

# Relaxed Ordering can greatly help the performance of Infiniband networks in virtualized environments.
ENV NCCL_IB_PCI_RELAXED_ORDERING="1"
ENV CUDA_DEVICE_ORDER="PCI_BUS_ID"
ENV NCCL_SOCKET_IFNAME="eth0"
ENV NCCL_P2P_PXN_LEVEL="0"
# ENV NCCL_SOCKET_IFNAME='lo'
ENV NCCL_IB_DISABLE="1"

I also have this in my pipeline.yaml file:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

outputs:
  checkpoints:
    type: uri_folder
    mode: upload
    path: azureml:cifar10_resnet50_model:1

# <jobs>
settings:
  default_datastore: azureml:arddatalake
  continue_on_step_failure: true

jobs:
  train:
    type: command
    component: file:train.yaml
    compute: azureml:multinode-multigpu16
    resources:
      instance_count: 4 #number of nodes
    distribution:
      type: pytorch
      process_count_per_instance: 4 #number of GPUs per instance

    # NOTE: set env var if needed
    environment_variables:
      NCCL_DEBUG: "INFO" # adjusts the level of info from NCCL tests

      # NCCL_TOPO_FILE: "/opt/microsoft/ndv4-topo.xml" # Use specific topology file for A100

      # NCCL_IB_PCI_RELAXED_ORDERING: "1" # Relaxed Ordering can greatly help the performance of Infiniband networks in virtualized environments.
      NCCL_IB_DISABLE: "1" # force disable infiniband (if set to "1")
      # NCCL_NET_PLUGIN: "none" # to force NET/Plugin off (no rdma/sharp plugin at all)
      # NCCL_NET: "Socket" # to force node-to-node comm to use Socket (slow)
      NCCL_SOCKET_IFNAME: "eth0" # to force Socket comm to use eth0 (use NCCL_NET=Socket)
      # NCCL_SOCKET_IFNAME: "lo"

      # UCX_IB_PCI_RELAXED_ORDERING: "on"
      # UCX_TLS: "tcp"
      # UCX_NET_DEVICES: "eth0" # if you have Error: Failed to resolve UCX endpoint...
      NCCL_P2P_PXN_LEVEL: "0"

      CUDA_DEVICE_ORDER: "PCI_BUS_ID" # ordering of gpus  # do we need to uncomment this? why?

      TORCH_DISTRIBUTED_DEBUG: "DETAIL"

# </jobs>
sjeaugey commented 1 year ago

The IP interfaces can be listed using ifconfig or ip -4 addr show scope global (remove -4 if it returns nothing).
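The same can be checked from inside the container with psutil (already in your requirements); a minimal sketch:

import ipaddress
import socket

import psutil

# list IPv4 interfaces; NCCL_SOCKET_IFNAME should point at one with a private address
for name, addrs in psutil.net_if_addrs().items():
    for addr in addrs:
        if addr.family == socket.AF_INET:
            kind = "private" if ipaddress.ip_address(addr.address).is_private else "PUBLIC"
            print(name, addr.address, kind)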

monajalal commented 1 year ago

I will get back to you on that soon. One thing I noticed in the Dockerfile that is automatically produced for us: the topology XML file is only supposed to be uncommented if we are using A100s. Do I still need to find a correct XML topology file for 4 nodes, each with 4 K80 GPUs? I am specifically talking about these lines below from the Dockerfile:

# add ndv4-topo.xml
RUN mkdir /opt/microsoft/
ADD ./ndv4-topo.xml /opt/microsoft

# to use on A100, enable env var below in your job
# ENV NCCL_TOPO_FILE="/opt/microsoft/ndv4-topo.xml"

monajalal commented 1 year ago

@sjeaugey Sylvain, here's what I see. So was eth0 correct based on that? If so, I am not sure why I get a network error in a multi-node multi-GPU setting with DDP when the backend is NCCL.

azureuser@NUMBER:~$ ifconfig
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet NUMBER  netmask NUMBER  broadcast NUMBER
        ether 02:42:34:49:0b:f1  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet NUMBER  netmask NUMBER  broadcast NUMBER
        NUMBER  prefixlen 64  scopeid 0x20<link>
        ether NUMBER  txqueuelen 1000  (Ethernet)
        RX packets 6663965  bytes 9332760626 (9.3 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 131910  bytes 13157125 (13.1 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet NUMBER  netmask NUMBER
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 3509  bytes 671319 (671.3 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3509  bytes 671319 (671.3 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

azureuser@NUMBER:~$ ip -4 addr show scope global
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    inet NUMBER/19 brd NUMBER scope global eth0
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    inet NUMBER/16 brd NUMBER scope global docker0
       valid_lft forever preferred_lft forever
monajalal commented 1 year ago

@sjeaugey I set the NCCL version to 2.17.1, but the log still prints NCCL version 2.10.3+cuda10.2, and Python likewise reports the old version when using print("NCCL version is: ", torch.cuda.nccl.version())

b2f0020da4b4453d9bc7352c13cfafd2000000:37:37 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
b2f0020da4b4453d9bc7352c13cfafd2000000:37:37 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.4<0>
b2f0020da4b4453d9bc7352c13cfafd2000000:37:37 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
b2f0020da4b4453d9bc7352c13cfafd2000000:37:37 [0] NCCL INFO P2P plugin IBext
b2f0020da4b4453d9bc7352c13cfafd2000000:37:37 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
b2f0020da4b4453d9bc7352c13cfafd2000000:37:37 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
b2f0020da4b4453d9bc7352c13cfafd2000000:37:37 [0] NCCL INFO NET/IB : No device found.
b2f0020da4b4453d9bc7352c13cfafd2000000:37:37 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
b2f0020da4b4453d9bc7352c13cfafd2000000:37:37 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
b2f0020da4b4453d9bc7352c13cfafd2000000:37:37 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.4<0>
b2f0020da4b4453d9bc7352c13cfafd2000000:37:37 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_width, ignoring
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_width, ignoring
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_speed, ignoring
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_width, ignoring
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_speed, ignoring
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_width, ignoring
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/6045bd85-1d09-6045-bd85-1d096045bd85 is not a PCI device (vmbus). Attaching to first CPU
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Attribute coll of node net not found
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO === System : maxWidth 5.0 totalWidth 12.0 ===
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO CPU/0 (1/1/1)
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO + PCI[5000.0] - NIC/0
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO                 + NET[5.0] - NET/0 (0/0/5.000000)
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO + PCI[12.0] - GPU/100000 (0)
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO + PCI[12.0] - GPU/200000 (1)
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO + PCI[12.0] - GPU/300000 (2)
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO + PCI[12.0] - GPU/400000 (3)
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO ==========================================
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO GPU/100000 :GPU/100000 (0/5000.000000/LOC) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO GPU/200000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (0/5000.000000/LOC) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO GPU/300000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (0/5000.000000/LOC) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO GPU/400000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO NET/0 :GPU/100000 (3/5.000000/PHB) GPU/200000 (3/5.000000/PHB) GPU/300000 (3/5.000000/PHB) GPU/400000 (3/5.000000/PHB) CPU/0 (2/5.000000/PHB) NET/0 (0/5000.000000/LOC) 
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 5.000000/5.000000, type PHB/PHB, sameChannels 1
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO  0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 6.000000/5.000000, type PHB/PHB, sameChannels 1
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO  0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/8/-1
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Tree 1 : 4 -> 0 -> 1/-1/-1
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Ring 00 : 15 -> 0 -> 1
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Ring 01 : 15 -> 0 -> 1
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->4
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Channel 00 : 15[400000] -> 0[100000] [receive] via NET/Socket/0
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Channel 01 : 15[400000] -> 0[100000] [receive] via NET/Socket/0
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Channel 00 : 0[100000] -> 1[200000] via direct shared memory
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Channel 01 : 0[100000] -> 1[200000] via direct shared memory
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Connected all rings
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Channel 01 : 0[100000] -> 4[100000] [send] via NET/Socket/0
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Channel 00 : 8[100000] -> 0[100000] [receive] via NET/Socket/0
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Channel 00 : 0[100000] -> 8[100000] [send] via NET/Socket/0
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Channel 01 : 4[100000] -> 0[100000] [receive] via NET/Socket/0
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO Connected all trees
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
b2f0020da4b4453d9bc7352c13cfafd2000000:37:208 [0] NCCL INFO comm 0x14b4e4001240 rank 0 nranks 16 cudaDev 0 busId 100000 - Init COMPLETE
b2f0020da4b4453d9bc7352c13cfafd2000000:37:37 [0] NCCL INFO Launch mode Parallel
NCCL version is:  (2, 10, 3)
MLflow version: 2.3.2
Tracking URI: URI
Artifact URI: URI
World size: 16
local rank is 0 and world rank is 0
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data_0/cifar-10-python.tar.gz

  0%|          | 0/170498071 [00:00<?, ?it/s]
[... tqdm download progress trimmed ...]
170499072it [00:25, 6659623.04it/s]                               
Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
Extracting ./data_0/cifar-10-python.tar.gz to ./data_0
Files already downloaded and verified

  0%|          | 0.00/97.8M [00:00<?, ?B/s]
 10%|█         | 9.97M/97.8M [00:00<00:00, 105MB/s]
 20%|██        | 19.9M/97.8M [00:00<00:00, 101MB/s]
 35%|███▍      | 33.9M/97.8M [00:00<00:00, 120MB/s]
 48%|████▊     | 46.6M/97.8M [00:00<00:00, 125MB/s]
 63%|██████▎   | 61.2M/97.8M [00:00<00:00, 135MB/s]
 78%|███████▊  | 75.9M/97.8M [00:00<00:00, 135MB/s]
 91%|█████████ | 88.8M/97.8M [00:00<00:00, 127MB/s]
100%|██████████| 97.8M/97.8M [00:00<00:00, 129MB/s]
WARNING:urllib3.connectionpool:Retrying (Retry(total=4, connect=5, read=4, redirect=5, status=5)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /mlflow/v2.0/subscriptions/9be1367a-bcc9-4275-8b3d-a0469f4119fa/resourceGroups/some_name/providers/Microsoft.MachineLearningServices/workspaces/some_name/api/2.0/mlflow/runs/log-metric
[Epoch 1] loss: 557.058
[Epoch 2] loss: 407.764
[Epoch 3] loss: 339.983

However, this is what I set it to in train-env.yaml:

$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: nvidia_pytorch
build:
  path: ../../../data-science/environment/
tags:
  os: ubuntu
  os_version: 20.04
  hpcx: 2.10
  mpi: openmpi
  mpi_version: 4.1.2rc4
  ucx: 1.12.0
  cuda: 11.6.2
  cudnn: 8.4.0.27
  # nccl: 2.12.10
  nccl: 2.17.1
  rdma_core: 36.0
  nsight_compute: 2022.1.1.2
  nsight_systems: "2022.2.1.31-5fe97ab"
  nccl_test: 2.11.0
  # azureml-defaults: 1.41.0
  # mlflow: 1.25.1
  azureml-defaults: 1.50.0
  mlflow: 2.3.2
  transformers: 4.18.0
sjeaugey commented 1 year ago

Looks like this version of PyTorch is linked statically with NCCL 2.10.3. No matter what NCCL version you install, it will use the one it embeds.

How does it work with 2.10.3? I don't see any error?
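
One way to see the mismatch directly is to compare the NCCL version embedded in PyTorch with the system libnccl (a minimal sketch, assuming libnccl.so.2 is on the loader path; ncclGetVersion is NCCL's standard C API):

import ctypes
import torch

# NCCL version PyTorch was linked against -- this is what collectives use.
print("Embedded NCCL:", torch.cuda.nccl.version())

# NCCL version of the system library, queried through its C API.
lib = ctypes.CDLL("libnccl.so.2")
v = ctypes.c_int()
lib.ncclGetVersion(ctypes.byref(v))
code = v.value  # encoded as major*10000 + minor*100 + patch for NCCL >= 2.9
print("System NCCL:", (code // 10000, code % 10000 // 100, code % 100))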

monajalal commented 1 year ago

So I changed the NCCL version to 2.18.1; however, the runtime still reports 2.14.3, and I am not sure how to fix this. I asked on the PyTorch forum but got no solution. It looks like an NCCL problem. I am not able to use a higher CUDA version since I have K80 GPUs.

total 0
NCCL version is:  (2, 14, 3)
System information: Linux #36~20.04.1-Ubuntu SMP Tue Dec 6 17:00:26 UTC 2022
Python version: 3.8.10
MLflow version: 2.3.2
MLflow module location: /usr/local/lib/python3.8/dist-packages/mlflow/__init__.py
Tracking URI: URI
Registry URI: URI
MLflow environment variables: 
  MLFLOW_DISABLE_ENV_MANAGER_CONDA_WARNING: True
  MLFLOW_EXPERIMENT_ID: 03bf0c01-34b3-4b8f-9713-b744f0350832
  MLFLOW_EXPERIMENT_NAME: dev_CIFAR10_DDP_train_test2
  MLFLOW_RUN_ID:

MLflow dependencies: 
  Flask: 2.3.2
  Jinja2: 3.1.2
  alembic: 1.11.1
  click: 8.1.3
  cloudpickle: 2.2.0
  databricks-cli: 0.17.7
  docker: 6.1.2
  entrypoints: 0.4
  gitpython: 3.1.31
  gunicorn: 20.1.0
  importlib-metadata: 5.1.0
  markdown: 3.4.1
  matplotlib: 3.5.2
  numpy: 1.22.2
  packaging: 22.0
  pandas: 1.5.2
  protobuf: 3.20.1
  pyarrow: 9.0.0
  pytz: 2022.6
  pyyaml: 6.0
  querystring-parser: 1.2.4
  requests: 2.28.1
  scikit-learn: 0.24.2
  scipy: 1.6.3
  sqlalchemy: 2.0.15
  sqlparse: 0.4.4
34d0f284fac94434817d429e96547367000003:44:44 [2] NCCL INFO cudaDriverVersion 11040
34d0f284fac94434817d429e96547367000003:44:44 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
34d0f284fac94434817d429e96547367000003:44:44 [2] NCCL INFO Bootstrap : Using eth0:10.0.0.7<0>
34d0f284fac94434817d429e96547367000003:44:44 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
34d0f284fac94434817d429e96547367000003:44:44 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
34d0f284fac94434817d429e96547367000003:44:44 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
34d0f284fac94434817d429e96547367000003:44:44 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO P2P plugin IBext
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO NET/IB : No device found.
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.7<0>
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Using network Socket
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_width, ignoring
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_width, ignoring
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_speed, ignoring
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_width, ignoring
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_speed, ignoring
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_width, ignoring
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/000d3a00-71d4-000d-3a00-71d4000d3a00 is not a PCI device (vmbus). Attaching to first CPU
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO === System : maxBw 5.0 totalBw 12.0 ===
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO CPU/0 (1/1/1)
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO + PCI[5000.0] - NIC/0
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO                 + NET[5.0] - NET/0 (0/0/5.000000)
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO + PCI[12.0] - GPU/100000 (8)
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO + PCI[12.0] - GPU/200000 (9)
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO + PCI[12.0] - GPU/300000 (10)
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO + PCI[12.0] - GPU/400000 (11)
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO ==========================================
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO GPU/100000 :GPU/100000 (0/5000.000000/LOC) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO GPU/200000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (0/5000.000000/LOC) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO GPU/300000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (0/5000.000000/LOC) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO GPU/400000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB) 
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO NET/0 :GPU/100000 (3/5.000000/PHB) GPU/200000 (3/5.000000/PHB) GPU/300000 (3/5.000000/PHB) GPU/400000 (3/5.000000/PHB) CPU/0 (2/5.000000/PHB) NET/0 (0/5000.000000/LOC) 
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Setting affinity for GPU 2 to 0fff
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 5.000000/5.000000, type PHB/PHB, sameChannels 1
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO  0 : NET/0 GPU/8 GPU/9 GPU/10 GPU/11 NET/0
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 6.000000/5.000000, type PHB/PHB, sameChannels 1
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO  0 : NET/0 GPU/8 GPU/9 GPU/10 GPU/11 NET/0
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Pattern 3, crossNic 0, nChannels 0, bw 0.000000/0.000000, type NVL/PIX, sameChannels 1
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Ring 00 : 9 -> 10 -> 11
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Ring 01 : 9 -> 10 -> 11
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->9
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Channel 00 : 10[300000] -> 11[400000] via SHM/direct/direct
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Channel 01 : 10[300000] -> 11[400000] via SHM/direct/direct
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Connected all rings
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Channel 00 : 10[300000] -> 9[200000] via SHM/direct/direct
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Channel 01 : 10[300000] -> 9[200000] via SHM/direct/direct
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO Connected all trees
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
World size: 16
local rank is 2 and world rank is 10
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data_10/cifar-10-python.tar.gz
Failed download. Trying https -> http instead. Downloading http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data_10/cifar-10-python.tar.gz

  0%|          | 0/170498071 [00:00<?, ?it/s]
[... tqdm download progress trimmed ...]
100%|██████████| 170498071/170498071 [00:02<00:00, 84146552.80it/s] 
/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
Extracting ./data_10/cifar-10-python.tar.gz to ./data_10
Files already downloaded and verified

  0%|          | 0.00/97.8M [00:00<?, ?B/s]
 20%|█▉        | 19.6M/97.8M [00:00<00:00, 205MB/s]
 46%|████▌     | 44.9M/97.8M [00:00<00:00, 241MB/s]
 75%|███████▌  | 73.4M/97.8M [00:00<00:00, 267MB/s]
100%|██████████| 97.8M/97.8M [00:00<00:00, 265MB/s]
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO NCCL_P2P_PXN_LEVEL set by environment to 0.
34d0f284fac94434817d429e96547367000003:44:218 [2] NCCL INFO comm 0x2c1e8620 rank 10 nranks 16 cudaDev 2 busId 300000 - Init COMPLETE
[E ProcessGroupGloo.cpp:137] Rank 10 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.

Traceback (most recent call last):
  File "train.py", line 107, in <module>
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: Rank 10 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
 Original exception: 
[../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 1800000ms for send operation to complete
34d0f284fac94434817d429e96547367000003:44:220 [2] NCCL INFO [Service thread] Connection closed by localRank 2
34d0f284fac94434817d429e96547367000003:44:44 [2] NCCL INFO comm 0x2c1e8620 rank 10 nranks 16 cudaDev 2 busId 300000 - Abort COMPLETE
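
For reference on the timeout above: the 1800000 ms in the Gloo message is the default 30-minute process-group timeout, which can be raised while debugging slow rank startup (a minimal sketch using the standard timeout argument of init_process_group):

import datetime
import torch.distributed as dist

# Give slow ranks (e.g. ones still downloading the dataset) more headroom
# before barriers and collectives time out; the default is 30 minutes.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(hours=2),
)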

Here's my Dockerfile:

# check release notes https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
# nvidia containers https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-04.html#rel-23-04
# FROM nvcr.io/nvidia/pytorch:22.04-py3
#FROM nvcr.io/nvidia/pytorch:23.02-py3 #requires GPUs with compute capability of 5+
FROM nvcr.io/nvidia/pytorch:22.12-py3

##############################################################################
# NCCL TESTS
##############################################################################
ENV NCCL_TESTS_TAG=v2.11.0

# NOTE: adding gencodes to support K80, M60, V100, A100
RUN mkdir /tmp/nccltests && \
    cd /tmp/nccltests && \
    git clone -b ${NCCL_TESTS_TAG} https://github.com/NVIDIA/nccl-tests.git && \
    cd nccl-tests && \
    make \
    MPI=1 MPI_HOME=/opt/hpcx/ompi \
    NVCC_GENCODE="-gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
    CUDA_HOME=/usr/local/cuda && \
    cp ./build/* /usr/local/bin && \
    rm -rf /tmp/nccltests
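
# Example (hedged, for reference): once the image is built, the copied binaries
# can be used to validate NCCL outside of PyTorch, e.g.:
#   mpirun -np 4 all_reduce_perf -b 8 -e 128M -f 2 -g 1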

# Install dependencies missing in this container
# NOTE: container already has matplotlib==3.5.1 tqdm==4.62.0
COPY requirements.txt ./
RUN pip install -r requirements.txt

# add ndv4-topo.xml
RUN mkdir /opt/microsoft/
ADD ./ndv4-topo.xml /opt/microsoft

# to use on A100, enable env var below in your job
# ENV NCCL_TOPO_FILE="/opt/microsoft/ndv4-topo.xml"

# adjusts the level of info from NCCL tests
ENV NCCL_DEBUG="INFO"
ENV NCCL_DEBUG_SUBSYS="GRAPH,INIT,ENV"

# Relaxed Ordering can greatly help the performance of Infiniband networks in virtualized environments.
# ENV NCCL_IB_PCI_RELAXED_ORDERING="1"
# suggested to set ENV NCCL_IB_PCI_RELAXED_ORDERING to 0 for NCCL 2.18.1
ENV NCCL_IB_PCI_RELAXED_ORDERING="0" 
ENV CUDA_DEVICE_ORDER="PCI_BUS_ID"
ENV NCCL_SOCKET_IFNAME="eth0"
ENV NCCL_P2P_PXN_LEVEL="0"
# ENV NCCL_SOCKET_IFNAME='lo'
ENV NCCL_IB_DISABLE="1"

and here's my requirements.txt:

# torch and torchvision compatibility matrix https://github.com/pytorch/pytorch/wiki/PyTorch-Versions
torch==1.13.0
torchvision==0.14.0

mlflow==2.3.2
azureml-mlflow==1.50.0
matplotlib==3.5.2
tqdm==4.64.0
psutil==5.9.0

# for unit testing
pytest==7.1.2

Here's the train.py code:

import time
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import mlflow
import os
import datetime
import gc
import configparser
import logging
import argparse

from PIL import Image
from torch.distributed.elastic.multiprocessing.errors import record  # TODO: create a main() and decorate it with @record later

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

start_time = time.time()

# torch.cuda.empty_cache()
# gc.collect()

torch.backends.cudnn.benchmark=False
torch.backends.cudnn.deterministic=True

print("NCCL version is: ", torch.cuda.nccl.version())

# MLflow >= 2.0
mlflow.doctor()

# Set the seed for reproducibility
torch.manual_seed(42)

# Set up the data loading parameters
batch_size = 128
num_epochs = 10
num_workers = 4
pin_memory = True

# Get the world size and rank to determine the process group
world_size = int(os.environ['WORLD_SIZE'])
world_rank = int(os.environ['RANK'])
local_rank = int(os.environ['LOCAL_RANK'])

print("World size:", world_size)
print("local rank is {} and world rank is {}".format(local_rank, world_rank))

is_distributed = world_size > 1

if is_distributed:
    batch_size = batch_size // world_size
    batch_size = max(batch_size, 1)
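    # NOTE: this treats 128 as the *global* batch size, so with world_size=16
    # each rank trains with a per-GPU batch of 128 // 16 = 8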

# Set the backend to NCCL for distributed training
dist.init_process_group(backend="nccl",
                        init_method="env://",
                        world_size=world_size,
                        rank=world_rank)

# Set the device to the current local rank
torch.cuda.set_device(local_rank)
device = torch.device('cuda', local_rank)

dist.barrier()

# Define the transforms for the dataset
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
])

# Load the CIFAR-10 dataset

data_root = './data_' + str(world_rank)
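# NOTE: each rank downloads its own copy of CIFAR-10 into ./data_<world_rank>,
# so a 16-rank job performs 16 separate downloads; a slow download on any rank
# can hold up the others at the next collective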
train_dataset = torchvision.datasets.CIFAR10(root=data_root, train=True, download=True, transform=transform_train)
train_sampler = torch.utils.data.distributed.DistributedSampler(dataset=train_dataset, num_replicas=world_size, rank=world_rank, shuffle=True) if is_distributed else None
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=(train_sampler is None), num_workers=num_workers, pin_memory=pin_memory, sampler=train_sampler)

test_dataset = torchvision.datasets.CIFAR10(root=data_root, train=False, download=True, transform=transform_test)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=pin_memory)

# Define the ResNet50 model
model = torchvision.models.resnet50(pretrained=True)
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 10)

# Move the model to the GPU
model = model.to(device)

# Wrap the model with DistributedDataParallel
if is_distributed:
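    # DDP's constructor broadcasts rank 0's parameters and verifies parameter
    # shapes across all ranks; this is the step where the monitoredBarrier
    # timeout in the traceback above fired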
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model for the specified number of epochs
for epoch in range(num_epochs):
    running_loss = 0.0
    if is_distributed:
        # set_epoch() re-seeds the DistributedSampler so each epoch gets a
        # different shuffle; without it every epoch sees the same ordering
        train_sampler.set_epoch(epoch)
    for batch_idx, (inputs, labels) in enumerate(train_loader):
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()

        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()

        optimizer.step()

        running_loss += loss.item()

    print('[Epoch %d] loss: %.3f' % (epoch + 1, running_loss))
    if world_rank == 0:
        # Log the loss and running loss as MLFlow metrics
        mlflow.log_metric("loss", loss.item())
        mlflow.log_metric("running loss", running_loss)

dist.barrier()
# Save the trained model
if world_rank == 0:
    checkpoints_path = "train_checkpoints"
    os.makedirs(checkpoints_path, exist_ok=True)
    torch.save(model.state_dict(), '{}/{}-{}.pth'.format(checkpoints_path, 'resnet50_cifar10', world_rank))
    mlflow.pytorch.log_model(model, "resnet50_cifar10_{}.pth".format(world_rank))
    # mlflow.log_artifact('{}/{}-{}.pth'.format(checkpoints_path, 'resnet50_cifar10', world_rank), artifact_path="model_state_dict")

    # Evaluate the model on the test set and save inference on 6 random images
    correct = 0
    total = 0
    with torch.no_grad():
        fig, axs = plt.subplots(2, 3, figsize=(8, 6), dpi=100)
        axs = axs.flatten()
        count = 0
        for data in test_loader:
            if count == 6:
                break
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

            # Save the inference on the 6 random images
            if count < 6:
                image = np.transpose(inputs[0].cpu().numpy(), (1, 2, 0))
                confidence = torch.softmax(outputs, dim=1)[0][predicted[0]].cpu().numpy()
                class_name = test_dataset.classes[predicted[0]]
                axs[count].imshow(image)
                axs[count].set_title(f'Class: {class_name}\nConfidence: {confidence:.2f}')
                axs[count].axis('off')
                count += 1

    test_accuracy = 100 * correct / total
    print('Test accuracy: %.2f %%' % test_accuracy)

# # Average the test accuracy across all processes

# correct = torch.tensor(correct, dtype=torch.int8)
# correct = correct.to(device)
# torch.distributed.all_reduce(correct, op=torch.distributed.ReduceOp.SUM)
# total = torch.tensor(total, dtype=torch.torch.int8)
# total = total.to(device)
# torch.distributed.all_reduce(total, op=torch.distributed.ReduceOp.SUM)
# test_accuracy = 100 * correct / total
# test_accuracy /= world_size

# print('Test accuracy: %.2f %%' % test_accuracy)

# Save the plot with the 6 random images, their predicted classes, and the
# prediction confidence (rank 0 only, since the figure is only built there)
if world_rank == 0:
    test_img_file_name = 'test_images_' + str(world_rank) + '.png'
    plt.savefig(test_img_file_name)

# Log the test accuracy and elapsed time to MLflow
if world_rank == 0:
    mlflow.log_metric("test accuracy", test_accuracy)

end_time = time.time()
elapsed_time = end_time - start_time
print('Elapsed time: ', elapsed_time)
if world_rank == 0:
    mlflow.log_metric("elapsed time", elapsed_time)

if world_rank == 0:
    # Save the plot with the 6 random images and their predicted classes and prediction confidence as an artifact in MLflow
    image = Image.open(test_img_file_name)
    image = image.convert('RGBA')
    image_buffer = np.array(image)
    image_buffer = image_buffer[:, :, [2, 1, 0, 3]]
    image_buffer = np.ascontiguousarray(image_buffer)
    artifact_file_name = "inference_on_test_images_" + str(world_rank) + ".png"
    mlflow.log_image(image_buffer, artifact_file=artifact_file_name)

# End the MLflow run
if mlflow.active_run():
    mlflow.end_run()

dist.destroy_process_group()
AddyLaddy commented 1 year ago

I don't see how that is a NCCL issue. Your PyT is probably built against a static version of libnccl. I'm not a PyT expert but the forums suggest you need to rebuild it with:

export USE_SYSTEM_NCCL=1

Or find a container built with a newer version of NCCL

monajalal commented 1 year ago

@AddyLaddy Thanks for having a look. Why do you say this is not an NCCL error? It shows NCCL here:

Traceback (most recent call last):
  File "train.py", line 107, in <module>
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: Rank 10 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
 Original exception: 
[../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 1800000ms for send operation to complete
34d0f284fac94434817d429e96547367000003:44:220 [2] NCCL INFO [Service thread] Connection closed by localRank 2
34d0f284fac94434817d429e96547367000003:44:44 [2] NCCL INFO comm 0x2c1e8620 rank 10 nranks 16 cudaDev 2 busId 300000 - Abort COMPLETE

I will also post it on PyTorch forum.

Could you please tell me which NVIDIA PyTorch container has NCCL 2.17.1+?

Here's a list I have. https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/

I was not able to use FROM nvcr.io/nvidia/pytorch:23.02-py3 because it requires GPUs with compute capability 5.0+, and my cluster contains 4 nodes, each with 4 K80 GPUs (CC 3.7).

Currently I am using FROM nvcr.io/nvidia/pytorch:22.12-py3.
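
For reference, a minimal sketch for checking compute capability from Python (torch.cuda.get_device_capability is the standard API; K80s report (3, 7)):

import torch

# Containers that require CC >= 5.0 will not run on GPUs reporting below that.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -- CC {major}.{minor}")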

AddyLaddy commented 1 year ago

OK, it's likely that there isn't a K80-compatible build with the latest NCCL library, as CUDA has now dropped support for the Kepler architecture. The NCCL releases are only built and tested against the CUDA architectures supported at the time. Why do you need the latest version of NCCL? Is there some issue with the version inside a container that still supports K80?

monajalal commented 1 year ago

@AddyLaddy I moved on to a cluster of two nodes, each with 2 V100 GPUs, and the problem is gone. I also moved from US East 2 to US East.