Looks like a PyTorch error, which I'm not familiar with. I don't see any NCCL WARN, so it doesn't look like NCCL failed. Note you're using ndv4-topo.xml, which is not the right topology file for the Azure instance you are using, so you might be following a recipe that was not intended for this type of instance.
@sjeaugey Good point, Sylvain. I am new to this. Is there a resource you could suggest if my topology doesn't match ndv4-topo.xml, or is a topology file necessary at all? Can NCCL not figure out the topology automatically?
I have 4 nodes in a cluster, and each node has 4 K80 GPUs.
ndv4-topo.xml is for the "NDv4" instance type. I don't think there is a file for your kind of instance, so the best is to not set NCCL_TOPO_FILE at all.
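For example (a minimal sketch, assuming the variable might be inherited from the job environment rather than set in your own code), you can clear it before anything initializes NCCL:

import os

# Clear any inherited topology file before the process group (and thus NCCL)
# initializes; with NCCL_TOPO_FILE unset, NCCL detects the node topology itself.
os.environ.pop("NCCL_TOPO_FILE", None)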
@sjeaugey Thanks for checking. Yes, I am not using it; it is commented out (the original template says to uncomment it if using A100):
(base) mona@ard-gpu-01:~/ARD-ML-CIFAR10$ rg NCCL_TOPO_FILE
data-science/environment/Dockerfile
32:# ENV NCCL_TOPO_FILE="/opt/microsoft/ndv4-topo.xml"
mlops/azureml/train/pipeline.yaml
52: # NCCL_TOPO_FILE: "/opt/microsoft/ndv4-topo.xml" # Use specific topology file for A100
OK. Still, back to my original point: it seems PyTorch is failing somehow in network communication. I don't know how to fix that, though.
From the log you posted, looks like c10d's out-of-band exchange of the ncclUniqueId over TCP is timing out. I don't think this has to do with NCCL. I am not familiar with Azure, but I would check things like Security Group rules to make sure the right ports are open. Are you able to run this with a different ProcessGroup in PyTorch (Gloo or MPI, perhaps)?
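For example, a minimal sketch of such a check (assuming the script is launched with torchrun, so MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are already in the environment):

import torch
import torch.distributed as dist

# Same TCP rendezvous as the NCCL run, different backend: if this also
# times out, the problem is network/port connectivity, not NCCL itself.
dist.init_process_group(backend="gloo")
t = torch.ones(1)
dist.all_reduce(t)  # a trivial collective to confirm the ranks can talk
print(f"rank {dist.get_rank()}: all_reduce ok, value={t.item()}")
dist.destroy_process_group()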
Hey @monajalal, did you manage to fix the issue? I'm getting the same error on an Azure compute node with 8x A100s with FSDP distributed training.
I tried increasing the timeout and the problem was solved:
import datetime
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(days=2))
The cause in my case was that the codebase used the old args.local_rank syntax. Once we converted to local_rank = int(os.environ["LOCAL_RANK"]), the issue disappeared. The timeout change didn't work for us.
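In other words, a short sketch of the two patterns (--local_rank is the argument the deprecated torch.distributed.launch injected; LOCAL_RANK is the environment variable torchrun exports):

import os

# Old style: torch.distributed.launch passed a --local_rank argument:
#   parser.add_argument("--local_rank", type=int)
#   local_rank = args.local_rank
# New style: torchrun exports LOCAL_RANK (along with RANK and WORLD_SIZE):
local_rank = int(os.environ["LOCAL_RANK"])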
I am using the SFTTrainer to do LoRA fine-tuning of LLMs, and this error occurs while the dataset is being mapped in the SFTTrainer object. SFTTrainer automatically maps your data when you create the trainer object, and if I use a somewhat larger dataset the code fails at that step with this error. Note: it was working on a small dataset; the problem appears as the data size grows, so in my case none of the environment variables solved the problem. I solved it by moving SFTTrainer's dataset-map function out to another part of the script, doing the mapping there, feeding the mapped dataset to the trainer object, and also passing dataset_kwargs={"skip_prepare_dataset": True} to the SFTTrainer when creating the object. Examples below.

---- Creating the SFTTrainer Object ----
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_mapped,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
    dataset_kwargs={"skip_prepare_dataset": True},
)
---- Map Function Taken Out of SFTTrainer ----
import warnings  # used by the remove_unused_columns warning below


def _prepare_non_packed_dataloader(
    tokenizer,
    dataset,
    dataset_text_field,
    max_seq_length,
    formatting_func=None,
    add_special_tokens=True,
    remove_unused_columns=True,
):
    use_formatting_func = formatting_func is not None and dataset_text_field is None
    _dataset_sanity_checked = False

    # Inspired from: https://huggingface.co/learn/nlp-course/chapter7/6?fw=pt
    def tokenize(element):
        # nonlocal is needed so the flag set below persists across calls
        nonlocal _dataset_sanity_checked
        outputs = tokenizer(
            element[dataset_text_field] if not use_formatting_func else formatting_func(element),
            add_special_tokens=add_special_tokens,
            truncation=True,
            padding=False,
            max_length=max_seq_length,
            return_overflowing_tokens=False,
            return_length=False,
        )

        if use_formatting_func and not _dataset_sanity_checked:
            if not isinstance(formatting_func(element), list):
                raise ValueError(
                    "The `formatting_func` should return a list of processed strings since it can lead to silent bugs."
                )
            else:
                _dataset_sanity_checked = True

        return {"input_ids": outputs["input_ids"], "attention_mask": outputs["attention_mask"]}

    signature_columns = ["input_ids", "labels", "attention_mask"]
    extra_columns = list(set(dataset.column_names) - set(signature_columns))

    if not remove_unused_columns and len(extra_columns) > 0:
        warnings.warn(
            "You passed `remove_unused_columns=False` on a non-packed dataset. This might create some issues with "
            "the default collator and yield to errors. If you want to inspect dataset other columns (in this case "
            f"{extra_columns}), you can subclass `DataCollatorForLanguageModeling` in case you used the default "
            "collator and create your own data collator in order to inspect the unused dataset columns."
        )

    tokenized_dataset = dataset.map(
        tokenize,
        batched=True,
        remove_columns=dataset.column_names if remove_unused_columns else None,
        num_proc=None,
        batch_size=1000,
    )

    return tokenized_dataset
dataset_mapped = _prepare_non_packed_dataloader(tokenizer, dataset, "text", max_seq_length)
I am using an Azure GPU cluster with 4 nodes, each with 4 K80 GPUs (16 GPUs total).
train-env.yaml:

# for local testing (cpu)
torchvision==0.12.0
torch==1.11.0
transformers==4.18.0

# for metrics reporting/plotting
mlflow==2.3.2
azureml-mlflow==1.50.0
matplotlib==3.5.2
tqdm==4.64.0
psutil==5.9.0

# for unit testing
pytest==7.1.2
and Dockerfile:

# check release notes https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
FROM nvcr.io/nvidia/pytorch:22.04-py3

##############################################################################
# NCCL TESTS
##############################################################################
ENV NCCL_TESTS_TAG=v2.11.0

# NOTE: adding gencodes to support K80, M60, V100, A100
RUN mkdir /tmp/nccltests && \
    cd /tmp/nccltests && \
    git clone -b ${NCCL_TESTS_TAG} https://github.com/NVIDIA/nccl-tests.git && \
    cd nccl-tests && \
    make \
        MPI=1 MPI_HOME=/opt/hpcx/ompi \
        NVCC_GENCODE="-gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
        CUDA_HOME=/usr/local/cuda && \
    cp ./build/* /usr/local/bin && \
    rm -rf /tmp/nccltests

# Install dependencies missing in this container
# NOTE: container already has matplotlib==3.5.1 tqdm==4.62.0
COPY requirements.txt ./
RUN pip install -r requirements.txt

# add ndv4-topo.xml
RUN mkdir /opt/microsoft/
ADD ./ndv4-topo.xml /opt/microsoft

# to use on A100, enable env var below in your job
# ENV NCCL_TOPO_FILE="/opt/microsoft/ndv4-topo.xml"

# adjusts the level of info from NCCL tests
ENV NCCL_DEBUG="INFO"
ENV NCCL_DEBUG_SUBSYS="GRAPH,INIT,ENV"

# Relaxed Ordering can greatly help the performance of InfiniBand networks in virtualized environments.
ENV NCCL_IB_PCI_RELAXED_ORDERING="1"
ENV CUDA_DEVICE_ORDER="PCI_BUS_ID"
ENV NCCL_SOCKET_IFNAME="eth0"
ENV NCCL_SOCKET_IFNAME='lo'
ENV NCCL_IB_DISABLE="1"