NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL Unable to register memory #563

Open chaoyanghe opened 3 years ago

chaoyanghe commented 3 years ago

I have been stuck on this problem for many days. How can I debug issues like this?

Currently, I am using all_to_all with NCCL, launched via MPI (world_size=64). The error is as follows (a minimal sketch of the failing call is shown after the traceback).

10.0.87.143: 10.0.87.143: ip-10-0-87-143:149:298 [5] register_mr_buffers:465 NCCL WARN NET/OFI Unable to register memory (type = 1) for device 0. RC: 12, Error: Cannot allocate memory
10.0.87.143: 10.0.87.143: ip-10-0-87-143:149:298 [5] NCCL INFO include/net.h:23 -> 2
10.0.87.143: 10.0.87.143: ip-10-0-87-143:149:298 [5] NCCL INFO transport/net.cc:223 -> 2
10.0.87.143: 10.0.87.143: ip-10-0-87-143:149:298 [5] NCCL INFO transport.cc:111 -> 2
10.0.87.143: 10.0.87.143: ip-10-0-87-143:149:298 [5] NCCL INFO group.cc:137 -> 2 [Async thread]
10.0.87.143: 10.0.87.143: ip-10-0-87-143:149:149 [5] NCCL INFO group.cc:210 -> 2
10.0.87.143: 10.0.87.143: Traceback (most recent call last):
10.0.87.143: 10.0.87.143:   File "main_moe.py", line 187, in <module>
10.0.87.143: 10.0.87.143:     main()
10.0.87.143: 10.0.87.143:   File "main_moe.py", line 158, in main
10.0.87.143: 10.0.87.143:     output = model(x)
10.0.87.143: 10.0.87.143:   File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
10.0.87.143: 10.0.87.143:     return forward_call(*input, **kwargs)
10.0.87.143: 10.0.87.143:   File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 801, in forward
10.0.87.143: 10.0.87.143:     output = self.module(*inputs, **kwargs)
10.0.87.143: 10.0.87.143:   File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
10.0.87.143: 10.0.87.143:     return forward_call(*input, **kwargs)
10.0.87.143: 10.0.87.143:   File "/home/deepspeed/.local/lib/python3.6/site-packages/moe_layer/model/standard_moe_layer.py", line 45, in forward
10.0.87.143: 10.0.87.143:     h = self.moe_layer(hidden_states)
10.0.87.143: 10.0.87.143:   File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
10.0.87.143: 10.0.87.143:     return forward_call(*input, **kwargs)
10.0.87.143: 10.0.87.143:   File "/home/deepspeed/.local/lib/python3.6/site-packages/moe_layer/model/switch_transformer_layers.py", line 170, in forward
10.0.87.143: 10.0.87.143:     expert_inputs = _AllToAll.apply(self.process_group, torch.cat(route_inputs))
10.0.87.143: 10.0.87.143:   File "/home/deepspeed/.local/lib/python3.6/site-packages/moe_layer/model/switch_transformer_layers.py", line 25, in forward
10.0.87.143: 10.0.87.143:     dist.all_to_all_single(output, input, group=group)
10.0.87.143: 10.0.87.143:   File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 2386, in all_to_all_single
10.0.87.143: 10.0.87.143:     work = group.alltoall_base(output, input, output_split_sizes, input_split_sizes, opts)
10.0.87.143: 10.0.87.143: RuntimeError: NCCL error in: /tmp/pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:38, unhandled system error, NCCL version 21.0.3
10.0.87.143: 10.0.87.143: ncclSystemError: System call (socket, malloc, munmap, etc) failed.
10.0.87.143: 10.0.78.113: ip-10-0-78-113:69:69 [2] bootstrap.cc:528 NCCL WARN Unexpected connections are not empty
10.0.87.143: 10.0.78.113: ip-10-0-78-113:69:69 [2] NCCL INFO init.cc:202 -> 3
10.0.87.143: 10.0.78.113: ip-10-0-78-113:69:69 [2] NCCL INFO init.cc:990 -> 3
10.0.87.143: 10.0.78.113: terminate called after throwing an instance of 'std::runtime_error'
10.0.87.143: 10.0.78.113:   what():  NCCL error in: /tmp/pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:158, internal error, NCCL version 21.0.3
10.0.87.143: 10.0.78.113: ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption
10.0.87.143: 10.0.72.60:     output = model(x)
10.0.87.143: 10.0.72.60:   File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
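For context, the failing call at the bottom of the traceback is torch.distributed.all_to_all_single. Below is a minimal sketch of that usage pattern; it assumes a process group already initialized with the NCCL backend, and it is a simplification, not the actual moe_layer code.

import torch
import torch.distributed as dist

def all_to_all(group, input):
    # all_to_all_single scatters equal-sized chunks of `input` to every rank
    # in the group and gathers one chunk back from each rank into `output`.
    output = torch.empty_like(input)
    dist.all_to_all_single(output, input, group=group)
    return output

With world_size=64, every rank exchanges a chunk with all 63 peers, so each call fans out into many point-to-point transfers over the network transport.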
wzamazon commented 3 years ago

It looks like you are using EFA, because register_mr_buffers is a function defined in the aws-ofi-nccl plugin.

If that is the case, maybe open an issue at https://github.com/aws/aws-ofi-nccl.

wzamazon commented 3 years ago

Also, try running dmesg to see if there are any kernel-level error messages.

chaoyanghe commented 3 years ago

dmesg result:

[    7.333531] nvidia: module license 'NVIDIA' taints kernel.
[    7.333531] Disabling lock debugging due to kernel taint
[    7.333532] nvidia: module license 'NVIDIA' taints kernel.
[    7.333533] nvidia: module license 'NVIDIA' taints kernel.
[    7.345808] ena 0000:00:05.0: Elastic Network Adapter (ENA) found at mem fd014000, mac addr 0e:e2:cf:7d:1a:ff
[    7.408229] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    7.449617] nvidia-nvlink: Nvlink Core is being initialized, major device number 246
[    7.457237] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
[    7.504615] ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
[    7.599766] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10
[    7.755746] mousedev: PS/2 mouse device common for all mice
[    7.828257] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.80.02  Wed Sep 23 01:13:39 UTC 2020
[    7.924954] Elastic Fabric Adapter (EFA) v1.10.2g
[    8.038953] efa 0000:00:06.0: Setup irq:0xffff88ddba9459a0 vector:123 name:efa-mgmnt@pci:0000:00:06.0
[    8.048374] infiniband efa_0: IB device registered
[    8.150111] Neuron Driver Started with Version:1.1.2.0
[   15.356571] RPC: Registered named UNIX socket transport module.
[   15.365231] RPC: Registered udp transport module.
[   15.369147] RPC: Registered tcp transport module.
[   15.373190] RPC: Registered tcp NFSv4.1 backchannel transport module.
[   15.484024] random: crng init done
[   15.484026] random: 7 urandom warning(s) missed due to ratelimiting
[   21.265449] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[   21.271986] Bridge firewalling registered
[   21.301224] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[   22.417880] Initializing XFRM netlink socket
[   22.429731] Netfilter messages via NETLINK v0.30.
[   22.440908] ctnetlink v0.93: registering with nfnetlink.
[   24.627980] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
[   27.914135] NVRM: Persistence mode is deprecated and will be removed in a future release. Please use nvidia-persistenced instead.
[  541.962521] libcfs: module is from the staging directory, the quality is unknown, you have been warned.
[  541.973476] LNet: HW NUMA nodes: 2, HW CPU cores: 96, npartitions: 2
[  541.977826] alg: No test for adler32 (adler32-zlib)
[  542.744119] lnet: module is from the staging directory, the quality is unknown, you have been warned.
[  542.771058] obdclass: module is from the staging directory, the quality is unknown, you have been warned.
[  542.788107] Lustre: Lustre: Build Version: 2.10.8
[  542.811380] ptlrpc: module is from the staging directory, the quality is unknown, you have been warned.
[  542.838343] ksocklnd: module is from the staging directory, the quality is unknown, you have been warned.
[  542.848721] LNet: Added LNI 10.0.87.143@tcp [8/256/0/180]
[  542.851520] LNet: Accept secure, port 988
[  542.867378] lov: module is from the staging directory, the quality is unknown, you have been warned.
[  542.880697] fid: module is from the staging directory, the quality is unknown, you have been warned.
[  542.897622] mdc: module is from the staging directory, the quality is unknown, you have been warned.
[  542.913663] fld: module is from the staging directory, the quality is unknown, you have been warned.
[  542.930234] lmv: module is from the staging directory, the quality is unknown, you have been warned.
[  542.955365] lustre: module is from the staging directory, the quality is unknown, you have been warned.
[  542.976986] mgc: module is from the staging directory, the quality is unknown, you have been warned.
[  543.010859] osc: module is from the staging directory, the quality is unknown, you have been warned.
[  543.221195] Lustre: Mounted 4vozvbmv-client
[193073.072231] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
[193073.190398] nvidia-uvm: Loaded the UVM driver, major device number 243.
[193073.219362] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  450.80.02  Wed Sep 23 00:48:09 UTC 2020
[669689.122792] Lustre: 10177:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583759/real 1630583759]  req@ffff893d76464b00 x1709089496410816/t0(0) o400->MGC10.0.205.151@tcp@10.0.205.
151@tcp:26/25 lens 224/224 e 0 to 1 dl 1630583766 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669689.122812] Lustre: 4vozvbmv-MDT0000-mdc-ffff88ddb2378000: Connection to 4vozvbmv-MDT0000 (at 10.0.205.151@tcp) was lost; in progress operations using this service will wait for recovery to complete
[669689.144092] Lustre: 10177:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[669689.165176] LustreError: 166-1: MGC10.0.205.151@tcp: Connection to MGS (at 10.0.205.151@tcp) was lost; in progress operations using this service will fail
[669694.242785] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583766/real 1630583766]  req@ffff8938c5082700 x1709089496416160/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb
2378000@10.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583772 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669695.266788] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583767/real 1630583767]  req@ffff8938c5080f00 x1709089496416176/t0(0) o250->MGC10.0.205.151@tcp@10.0.205.
151@tcp:26/25 lens 520/544 e 0 to 1 dl 1630583773 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669720.866793] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583792/real 1630583792]  req@ffff8938c5083600 x1709089496416208/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb
2378000@10.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583798 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669725.986801] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583792/real 1630583792]  req@ffff8938c5080c00 x1709089496416192/t0(0) o250->MGC10.0.205.151@tcp@10.0.205.
151@tcp:26/25 lens 520/544 e 0 to 1 dl 1630583803 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669752.610835] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630583819/real 0]  req@ffff8938cdda9500 x1709089496421552/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@1
0.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583830 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[669832.482918] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630583899/real 0]  req@ffff89357abe0600 x1709089496437520/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@1
0.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583910 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[669832.504005] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[669890.850961] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630583952/real 0]  req@ffff89389755f800 x1709089496448176/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@1
0.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583968 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[669890.873292] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[669958.434964] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630584005/real 0]  req@ffff8938cbc7e000 x1709089496458816/t0(0) o250->MGC10.0.205.151@tcp@10.0.205.151@tcp:2
6/25 lens 520/544 e 0 to 1 dl 1630584036 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[669958.454443] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[670108.963185] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630584165/real 0]  req@ffff8938cbb37200 x1709089496490768/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@1
0.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630584186 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[670108.984529] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[670381.155547] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1630584404/real 1630584458]  req@ffff8938ad64f200 x1709089496538720/t0(0) o250->MGC10.0.205.151@tcp@10.0.2
05.151@tcp:26/25 lens 520/544 e 0 to 1 dl 1630584459 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[670381.176961] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
[670937.156108] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1630584964/real 1630585014]  req@ffff893d83ed0900 x1709089496650656/t0(0) o250->MGC10.0.205.151@tcp@10.0.2
05.151@tcp:26/25 lens 520/544 e 0 to 1 dl 1630585019 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[670937.177572] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 22 previous similar messages
[671099.175718] Lustre: Evicted from MGS (at 198.19.54.206@tcp1) after server handle changed from 0x5bc0911ee77e15fb to 0x1694543d7262fec
[671099.184834] LustreError: 10110:0:(client.c:3009:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88dd66d7ad00 x1709089395116832/t95103249569(95103249569) o101->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@10.0.205.151@tcp:12/10 
lens 992/544 e 0 to 0 dl 1630585348 ref 2 fl Interpret:RP/4/0 rc 301/301
[671099.184923] Lustre: MGC10.0.205.151@tcp: Connection restored to 198.19.54.206@tcp1 (at 10.0.205.151@tcp)
[671521.495128] Lustre: 4vozvbmv-MDT0000-mdc-ffff88ddb2378000: Connection restored to 10.0.205.151@tcp (at 10.0.205.151@tcp)
[678853.930111] Lustre: 10171:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630592924/real 1630592924]  req@ffff8938aefeb900 x1709089498252000/t0(0) o400->4vozvbmv-OST00e3-osc-ffff88dd
b2378000@10.0.220.229@tcp:28/4 lens 224/224 e 0 to 1 dl 1630592931 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[678853.930135] Lustre: 4vozvbmv-OST00e2-osc-ffff88ddb2378000: Connection to 4vozvbmv-OST00e2 (at 10.0.220.229@tcp) was lost; in progress operations using this service will wait for recovery to complete
[678853.952118] Lustre: 10171:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
[678944.042170] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630593010/real 1630593010]  req@ffff893d27b93600 x1709089498268048/t0(0) o8->4vozvbmv-OST00e2-osc-ffff88ddb2
378000@10.0.220.229@tcp:28/4 lens 520/544 e 0 to 1 dl 1630593021 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[678944.063418] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
[679238.188040] Lustre: 4vozvbmv-OST00e3-osc-ffff88ddb2378000: Connection restored to 198.19.32.207@tcp1 (at 10.0.220.229@tcp)
[1038619.687627] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.695161] infiniband efa_0: Failed to register mr [-12]
[1038619.695196] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.700624] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.707090] infiniband efa_0: Failed to register mr [-12]
[1038619.707145] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.707147] infiniband efa_0: Failed to register mr [-12]
[1038619.714600] infiniband efa_0: Failed to register mr [-12]
[1038619.714602] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.714604] infiniband efa_0: Failed to register mr [-12]
[1038619.726444] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.726511] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.726512] infiniband efa_0: Failed to register mr [-12]
[1038619.766218] infiniband efa_0: Failed to register mr [-12]
[1088198.846130] python3[41849]: segfault at 10 ip 00007f6ac5b5a1af sp 00007ffdd32cdd80 error 4 in libc-2.27.so[7f6ac5ac5000+1e7000]
[1090746.882487] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.889954] infiniband efa_0: Failed to register mr [-12]
[1090746.889990] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.895063] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.901576] infiniband efa_0: Failed to register mr [-12]
[1090746.901584] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.901587] infiniband efa_0: Failed to register mr [-12]
[1090746.909000] infiniband efa_0: Failed to register mr [-12]
[1090746.909007] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.914505] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.920860] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.920904] infiniband efa_0: Failed to register mr [-12]
[1090746.920939] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.920941] infiniband efa_0: Failed to register mr [-12]
[1090746.925258] infiniband efa_0: Failed to register mr [-12]
[1090746.972127] infiniband efa_0: Failed to register mr [-12]
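Note that RC: 12 in the NCCL warning and err -12 in the kernel log above are both errno 12, i.e. ENOMEM ("Cannot allocate memory"): the EFA driver is rejecting the REG_MR (register memory region) command, rather than the application running out of heap. A quick sanity check of the errno name:

import errno, os

# errno 12 is ENOMEM; os.strerror returns the same text as the NCCL warning.
print(errno.errorcode[12])  # 'ENOMEM'
print(os.strerror(12))      # 'Cannot allocate memory'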
chaoyanghe commented 3 years ago
export OMP_NUM_THREADS=8
export NCCL_NSOCKS_PERTHREAD=8
export NCCL_SOCKET_NTHREADS=8

When world_size = 64, are these settings too big?

chaoyanghe commented 3 years ago

NCCL_NSOCKS_PERTHREAD=8 -x NCCL_SOCKET_NTHREADS=8 in your command line is causing each connection to create 64 sockets (each socket using one file descriptor), hence:

64 sockets per connection × 8 processes per node × (64 − 8) remote ranks × 2 (send/receive) = 57344
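Spelled out with the numbers quoted above, the file-descriptor arithmetic looks like this (a sketch; the per-node process count and world size are taken from this thread):

# Sketch of the socket/file-descriptor count implied by the settings above.
nsocks_per_thread = 8   # NCCL_NSOCKS_PERTHREAD
socket_nthreads   = 8   # NCCL_SOCKET_NTHREADS
procs_per_node    = 8
world_size        = 64

sockets_per_connection = nsocks_per_thread * socket_nthreads       # 64
remote_ranks = world_size - procs_per_node                         # 56
fds = sockets_per_connection * procs_per_node * remote_ranks * 2   # send/receive
print(fds)  # 57344, far above a typical default `ulimit -n` of 1024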

wzamazon commented 3 years ago

Hi, I am a little confused about how sockets come into play in this scenario. If you are using EFA, then you are not using sockets. Or am I missing something?

chaoyanghe commented 3 years ago

> Hi, I am a little confused about how sockets come into play in this scenario. If you are using EFA, then you are not using sockets. Or am I missing something?

I see. I checked this by setting them both to 1; the same error happened.

chaoyanghe commented 3 years ago

@wzamazon I also created an issue on EFA's GitHub:

https://github.com/aws/aws-ofi-nccl/issues/70

If you are at Amazon, could we create a channel to debug? I know an EFA engineer. What's your internal ID at Amazon?

chaoyanghe commented 3 years ago

https://github.com/NVIDIA/nccl-tests/issues/78

chaoyanghe commented 3 years ago

https://github.com/pytorch/pytorch/pull/42514/files#diff-5e6ecb5ae373ed102c2d012ea8847336R21

chaoyanghe commented 3 years ago

This issue was finally resolved by setting NCCL_BUFFSIZE=2097152.
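For anyone who lands here with the same error: NCCL reads NCCL_BUFFSIZE from the environment when the communicator is created, so it must be set before the process group initializes (in practice it is usually exported in the launcher environment, e.g. before mpirun or torchrun). A hedged sketch in Python:

import os

# NCCL_BUFFSIZE is the size in bytes of the buffer NCCL allocates per
# connection; the default is 4 MiB, so 2097152 (2 MiB) halves the amount
# of memory NCCL asks the network provider to register.
os.environ["NCCL_BUFFSIZE"] = "2097152"

import torch.distributed as dist
dist.init_process_group(backend="nccl")  # NCCL picks the variable up here

With 64 ranks doing all_to_all, shrinking the per-connection buffer reduces the total registered memory, which is consistent with the REG_MR / ENOMEM failures seen in dmesg hitting the EFA device's registration limit.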