Open chaoyanghe opened 3 years ago
It looks like you are using EFA because register_mr_buffers
is a function defined in aws-ofi-nccl plugin.
If it is the case, maybe open an issue in https://github.com/aws/aws-ofi-nccl.
Also try to run dmesg
to see if there are kernel level error message.
dmesg
result:
[ 7.333531] nvidia: module license 'NVIDIA' taints kernel. [64/1940]
[ 7.333531] Disabling lock debugging due to kernel taint
[ 7.333532] nvidia: module license 'NVIDIA' taints kernel.
[ 7.333533] nvidia: module license 'NVIDIA' taints kernel.
[ 7.345808] ena 0000:00:05.0: Elastic Network Adapter (ENA) found at mem fd014000, mac addr 0e:e2:cf:7d:1a:ff
[ 7.408229] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 7.449617] nvidia-nvlink: Nvlink Core is being initialized, major device number 246
[ 7.457237] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
[ 7.504615] ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
[ 7.599766] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 10
[ 7.755746] mousedev: PS/2 mouse device common for all mice
[ 7.828257] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 450.80.02 Wed Sep 23 01:13:39 UTC 2020
[ 7.924954] Elastic Fabric Adapter (EFA) v1.10.2g
[ 8.038953] efa 0000:00:06.0: Setup irq:0xffff88ddba9459a0 vector:123 name:efa-mgmnt@pci:0000:00:06.0
[ 8.048374] infiniband efa_0: IB device registered
[ 8.150111] Neuron Driver Started with Version:1.1.2.0
[ 15.356571] RPC: Registered named UNIX socket transport module.
[ 15.365231] RPC: Registered udp transport module.
[ 15.369147] RPC: Registered tcp transport module.
[ 15.373190] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 15.484024] random: crng init done
[ 15.484026] random: 7 urandom warning(s) missed due to ratelimiting
[ 21.265449] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[ 21.271986] Bridge firewalling registered
[ 21.301224] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[ 22.417880] Initializing XFRM netlink socket
[ 22.429731] Netfilter messages via NETLINK v0.30.
[ 22.440908] ctnetlink v0.93: registering with nfnetlink.
[ 24.627980] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
[ 27.914135] NVRM: Persistence mode is deprecated and will be removed in a future release. Please use nvidia-persistenced instead.
[ 541.962521] libcfs: module is from the staging directory, the quality is unknown, you have been warned.
[ 541.973476] LNet: HW NUMA nodes: 2, HW CPU cores: 96, npartitions: 2
[ 541.977826] alg: No test for adler32 (adler32-zlib)
[ 542.744119] lnet: module is from the staging directory, the quality is unknown, you have been warned.
[ 542.771058] obdclass: module is from the staging directory, the quality is unknown, you have been warned.
[ 542.788107] Lustre: Lustre: Build Version: 2.10.8
[ 542.811380] ptlrpc: module is from the staging directory, the quality is unknown, you have been warned.
[ 542.838343] ksocklnd: module is from the staging directory, the quality is unknown, you have been warned.
[ 542.848721] LNet: Added LNI 10.0.87.143@tcp [8/256/0/180]
[ 542.851520] LNet: Accept secure, port 988
[ 542.867378] lov: module is from the staging directory, the quality is unknown, you have been warned.
[ 542.880697] fid: module is from the staging directory, the quality is unknown, you have been warned.
[ 542.897622] mdc: module is from the staging directory, the quality is unknown, you have been warned.
[ 542.913663] fld: module is from the staging directory, the quality is unknown, you have been warned.
[ 542.930234] lmv: module is from the staging directory, the quality is unknown, you have been warned.
[ 542.955365] lustre: module is from the staging directory, the quality is unknown, you have been warned.
[ 542.976986] mgc: module is from the staging directory, the quality is unknown, you have been warned.
[ 543.010859] osc: module is from the staging directory, the quality is unknown, you have been warned.
[ 543.221195] Lustre: Mounted 4vozvbmv-client
[193073.072231] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
[193073.190398] nvidia-uvm: Loaded the UVM driver, major device number 243.
[193073.219362] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 450.80.02 Wed Sep 23 00:48:09 UTC 2020
[669689.122792] Lustre: 10177:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583759/real 1630583759] req@ffff893d76464b00 x1709089496410816/t0(0) o400->MGC10.0.205.151@tcp@10.0.205.
151@tcp:26/25 lens 224/224 e 0 to 1 dl 1630583766 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669689.122812] Lustre: 4vozvbmv-MDT0000-mdc-ffff88ddb2378000: Connection to 4vozvbmv-MDT0000 (at 10.0.205.151@tcp) was lost; in progress operations using this service will wait for recovery to complete
[669689.144092] Lustre: 10177:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[669689.165176] LustreError: 166-1: MGC10.0.205.151@tcp: Connection to MGS (at 10.0.205.151@tcp) was lost; in progress operations using this service will fail
[669694.242785] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583766/real 1630583766] req@ffff8938c5082700 x1709089496416160/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb
2378000@10.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583772 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669695.266788] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583767/real 1630583767] req@ffff8938c5080f00 x1709089496416176/t0(0) o250->MGC10.0.205.151@tcp@10.0.205.
151@tcp:26/25 lens 520/544 e 0 to 1 dl 1630583773 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669720.866793] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583792/real 1630583792] req@ffff8938c5083600 x1709089496416208/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb
2378000@10.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583798 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669725.986801] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630583792/real 1630583792] req@ffff8938c5080c00 x1709089496416192/t0(0) o250->MGC10.0.205.151@tcp@10.0.205.
151@tcp:26/25 lens 520/544 e 0 to 1 dl 1630583803 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[669752.610835] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630583819/real 0] req@ffff8938cdda9500 x1709089496421552/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@1
151@tcp:26/25 lens 520/544 e 0 to 1 dl 1630583803 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 [0/1940]
[669752.610835] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630583819/real 0] req@ffff8938cdda9500 x1709089496421552/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@1
0.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583830 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[669832.482918] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630583899/real 0] req@ffff89357abe0600 x1709089496437520/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@1
0.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583910 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[669832.504005] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[669890.850961] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630583952/real 0] req@ffff89389755f800 x1709089496448176/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@1
0.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630583968 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[669890.873292] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[669958.434964] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630584005/real 0] req@ffff8938cbc7e000 x1709089496458816/t0(0) o250->MGC10.0.205.151@tcp@10.0.205.151@tcp:2
6/25 lens 520/544 e 0 to 1 dl 1630584036 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[669958.454443] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[670108.963185] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630584165/real 0] req@ffff8938cbb37200 x1709089496490768/t0(0) o38->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@1
0.0.205.151@tcp:12/10 lens 520/544 e 0 to 1 dl 1630584186 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[670108.984529] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[670381.155547] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1630584404/real 1630584458] req@ffff8938ad64f200 x1709089496538720/t0(0) o250->MGC10.0.205.151@tcp@10.0.2
05.151@tcp:26/25 lens 520/544 e 0 to 1 dl 1630584459 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[670381.176961] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
[670937.156108] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1630584964/real 1630585014] req@ffff893d83ed0900 x1709089496650656/t0(0) o250->MGC10.0.205.151@tcp@10.0.2
05.151@tcp:26/25 lens 520/544 e 0 to 1 dl 1630585019 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[670937.177572] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 22 previous similar messages
[671099.175718] Lustre: Evicted from MGS (at 198.19.54.206@tcp1) after server handle changed from 0x5bc0911ee77e15fb to 0x1694543d7262fec
[671099.184834] LustreError: 10110:0:(client.c:3009:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff88dd66d7ad00 x1709089395116832/t95103249569(95103249569) o101->4vozvbmv-MDT0000-mdc-ffff88ddb2378000@10.0.205.151@tcp:12/10
lens 992/544 e 0 to 0 dl 1630585348 ref 2 fl Interpret:RP/4/0 rc 301/301
[671099.184923] Lustre: MGC10.0.205.151@tcp: Connection restored to 198.19.54.206@tcp1 (at 10.0.205.151@tcp)
[671521.495128] Lustre: 4vozvbmv-MDT0000-mdc-ffff88ddb2378000: Connection restored to 10.0.205.151@tcp (at 10.0.205.151@tcp)
[678853.930111] Lustre: 10171:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630592924/real 1630592924] req@ffff8938aefeb900 x1709089498252000/t0(0) o400->4vozvbmv-OST00e3-osc-ffff88dd
b2378000@10.0.220.229@tcp:28/4 lens 224/224 e 0 to 1 dl 1630592931 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[678853.930135] Lustre: 4vozvbmv-OST00e2-osc-ffff88ddb2378000: Connection to 4vozvbmv-OST00e2 (at 10.0.220.229@tcp) was lost; in progress operations using this service will wait for recovery to complete
[678853.952118] Lustre: 10171:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
[678944.042170] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1630593010/real 1630593010] req@ffff893d27b93600 x1709089498268048/t0(0) o8->4vozvbmv-OST00e2-osc-ffff88ddb2
378000@10.0.220.229@tcp:28/4 lens 520/544 e 0 to 1 dl 1630593021 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[678944.063418] Lustre: 10110:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
[679238.188040] Lustre: 4vozvbmv-OST00e3-osc-ffff88ddb2378000: Connection restored to 198.19.32.207@tcp1 (at 10.0.220.229@tcp)
[1038619.687627] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.695161] infiniband efa_0: Failed to register mr [-12]
[1038619.695196] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.700624] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.707090] infiniband efa_0: Failed to register mr [-12]
[1038619.707145] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.707147] infiniband efa_0: Failed to register mr [-12]
[1038619.714600] infiniband efa_0: Failed to register mr [-12]
[1038619.714602] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.714604] infiniband efa_0: Failed to register mr [-12]
[1038619.726444] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.726511] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1038619.726512] infiniband efa_0: Failed to register mr [-12]
[1038619.766218] infiniband efa_0: Failed to register mr [-12]
[1088198.846130] python3[41849]: segfault at 10 ip 00007f6ac5b5a1af sp 00007ffdd32cdd80 error 4 in libc-2.27.so[7f6ac5ac5000+1e7000]
[1090746.882487] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.889954] infiniband efa_0: Failed to register mr [-12]
[1090746.889990] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.895063] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.901576] infiniband efa_0: Failed to register mr [-12]
[1090746.901584] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.901587] infiniband efa_0: Failed to register mr [-12]
[1090746.909000] infiniband efa_0: Failed to register mr [-12]
[1090746.909007] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.914505] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.920860] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.920904] infiniband efa_0: Failed to register mr [-12]
[1090746.920939] infiniband efa_0: Failed to process command REG_MR (opcode 7) comp_status 1 err -12
[1090746.920941] infiniband efa_0: Failed to register mr [-12]
[1090746.925258] infiniband efa_0: Failed to register mr [-12]
[1090746.972127] infiniband efa_0: Failed to register mr [-12]
export OMP_NUM_THREADS=8
export NCCL_NSOCKS_PERTHREAD=8
export NCCL_SOCKET_NTHREADS=8
when world_size = 64, is this setting too big?
NCCL_NSOCKS_PERTHREAD=8 -x NCCL_SOCKET_NTHREADS=8 in your command line is causing each connection to create 64 sockets (each socket using one file), hence
64 sockets per connection 8 processes per node (64-8) remote ranks * 2 (send/receive) = 57344
Hi, I am a little confused how sockets play in this scenario. If you are using EFA, then you are not using sockets. Or Am I missing something?
Hi, I am a little confused how sockets play in this scenario. If you are using EFA, then you are not using sockets. Or Am I missing something?
I see. I checked this by setting them to 1, the same error happened.
@wzamazon I also created an issue at EFA's github:
https://github.com/aws/aws-ofi-nccl/issues/70
If you are in Amazon, can we create a channel to debug? I know EFA's engineer. What's your internal ID at Amazon?
this issue is finally resolved by setting NCCL_BUFFSIZE=2097152
.
I was stuck with this problem for many days. May I know how to debug such issues?
Currently, I am using all_to_all NCCL MPI (world_size=64). The error is as follows.