cornelisnetworks / opa-psm2

How to use the GDRDMA on the ROCm platform? #63

Open flyingdown opened 2 years ago

flyingdown commented 2 years ago

Does opa-psm2 support GDRDMA on the ROCm platform, and what do I have to do to enable it? I am using opa-psm2-PSM2_11.2.NCCL and the psm2-nccl plugin on the ROCm platform, with the environment set like this:

export PSM2_GPUDIRECT=1
export PSM2_CUDA=1

Running rccl-tests then fails with this error:

node37.219242 Unhandled error in TID Update: Bad address

[node37:219242] Process received signal
[node37:219242] Signal: Aborted (6)
[node37:219242] Signal code: (-6)
[node37:219242] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b7bc83a65d0]
[node37:219242] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b7bcdacd207]
[node37:219242] [ 2] /lib64/libc.so.6(abort+0x148)[0x2b7bcdace8f8]
[node37:219242] [ 3] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x16054)[0x2b7decdf3054]
[node37:219242] [ 4] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x1660d)[0x2b7decdf360d]
[node37:219242] [ 5] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x37e7c)[0x2b7dece14e7c]
[node37:219242] [ 6] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x3824c)[0x2b7dece1524c]
[node37:219242] [ 7] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x34331)[0x2b7dece11331]
[node37:219242] [ 8] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x34898)[0x2b7dece11898]
[node37:219242] [ 9] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x3b793)[0x2b7dece18793]
[node37:219242] [10] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x3c690)[0x2b7dece19690]
[node37:219242] [11] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x28c8f)[0x2b7dece05c8f]
[node37:219242] [12] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x2653c)[0x2b7dece0353c]
[node37:219242] [13] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x23f8f)[0x2b7dece00f8f]
[node37:219242] [14] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(psm2_mq_ipeek+0x7c)[0x2b7decdfaeec]
[node37:219242] [15] /home/fd/psm2-nccl-master/librccl-net.so(psm2_nccl_test+0xb3)[0x2b7e23a027b3]

Debugging the core file with gdb gives:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/fd/rccl-tests-master/build/all_gather_perf --minbytes=2621'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00002b7bcdacd207 in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x2b7f14a00700 (LWP 219289))]
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.172-2.el7.x86_64 elfutils-libs-0.172-2.el7.x86_64 glibc-2.17-260.el7.x86_64 infinipath-psm-3.3-26_g604758e_open.2.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-9.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 libibverbs-17.2-3.el7.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-17.2-3.el7.x86_64 libstdc++-4.8.5-36.el7.x86_64 libuuid-2.23.2-59.el7.x86_64 numactl-libs-2.0.9-7.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 systemd-libs-219-62.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0 0x00002b7bcdacd207 in raise () from /lib64/libc.so.6
#1 0x00002b7bcdace8f8 in abort () from /lib64/libc.so.6
#2 0x00002b7decdf3054 in psmi_errhandler_psm (ep=ep@entry=0x0, err=err@entry=PSM2_INTERNAL_ERR, error_string=error_string@entry=0x2b7f149f9acc " Unhandled error in TID Update: Bad address\n", token=token@entry=0x2b7f149f9ac0) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_error.c:96
#3 0x00002b7decdf360d in psmi_handle_error (ep=0xfffffffffffffffe, error=PSM2_INTERNAL_ERR, buf=) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_error.c:183
#4 0x00002b7dece14e7c in ips_tidcache_register (tidc=tidc@entry=0x2b7e303bf458, start=start@entry=47820958728192, length=131072, firstidx=firstidx@entry=0x2b7f149f9e4c) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_tidcache.c:221
#5 0x00002b7dece1524c in ips_tidcache_acquire (tidc=tidc@entry=0x2b7e303bf458, buf=0x2b7e2f420000, length=length@entry=0x2b7f149f9ef0, tid_array=tid_array@entry=0x2b7e303bf734, tidcnt=tidcnt@entry=0x2b7f149f9ef4, tidoff=tidoff@entry=0x2b7f149f9eec) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_tidcache.c:471
#6 0x00002b7dece11331 in ips_tid_recv_alloc_frag (nbytes_this=131072, tidrecvc=0x2b7e303bf650, protoexp=0x2b7e303bf440) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:1969
#7 ips_tid_recv_alloc (ptidrecvc=, nbytes_this=131072, getreq=, ipsaddr=0x2b7e30853210, protoexp=0x2b7e303bf440) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:2135
#8 ips_tid_pendtids_timer_callback (timer=timer@entry=0x2b7e303bf610, current=current@entry=0) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:2379
#9 0x00002b7dece11898 in ips_protoexp_tid_get_from_token (protoexp=0x2b7e303bf440, buf=0x2b7e2f420000, length=2097152, epaddr=0x2b7e30853210, remote_tok=1023, flags=, callback=0x2b7dece16b50 , context=0x2b7e301b9920) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:587
#10 0x00002b7dece18793 in ips_proto_mq_rts_match_callback (req=0x2b7e301b9920, was_posted=was_posted@entry=1) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_mq.c:1152
#11 0x00002b7dece19690 in ips_proto_mq_handle_rts (rcv_ev=0x2b7f149fa200) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_mq.c:1536
#12 0x00002b7dece05c8f in ips_proto_process_packet (rcv_ev=0x2b7f149fa200) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_help.h:555
#13 ips_recvhdrq_progress (recvq=0x2b7e301bfb98) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_recvhdrq.c:543
#14 0x00002b7dece0353c in ips_ptl_poll (ptl_gen=0x2b7e301b9e80, _ignored=) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ptl.c:541
#15 0x00002b7dece00f8f in __psmi_poll_internal (ep=0x2b7e301b9ac0, poll_amsh=) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm.c:1071
#16 0x00002b7decdfaeec in psmi_mq_ipeek_inner (status_copy=, status=0x0, oreq=0x2b7f149fa438, mq=0x2b7e3010bf80) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_mq.c:1135
#17 __psm2_mq_ipeek (mq=0x2b7e3010bf80, oreq=0x2b7f149fa438, status=0x0) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_mq.c:1174
#18 0x00002b7e23a027b3 in psm2_nccl_test () from /home/fd/psm2-nccl-master/librccl-net.so
#19 0x00002b7bc88fee2d in ncclNetTest (request=0x3586a, done=0x2b7f149fa4f4, size=0x2b7f149fa4cc) at /home/fd/rccl-dtk-21.10/src/include/net.h:29
#20 netRecvProxy (args=) at /home/fd/rccl-dtk-21.10/src/transport/net.cc:516
#21 0x00002b7bc8916de4 in progressOps (state=, opsPtr=, idle=, comm=) at /home/fd/rccl-dtk-21.10/src/proxy.cc:342
#22 persistentThread (comm_=0x2b7e30000c00) at /home/fd/rccl-dtk-21.10/src/proxy.cc:440
#23 0x00002b7bc839edd5 in start_thread () from /lib64/libpthread.so.0
#24 0x00002b7bcdb94ead in clone () from /lib64/libc.so.6
(gdb)
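A side note for anyone reading the trace: gdb prints `start` in frame 4 in decimal and `buf` in frames 5 and 9 in hex, so it is not obvious that they are the same address. The short Python sketch below only does that arithmetic on values copied from the backtrace; the variable names are mine, and it is a reading of the trace, not a confirmed diagnosis. Whether that address is GPU memory is an assumption based on PSM2_GPUDIRECT=1 being set.

```python
# Values copied from frames 4, 5, 6 and 9 of the backtrace above.
start = 47820958728192   # ips_tidcache_register(start=...), printed in decimal
buf = 0x2b7e2f420000     # ips_tidcache_acquire / ips_protoexp_tid_get_from_token (buf=...)
chunk_len = 131072       # length / nbytes_this of the failing registration
total_len = 2097152      # length of the whole expected receive (frame 9)

same_address = (start == buf)               # does the failure hit the start of the buffer?
chunk_kib = chunk_len // 1024               # size of one registration in KiB
chunks_in_message = total_len // chunk_len  # registrations needed for the full message

print(hex(start), same_address)
print(chunk_kib, "KiB per registration,", chunks_in_message, "chunks in the message")
```

So the "Bad address" is reported for the very first 128 KiB window of the 2 MiB receive buffer, i.e. the registration fails immediately rather than partway through the transfer.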

BrendanCunningham commented 2 years ago

@flyingdown Unfortunately opa-psm2 does not support ROCm GDRDMA.

flyingdown commented 2 years ago

Thanks for your reply. I am not sure whether the adaptation work should be done on the OPA side or the ROCm side. If it is on the OPA side, is there any plan to do it in the future?