Does opa-psm2 support GDRDMA (GPUDirect RDMA) on the ROCm platform, and what do I have to do to enable it?
I am using opa-psm2-PSM2_11.2.NCCL and the psm2-nccl plugin on the ROCm platform, with the environment set like this:
export PSM2_GPUDIRECT=1
export PSM2_CUDA=1
Running rccl-tests produces this error:
node37.219242 Unhandled error in TID Update: Bad address
Debugging the core file gives:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/fd/rccl-tests-master/build/all_gather_perf --minbytes=2621'.
Program terminated with signal SIGABRT, Aborted.
0 0x00002b7bcdacd207 in raise () from /lib64/libc.so.6
0 0x00002b7bcdacd207 in raise () from /lib64/libc.so.6
1 0x00002b7bcdace8f8 in abort () from /lib64/libc.so.6
2 0x00002b7decdf3054 in psmi_errhandler_psm (ep=ep@entry=0x0, err=err@entry=PSM2_INTERNAL_ERR, error_string=error_string@entry=0x2b7f149f9acc " Unhandled error in TID Update: Bad address\n", token=token@entry=0x2b7f149f9ac0)
at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_error.c:96
3 0x00002b7decdf360d in psmi_handle_error (ep=0xfffffffffffffffe, error=PSM2_INTERNAL_ERR, buf=<optimized out>) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_error.c:183
4 0x00002b7dece14e7c in ips_tidcache_register (tidc=tidc@entry=0x2b7e303bf458, start=start@entry=47820958728192, length=131072, firstidx=firstidx@entry=0x2b7f149f9e4c)
at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_tidcache.c:221
5 0x00002b7dece1524c in ips_tidcache_acquire (tidc=tidc@entry=0x2b7e303bf458, buf=0x2b7e2f420000, length=length@entry=0x2b7f149f9ef0, tid_array=tid_array@entry=0x2b7e303bf734, tidcnt=tidcnt@entry=0x2b7f149f9ef4,
tidoff=tidoff@entry=0x2b7f149f9eec) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_tidcache.c:471
6 0x00002b7dece11331 in ips_tid_recv_alloc_frag (nbytes_this=131072, tidrecvc=0x2b7e303bf650, protoexp=0x2b7e303bf440) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:1969
7 ips_tid_recv_alloc (ptidrecvc=<optimized out>, nbytes_this=131072, getreq=<optimized out>, ipsaddr=0x2b7e30853210, protoexp=0x2b7e303bf440) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:2135
8 ips_tid_pendtids_timer_callback (timer=timer@entry=0x2b7e303bf610, current=current@entry=0) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:2379
context=0x2b7e301b9920) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:587
10 0x00002b7dece18793 in ips_proto_mq_rts_match_callback (req=0x2b7e301b9920, was_posted=was_posted@entry=1) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_mq.c:1152
11 0x00002b7dece19690 in ips_proto_mq_handle_rts (rcv_ev=0x2b7f149fa200) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_mq.c:1536
12 0x00002b7dece05c8f in ips_proto_process_packet (rcv_ev=0x2b7f149fa200) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_help.h:555
13 ips_recvhdrq_progress (recvq=0x2b7e301bfb98) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_recvhdrq.c:543
14 0x00002b7dece0353c in ips_ptl_poll (ptl_gen=0x2b7e301b9e80, _ignored=<optimized out>) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ptl.c:541
15 0x00002b7dece00f8f in __psmi_poll_internal (ep=0x2b7e301b9ac0, poll_amsh=<optimized out>) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm.c:1071
16 0x00002b7decdfaeec in psmi_mq_ipeek_inner (status_copy=<optimized out>, status=0x0, oreq=0x2b7f149fa438, mq=0x2b7e3010bf80) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_mq.c:1135
17 __psm2_mq_ipeek (mq=0x2b7e3010bf80, oreq=0x2b7f149fa438, status=0x0) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_mq.c:1174
18 0x00002b7e23a027b3 in psm2_nccl_test () from /home/fd/psm2-nccl-master/librccl-net.so
19 0x00002b7bc88fee2d in ncclNetTest (request=0x3586a, done=0x2b7f149fa4f4, size=0x2b7f149fa4cc) at /home/fd/rccl-dtk-21.10/src/include/net.h:29
20 netRecvProxy (args=<optimized out>) at /home/fd/rccl-dtk-21.10/src/transport/net.cc:516
21 0x00002b7bc8916de4 in progressOps (state=<optimized out>, opsPtr=<optimized out>, idle=<optimized out>, comm=<optimized out>) at /home/fd/rccl-dtk-21.10/src/proxy.cc:342
22 persistentThread (comm_=0x2b7e30000c00) at /home/fd/rccl-dtk-21.10/src/proxy.cc:440
23 0x00002b7bc839edd5 in start_thread () from /lib64/libpthread.so.0
24 0x00002b7bcdb94ead in clone () from /lib64/libc.so.6