Open heaibao817 opened 2 years ago
Hi @heaibao817,
I have a similar problem. In my case, I need to use a GPU buffer for the completion queue. I have a Tesla K40 and a ConnectX-4, and nvidia_peermem is loaded. However, I get a segmentation fault (bad address) when I use a GPU memory address (returned by cudaMalloc). The problem does not occur with a CPU address (returned by malloc). Have you been able to solve the issue you described, and if so, how?
Many thanks in advance
The GDB backtrace is:
#0 0x00007ffff6d16cb4 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1 0x00007ffc805e7b16 in copy_to_scat (scat=0x7ff9bc18f6e0, buf=buf@entry=0x7ff9bc1894c0, size=size@entry=0x7ffa167fe2ec,
#2 0x00007ffc805e7e07 in copy_to_scat (ctx=0x1c1e8780, max=1, size=0x7ffa167fe2ec, buf=0x7ff9bc1894c0, scat=)
#3 mlx5_copy_to_send_wqe (qp=qp@entry=0x7ff9bc18a230, idx=, buf=0x7ff9bc1894c0, size=)
#4 0x00007ffc805e51a4 in mlx5_parse_cqe (lazy=0, cqe_ver=1, wc=0x7ffa167fe5a0, cur_srq=,
#5 mlx5_poll_one (cqe_ver=1, wc=0x7ffa167fe5a0, cur_srq=, cur_rsc=, cq=)
#6 poll_cq (cqe_ver=1, wc=, ne=, ibcq=0x7ff9bc188d40) at ../providers/mlx5/cq.c:932
#7 mlx5_poll_cq_v1 (ibcq=0x7ff9bc188d40, ne=32, wc=) at ../providers/mlx5/cq.c:1306
#8 0x00007ffce1248ab2 in ibv_poll_cq (wc=0x7ffa167fe5a0, num_entries=32, cq=)
It seems that the crash happens inside ibv_poll_cq. When I switch to a CPU address, the problem does not occur. What could be causing this?