Mellanox / nv_peer_memory

I encountered a segmentation fault while transmitting with a GPU address. #104

Open heaibao817 opened 2 years ago

heaibao817 commented 2 years ago

The GDB backtrace is:

#0  0x00007ffff6d16cb4 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x00007ffc805e7b16 in copy_to_scat (scat=0x7ff9bc18f6e0, buf=buf@entry=0x7ff9bc1894c0, size=size@entry=0x7ffa167fe2ec,
    max=max@entry=1, ctx=ctx@entry=0x1c1e8780) at ../providers/mlx5/qp.c:88
#2  0x00007ffc805e7e07 in copy_to_scat (ctx=0x1c1e8780, max=1, size=0x7ffa167fe2ec, buf=0x7ff9bc1894c0, scat=)
    at ../providers/mlx5/qp.c:78
#3  mlx5_copy_to_send_wqe (qp=qp@entry=0x7ff9bc18a230, idx=, buf=0x7ff9bc1894c0, size=)
    at ../providers/mlx5/qp.c:161
#4  0x00007ffc805e51a4 in mlx5_parse_cqe (lazy=0, cqe_ver=1, wc=0x7ffa167fe5a0, cur_srq=,
    cur_rsc=<synthetic pointer>, cqe=<optimized out>, cqe64=<optimized out>, cq=<optimized out>) at ../providers/mlx5/cq.c:743
#5  mlx5_poll_one (cqe_ver=1, wc=0x7ffa167fe5a0, cur_srq=, cur_rsc=, cq=)
    at ../providers/mlx5/cq.c:904
#6  poll_cq (cqe_ver=1, wc=, ne=, ibcq=0x7ff9bc188d40) at ../providers/mlx5/cq.c:932
#7  mlx5_poll_cq_v1 (ibcq=0x7ff9bc188d40, ne=32, wc=) at ../providers/mlx5/cq.c:1306
#8  0x00007ffce1248ab2 in ibv_poll_cq (wc=0x7ffa167fe5a0, num_entries=32, cq=)
    /include/infiniband/verbs.h:2456

It looks like the crash happens inside ibv_poll_cq. When I switch to a CPU address, the problem does not occur. What is going wrong?

nnurlan008 commented 9 months ago

Hi @heaibao817,

I have a similar problem. In my case, I need to use a GPU buffer for the completion queue. I have a Tesla K40 and a ConnectX-4, and nvidia_peermem is loaded. However, I get a "segmentation fault - bad address" error when I use a GPU memory address (returned by cudaMalloc). The problem does not occur with a CPU address (returned by malloc). Have you been able to solve the issue you described, and if so, how?

Many thanks in advance