tangrc99 opened 1 year ago
@tangrc99 this is expected: the implementation of `ibv_reg_mr` in the Linux kernel requires the virtual address range to be backed by CPU memory pages. More precisely, `pin_user_pages` does not work on CPU mappings of PCIe resources created via `io_remap_pfn_range`.
The official way of enabling RDMA on GPU memory is to export the GPU allocation as a dma-buf file descriptor via `cuMemGetHandleForAddressRange` (with `CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD`) and register that descriptor with `ibv_reg_dmabuf_mr`.
For a full deployment case, see for example https://github.com/openucx/ucx/blob/1308d2055ab0ba948eac213c8cfcd92776c34a53/src/uct/cuda/cuda_copy/cuda_copy_md.c#L410 and https://github.com/openucx/ucx/blob/1308d2055ab0ba948eac213c8cfcd92776c34a53/src/uct/ib/base/ib_md.c#L480.
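For reference, here is a minimal sketch of that dmabuf path. It assumes a CUDA 11.7+ driver API and an rdma-core build with `ibv_reg_dmabuf_mr`; the buffer size, device ordinal, and minimal error handling are illustrative only, and this needs real GPU + RDMA hardware to run:

```c
/* Sketch: export GPU memory as a dma-buf and register it for RDMA.
 * Assumes CUDA 11.7+ and rdma-core with ibv_reg_dmabuf_mr support. */
#include <cuda.h>
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    const size_t len = 1 << 20;   /* 1 MiB, illustrative */
    CUdeviceptr dptr;
    CUdevice dev;
    CUcontext ctx;
    int dmabuf_fd;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&dptr, len);

    /* Export the GPU address range as a dma-buf file descriptor. */
    if (cuMemGetHandleForAddressRange(&dmabuf_fd, dptr, len,
            CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD, 0) != CUDA_SUCCESS) {
        fprintf(stderr, "dmabuf export not supported on this stack\n");
        return 1;
    }

    /* Register the dma-buf with the first RDMA device. */
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    struct ibv_context *ibctx = ibv_open_device(list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ibctx);
    struct ibv_mr *mr = ibv_reg_dmabuf_mr(pd, 0 /* offset */, len,
            (uint64_t)dptr /* iova */, dmabuf_fd,
            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
            IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_dmabuf_mr"); return 1; }

    ibv_dereg_mr(mr);
    cuMemFree(dptr);
    return 0;
}
```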
Thanks. It seems the A10 doesn't support dma-buf file descriptors. Can I use GPUDirect RDMA on the A10 with other methods?
It should. Are you using the open-source (openrm) variant of the GPU kernel-mode driver? See https://developer.nvidia.com/blog/nvidia-releases-open-source-gpu-kernel-modules/.
The function `cuMemGetHandleForAddressRange` requires `CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD`, which reports 0 (unsupported) on the A10. `nv_peer_mem` and `nvidia-peermem` are already loaded; are there any other requirements?
Hi @tangrc99,
Neither `nvidia-peermem` nor `nv_peer_mem` is involved in dmabuf. The A10 should support dmabuf. Could you check whether your SW stack is new enough to support dmabuf?
Thanks. My Linux kernel 4.18.0 is too old.
In that case you can use the legacy RDMA memory registration path, i.e. `ibv_reg_mr`, which involves the peer-direct kernel infrastructure (for example, as provided by MLNX_OFED) and `nvidia-peermem`.
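A minimal sketch of that legacy path, assuming a CUDA runtime and an RDMA stack with peer-direct support plus `nvidia-peermem` (or `nv_peer_mem`) loaded; device selection and sizes are illustrative:

```c
/* Sketch: legacy peer-direct registration, i.e. plain ibv_reg_mr on a
 * GPU virtual address. This only succeeds when nvidia-peermem (or
 * nv_peer_mem) is loaded and the RDMA stack has peer-direct support
 * (e.g. MLNX_OFED); otherwise ibv_reg_mr fails (typically EFAULT). */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    const size_t len = 1 << 20;   /* illustrative size */
    void *gpu_buf = NULL;

    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    struct ibv_context *ctx = ibv_open_device(list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* With peer-direct in place, the kernel pins the GPU pages via the
     * peermem hooks instead of pin_user_pages. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
            IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    return 0;
}
```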
I tried to register `gdr_mem` using `ibv_reg_mr`, but got errno `EFAULT`. I am using the A10 GPU on CentOS 8.5.