NVIDIA / gdrcopy

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
MIT License

call ibv_reg_mr failed using mapped memory #266

Open tangrc99 opened 1 year ago

tangrc99 commented 1 year ago
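    gpu_mem_handle_t m_t;    // from gdrcopy/test/common.h
    CUresult r;
    int ret;

    // Allocate GPU memory (10000 bytes requested).
    if ((r = gpu_mem_alloc(&m_t, 10000, 1, 1)) != CUDA_SUCCESS) {
        return -1;
    }
    gdr_mh_t handle;
    void *gpu_mapped_mem = NULL;   // gdr_map takes a void **

    // g_t is presumably a gdr_t obtained earlier from gdr_open().
    // Pin the GPU buffer, then map it into the CPU address space.
    if ((ret = gdr_pin_buffer(g_t, m_t.ptr, m_t.allocated_size, 0, 0, &handle)) != 0) {
        return -1;
    }
    if ((ret = gdr_map(g_t, handle, &gpu_mapped_mem, m_t.allocated_size)) != 0) {
        return -1;
    }

    char *gdr_mem = (char *)gpu_mapped_mem;  // the ptr I try to register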

I tried to register gdr_mem using ibv_reg_mr, but the call failed with errno EFAULT. I am using an A10 GPU on CentOS 8.5.

drossetti commented 1 year ago

@tangrc99 this is expected, as the implementation of ibv_reg_mr in the Linux kernel requires the virtual address range to be backed by CPU memory pages.

More precisely, pin_user_pages does not work on CPU mappings of PCIe resources created via io_remap_pfn_range.
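One way to see this from user space, as a rough sketch: mappings created with io_remap_pfn_range are marked VM_IO and VM_PFNMAP, which /proc/<pid>/smaps reports as the `io` and `pf` mnemonics on the VmFlags line, and page pinning rejects such mappings with EFAULT. The helper names below are hypothetical, and the sample VmFlags strings in the usage note are illustrative, not captured from a real gdrdrv mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if a "VmFlags:" line from /proc/<pid>/smaps contains the
 * two-letter mnemonic `flag` as a whole token. */
bool vmflags_has(const char *line, const char *flag)
{
    assert(line != NULL && flag != NULL && strlen(flag) == 2);

    /* Skip the "VmFlags:" label if present. */
    const char *p = strchr(line, ':');
    p = p ? p + 1 : line;

    while (*p) {
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;
        if (!*p)
            break;
        size_t len = strcspn(p, " \t\n");
        if (len == 2 && strncmp(p, flag, 2) == 0)
            return true;
        p += len;
    }
    return false;
}

/* Hypothetical helper: a mapping flagged "io" (VM_IO) or "pf" (VM_PFNMAP)
 * is not backed by ordinary CPU pages, so it cannot be pinned and passed
 * to ibv_reg_mr. */
bool vma_is_pinnable(const char *vmflags_line)
{
    return !vmflags_has(vmflags_line, "io") && !vmflags_has(vmflags_line, "pf");
}
```

For example, `vma_is_pinnable("VmFlags: rd wr mr mw me ac sd")` is true for an ordinary anonymous mapping, while a line containing `pf io` yields false.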

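The official way of enabling RDMA on GPU memory is the dma-buf path: obtain a dma-buf file descriptor for the GPU buffer (cuMemGetHandleForAddressRange) and register that descriptor with the RDMA stack, rather than calling ibv_reg_mr on a CPU mapping.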

For a full deployment case, see for example https://github.com/openucx/ucx/blob/1308d2055ab0ba948eac213c8cfcd92776c34a53/src/uct/cuda/cuda_copy/cuda_copy_md.c#L410 and https://github.com/openucx/ucx/blob/1308d2055ab0ba948eac213c8cfcd92776c34a53/src/uct/ib/base/ib_md.c#L480.
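The dma-buf path that the two UCX links point at can be sketched roughly as follows: export a dma-buf file descriptor for the GPU range with cuMemGetHandleForAddressRange, then register it with ibv_reg_dmabuf_mr. This is a minimal sketch, not the UCX implementation. The names `dmabuf_align_range` and `reg_gpu_dmabuf` and the `USE_CUDA_VERBS` macro are hypothetical. It assumes CUDA 11.7 or newer, an rdma-core that provides ibv_reg_dmabuf_mr, a current CUDA context, and an existing protection domain. The CUDA call expects the address and size to be aligned to the host page size, so the range is widened first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Page-aligned range covering [addr, addr + len), plus the offset of
 * addr inside that range. */
struct dmabuf_range {
    uint64_t base;    /* addr rounded down to a page boundary */
    size_t   size;    /* covered length rounded up to whole pages */
    uint64_t offset;  /* addr - base */
};

struct dmabuf_range dmabuf_align_range(uint64_t addr, size_t len, size_t page_size)
{
    /* page_size must be a non-zero power of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    uint64_t mask = (uint64_t)page_size - 1;
    struct dmabuf_range r;
    r.base   = addr & ~mask;
    r.offset = addr - r.base;
    r.size   = (size_t)((r.offset + len + mask) & ~mask);
    return r;
}

#ifdef USE_CUDA_VERBS   /* hypothetical macro: build with CUDA and libibverbs */
#include <cuda.h>
#include <infiniband/verbs.h>
#include <unistd.h>

/* Registers GPU memory through a dma-buf fd. On success, *fd_out holds the
 * fd; the caller closes it once the MR is no longer needed. */
struct ibv_mr *reg_gpu_dmabuf(struct ibv_pd *pd, CUdeviceptr dptr, size_t len,
                              int *fd_out)
{
    struct dmabuf_range r =
        dmabuf_align_range((uint64_t)dptr, len, (size_t)sysconf(_SC_PAGESIZE));
    int fd = -1;

    if (cuMemGetHandleForAddressRange(&fd, (CUdeviceptr)r.base, r.size,
                                      CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD,
                                      0) != CUDA_SUCCESS)
        return NULL;

    /* offset is relative to the start of the dma-buf; iova is the address
     * the work requests will use, here the device pointer itself. */
    struct ibv_mr *mr = ibv_reg_dmabuf_mr(pd, r.offset, len, (uint64_t)dptr, fd,
                                          IBV_ACCESS_LOCAL_WRITE |
                                          IBV_ACCESS_REMOTE_READ |
                                          IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return mr;
}
#endif
```

With the snippet from the top of this thread, the device pointer to pass would be `m_t.ptr`, not the CPU pointer returned by gdr_map.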

tangrc99 commented 1 year ago

Thanks. The A10 does not seem to support dma-buf file descriptors. Can I use GDR on the A10 some other way?

drossetti commented 1 year ago

It should. Are you using the openrm variant of the GPU kernel-mode driver? See https://developer.nvidia.com/blog/nvidia-releases-open-source-gpu-kernel-modules/

tangrc99 commented 1 year ago

The function cuMemGetHandleForAddressRange requires CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD, which is 0 on the A10. nv_peer_mem and nvidia-peermem are already loaded. Are there any other requirements?

pakmarkthub commented 1 year ago

Hi @tangrc99,

Neither nvidia-peermem nor nv_peer_mem is involved in the dmabuf path. The A10 should support dmabuf. Could you check whether your software stack is new enough to support dmabuf?
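A quick way to check the kernel side of that question, as a sketch: parse the release string reported by uname(2) and compare it with a minimum version. The 5.12 threshold is an assumption, based on when dma-buf memory registration is generally documented as available in the upstream RDMA stack. Distribution kernels with backports, the GPU driver flavor, and the CUDA and rdma-core versions also matter and are not checked here. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Returns true if a kernel release string such as "4.18.0-348.el8.x86_64"
 * is at least want_major.want_minor. */
bool kernel_at_least(const char *release, int want_major, int want_minor)
{
    assert(release != NULL);

    int major = 0, minor = 0;
    if (sscanf(release, "%d.%d", &major, &minor) != 2)
        return false;  /* unrecognized format: treat as "not new enough" */

    return major > want_major ||
           (major == want_major && minor >= want_minor);
}

/* Checks the running kernel against the assumed 5.12 minimum. */
bool running_kernel_may_support_dmabuf_mr(void)
{
    struct utsname u;
    if (uname(&u) != 0)
        return false;
    return kernel_at_least(u.release, 5, 12);
}
```

A stock CentOS 8 release string beginning with `4.18.0` fails this check, while a `5.12` or later upstream kernel passes it.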

tangrc99 commented 1 year ago

Thanks. My Linux kernel, 4.18.0, is too old.

drossetti commented 1 year ago

In that case you can use the legacy RDMA memory registration path, i.e. ibv_reg_mr, which involves the peer-direct kernel infrastructure (for example provided by MLNX_OFED) and nvidia-peermem.
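A rough sketch of that legacy path, under stated assumptions: with a peer-memory module loaded, the pointer passed to ibv_reg_mr is the CUDA device address itself (m_t.ptr in the snippet at the top of this thread), not the CPU mapping returned by gdr_map. The helper names and the `USE_CUDA_VERBS` macro are hypothetical. The module names checked are the /proc/modules spellings of nvidia-peermem and nv_peer_mem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* True if a line of /proc/modules names one of the peer-memory modules.
 * The module name is the first whitespace-delimited field. */
bool is_peermem_module_line(const char *line)
{
    assert(line != NULL);

    size_t n = strcspn(line, " \t\n");
    return (n == strlen("nvidia_peermem") &&
            strncmp(line, "nvidia_peermem", n) == 0) ||
           (n == strlen("nv_peer_mem") &&
            strncmp(line, "nv_peer_mem", n) == 0);
}

/* Scans /proc/modules for a loaded peer-memory module. */
bool peermem_loaded(void)
{
    FILE *f = fopen("/proc/modules", "r");
    if (!f)
        return false;

    char line[512];
    bool found = false;
    while (!found && fgets(line, sizeof line, f))
        found = is_peermem_module_line(line);
    fclose(f);
    return found;
}

#ifdef USE_CUDA_VERBS   /* hypothetical macro: build with CUDA and libibverbs */
#include <cuda.h>
#include <infiniband/verbs.h>

/* Registers the GPU device address directly; this only works when the
 * peer-direct infrastructure and a peer-memory module are present. */
struct ibv_mr *reg_gpu_legacy(struct ibv_pd *pd, CUdeviceptr dptr, size_t len)
{
    if (!peermem_loaded())
        return NULL;
    return ibv_reg_mr(pd, (void *)(uintptr_t)dptr, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
#endif
```

The module check is only a convenience. If registration still fails with a module loaded, the RDMA stack itself may lack peer-direct support.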