Mellanox / nv_peer_memory


Is there a size limitation on registering GPU memory #101

Closed heaibao817 closed 2 years ago

heaibao817 commented 2 years ago

When I call this function: ibv_reg_mr(rc_get_pd(), GPU_ADDR, SIZE, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ)

If SIZE is over 500 MB, it returns NULL.

I want to know whether this problem is caused by a limitation in nv_peer_memory or in the RDMA driver.

GPU: Tesla T4, Driver Version: 460.91.03, CUDA Version: 11.2, RDMA NIC: mlx5_0

MassimoGirondi commented 2 years ago

Hi! I'm investigating this too. It looks like a limitation of either the nv_peer_memory module or the GPU driver: I can easily register more than 64 GB when using host memory with a Mellanox CX-5. For me it fails at around 200 MB. I am using the built-in nvidia_peermem module, for what it's worth.

MassimoGirondi commented 2 years ago

According to https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#display-bar-space

[...] It can be used to understand the application usage of BAR space, the primary resource consumed by GPUDirect RDMA mappings.

a certain amount of BAR space is reserved by the driver for internal use, so not all available memory may be usable via GPUDirect RDMA

So we are limited by the BAR size, which you can check with nvidia-smi -q
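To make that concrete, the BAR1 figures can be pulled out of the nvidia-smi -q output with a short script. The here-document below stands in for real output (the numbers are illustrative for a T4); on an actual system you would pipe nvidia-smi -q -d MEMORY instead:

```shell
# Extract the BAR1 total from (sample) `nvidia-smi -q` output.
# The here-document mimics the "BAR1 Memory Usage" section; replace it
# with `nvidia-smi -q -d MEMORY` on a real machine.
bar1_total=$(awk '/BAR1 Memory Usage/{f=1} f && /Total/{print $(NF-1); exit}' <<'EOF'
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
EOF
)
echo "BAR1 total: ${bar1_total} MiB"
```

Registrations via GPUDirect RDMA have to fit inside that BAR1 aperture (minus whatever the driver reserves), regardless of how much device memory the GPU has.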

heaibao817 commented 2 years ago

Thank you, it has been solved. The maximum registration size is the BAR size; on the T4 it is 256 MB. When I switched to a V100 there was no such limitation.