Mellanox / nv_peer_memory


Failed to reg big GPU mem #27

Closed Zhaojp-Frank closed 6 years ago

Zhaojp-Frank commented 7 years ago

Hi, we are running a PoC that tries to pin GPU memory, but registration fails with the error below. Registering a small GPU buffer (a few MB) works, but it fails once the size exceeds 128 MB. Registering CPU memory works even at very large sizes.

Could you please help check what the cause is? Thanks.

[2017/07/20-04:47:36.231116] xio_rdma_verbs.c:248 [ERROR] - ibv_reg_mr failed, Bad address. addr:0x7f21e3369dc0, length:15863892992, access:0x7

dmesg shows this error:

[79199.665302] ib_umem_get: failed to get user pages, nr_pages=512
[79199.669653] mlx5_0:mr_umem_get:709:(pid 15855): umem get failed (-131668346275144)
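To put the numbers in the log above into more familiar units, here is a small arithmetic sketch. It assumes 4 KiB pages, which the log does not state, and the helper names are made up for illustration.

```python
# Assumption: 4 KiB pages; the log does not say which page size is in use.
PAGE_SIZE = 4096


def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


def pages_to_bytes(nr_pages, page_size=PAGE_SIZE):
    """Convert a page count to bytes for the given page size."""
    return nr_pages * page_size


if __name__ == "__main__":
    # "length" from the ibv_reg_mr error line
    print(f"{bytes_to_gib(15863892992):.2f} GiB")   # 14.77 GiB
    # "nr_pages" from the ib_umem_get line
    print(pages_to_bytes(512) // 2**20, "MiB")      # 2 MiB
```

So the length passed to ibv_reg_mr in this log is about 14.77 GiB, and under the 4 KiB assumption the failing ib_umem_get batch covers 2 MiB.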

Relevant system info:
Mellanox Technologies MT28800 Family ConnectX-5, firmware version 16.20.1010
256 GB system memory (with plenty free at the time)
Ubuntu 16.04, kernel 4.4.0-83-generic
MLNX_OFED_LINUX-4.1-1.0.2.0 (OFED-4.1-1.0.2)
CUDA 8.0, driver 375.66
Latest nv_peer_memory (checked out in July 2017)
Tried NVIDIA P100 (PCIe) and K80 GPUs

We have already followed the practices described here:
https://community.mellanox.com/docs/DOC-1120
http://www.rdmamojo.com/2012/09/07/ibv_reg_mr/

root@B4130:~/tmp# ulimit -l
unlimited
root@B4130:~/tmp# cat /sys/module/mlx4_core/parameters/log_num_mtt
24
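The ulimit -l value above applies to that shell and the processes started from it. If the application that calls ibv_reg_mr is launched some other way, its limit can differ. As a rough sketch (not part of the original report), the limit can be read from inside a process with Python's standard resource module. The helper names are hypothetical, and getrlimit reports bytes whereas ulimit -l reports KiB.

```python
import resource


def describe_limit(value):
    """Render an rlimit value; RLIM_INFINITY means no limit is set."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("RLIMIT_MEMLOCK soft:", soft)
    print("RLIMIT_MEMLOCK hard:", hard)
```

Running this in the same environment as the failing application would show whether the unlimited setting actually reaches it.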

byronyi commented 7 years ago

I've tried registering a single chunk of more than 400 MB of K40m memory while training a deep learning model (VGG16) with TensorFlow (https://github.com/tensorflow/tensorflow/pull/11392), and we have not seen this problem. Are you sure the peer memory module is enabled?
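One way to answer the question above is to look for the module in /proc/modules. The sketch below assumes the kernel module built from this repository is named nv_peer_mem. The thread does not confirm that name, and the helper functions are hypothetical.

```python
# Assumption: the kernel module is named "nv_peer_mem".
MODULE_NAME = "nv_peer_mem"


def module_loaded(name, proc_modules_text):
    """Return True if `name` is the first field of any /proc/modules line."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def read_proc_modules(path="/proc/modules"):
    """Read the list of loaded kernel modules (Linux only)."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    loaded = module_loaded(MODULE_NAME, read_proc_modules())
    print(f"{MODULE_NAME} loaded: {loaded}")
```

The match is on the exact module name in the first column, so a module whose name merely contains the string would not count as loaded.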

ferasd commented 7 years ago

Any updates on this failure? @Zhaojp-Frank

Zhaojp-Frank commented 6 years ago

The problem still exists. Mellanox tech support will help diagnose the issue. It looks like a driver conflict (not sure which component: nv_peer_memory, OFED, or the Mellanox firmware). I'll post the result soon. Thanks.

ferasd commented 6 years ago

@Zhaojp-Frank ?