Closed Zhaojp-Frank closed 6 years ago
I've tried registering a single chunk of >400MB of K40m memory while training a deep learning model (VGG16) with TensorFlow (https://github.com/tensorflow/tensorflow/pull/11392), and we have not seen this problem. Are you sure you have the peer memory module enabled?
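One quick way to answer the peer-memory question is to look for the module in /proc/modules. The following is only a minimal sketch: it assumes the module is loaded under the name nv_peer_mem, and the helper names are hypothetical.

```python
def module_loaded(proc_modules_text, name="nv_peer_mem"):
    """Return True if `name` is the first field of any /proc/modules line."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and fields[0] == name:
            return True
    return False


def check_peer_memory(path="/proc/modules"):
    # Reads the live module list; only meaningful on a Linux host.
    with open(path) as f:
        return module_loaded(f.read())
```

A True result only shows that the module is loaded, not that registration of GPU memory will succeed.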
Any updates on this failure? @Zhaojp-Frank
The problem still exists. Mellanox tech support will help diagnose the issue. It looks like a driver conflict (not sure which one: nv_peer, OFED, or the mlx firmware). I'll post the result soon. Thanks.
@Zhaojp-Frank ?
Hi, we are running a PoC that tries to pin (register) GPU memory, but it fails with the error message below. Registration works for small GPU buffers, such as a few MB, but fails once the size exceeds 128MB. Registering CPU memory works even at very large sizes.
Could you please help check what the cause is? Thanks.
[2017/07/20-04:47:36.231116] xio_rdma_verbs.c:248 [ERROR] - ibv_reg_mr failed, Bad address. addr:0x7f21e3369dc0, length:15863892992, access:0x7
dmesg shows the error in:
[79199.665302] ib_umem_get: failed to get user pages, nr_pages=512
[79199.669653] mlx5_0:mr_umem_get:709:(pid 15855): umem get failed (-131668346275144)
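To make the numbers in these log lines easier to read, here is a small sketch that decodes the length and access fields of the ibv_reg_mr error and converts nr_pages into a size. The flag values are those of enum ibv_access_flags in the verbs header; the 4 KiB page size and the function names are assumptions made for illustration.

```python
import re

# Bit values from enum ibv_access_flags in <infiniband/verbs.h>.
ACCESS_FLAGS = {
    0x1: "IBV_ACCESS_LOCAL_WRITE",
    0x2: "IBV_ACCESS_REMOTE_WRITE",
    0x4: "IBV_ACCESS_REMOTE_READ",
}

PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def decode_reg_mr_error(line):
    """Extract the length (in GiB) and access flag names from the error line."""
    m = re.search(r"length:(\d+), access:(0x[0-9a-fA-F]+)", line)
    if m is None:
        return None
    length = int(m.group(1))
    access = int(m.group(2), 16)
    flags = [name for bit, name in sorted(ACCESS_FLAGS.items()) if access & bit]
    return {"length_gib": length / 2**30, "flags": flags}


def pages_to_mib(nr_pages, page_size=PAGE_SIZE):
    """Convert a page count into MiB for the assumed page size."""
    return nr_pages * page_size / 2**20
```

For the log above, the requested length comes out at roughly 14.77 GiB with local write, remote write, and remote read access, while nr_pages=512 corresponds to 2 MiB under the 4 KiB page assumption.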
Relevant system info:
Mellanox Technologies MT28800 Family ConnectX-5, firmware version: 16.20.1010
256GB system memory (with plenty free at the time)
Ubuntu 16.04, kernel 4.4.0-83-generic
MLNX_OFED_LINUX-4.1-1.0.2.0 (OFED-4.1-1.0.2)
CUDA 8.0, driver 375.66
latest nv_peer_mem (checked out in July 2017)
Tried Nvidia P100 (PCIe) and K80 GPUs.
We already followed the practices described here:
https://community.mellanox.com/docs/DOC-1120
http://www.rdmamojo.com/2012/09/07/ibv_reg_mr/
root@B4130:~/tmp# ulimit -l unlimited
root@B4130:~/tmp# cat /sys/module/mlx4_core/parameters/log_num_mtt
24
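The locked-memory limit can also be checked from inside the registering process itself, which rules out a difference between the shell and the application environment. A minimal sketch using Python's standard resource module (the function names are hypothetical):

```python
import resource


def describe_limit(value):
    """Render an rlimit value the way `ulimit` does for the unlimited case."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memlock_limits():
    # Soft and hard RLIMIT_MEMLOCK for the current process, in bytes.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return describe_limit(soft), describe_limit(hard)
```

Note that getrlimit reports bytes, whereas `ulimit -l` in bash reports kilobytes, so finite values will differ by a factor of 1024.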