MrBr-github / lshca

GNU General Public License v3.0

Partial propagation of userspace to container #83

Open MrBr-github opened 1 year ago

MrBr-github commented 1 year ago

When running in an enroot-based container, lshca shows the HCA ports as if they were OK. In fact, the /dev/infiniband directory, and probably some other bits, were not mounted into the container, so the HCAs were inaccessible. It is worth considering a check for such things.

The described behaviour can be reproduced by setting MELLANOX_VISIBLE_DEVICES=none in srun. See the NVIDIA devices documentation.

The contents of /dev/infiniband can be related to a BDF via the infiniband_mad or infiniband_verbs directories in /sys/bus/pci/devices/<bdf>/:

root@j....3 /sys/bus/pci/devices/0000:81:00.0
#ll /dev/infiniband/
total 0
crw------- 1 root root 231,  64 Sep 21 21:42 issm0
crw------- 1 root root 231,  65 Sep 21 21:42 issm1
crw------- 1 root root 231,  66 Sep 21 21:42 issm2
crw------- 1 root root 231,  67 Sep 21 21:42 issm3
crw------- 1 root root 231,  68 Sep 21 21:42 issm4
crw------- 1 root root 231,  69 Sep 21 21:42 issm5
crw-rw-rw- 1 root root  10,  56 Sep 21 21:43 rdma_cm
crw-rw-rw- 1 root root 231,   0 Sep 21 21:42 umad0
crw-rw-rw- 1 root root 231,   1 Sep 21 21:42 umad1
crw-rw-rw- 1 root root 231,   2 Sep 21 21:42 umad2
crw-rw-rw- 1 root root 231,   3 Sep 21 21:42 umad3
crw-rw-rw- 1 root root 231,   4 Sep 21 21:42 umad4
crw-rw-rw- 1 root root 231,   5 Sep 21 21:42 umad5
crw-rw-rw- 1 root root 231, 192 Sep 21 21:42 uverbs0
crw-rw-rw- 1 root root 231, 193 Sep 21 21:42 uverbs1
crw-rw-rw- 1 root root 231, 194 Sep 21 21:42 uverbs2
crw-rw-rw- 1 root root 231, 195 Sep 21 21:42 uverbs3
crw-rw-rw- 1 root root 231, 196 Sep 21 21:42 uverbs4
crw-rw-rw- 1 root root 231, 197 Sep 21 21:42 uverbs5

root@j...3 /sys/bus/pci/devices/0000:81:00.0
#ll /sys/bus/pci/devices/0000:82:00.0/infiniband_mad
total 0
drwxr-xr-x 3 root root 0 Sep 23 16:54 issm4
drwxr-xr-x 3 root root 0 Sep 23 16:54 umad4
root@j....3 /sys/bus/pci/devices/0000:81:00.0
#ll /sys/bus/pci/devices/0000:82:00.0/infiniband_verbs
total 0
drwxr-xr-x 3 root root 0 Sep 23 16:54 uverbs4
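
A rough sketch of how such a check might look, based on the listings above: for a given BDF, collect the names under infiniband_mad and infiniband_verbs in sysfs and report those with no matching node in /dev/infiniband. The function name `missing_device_nodes` and its parameters are hypothetical and not part of lshca. It also assumes sysfs stays visible inside the container, which seems to be the case here since lshca still reports the ports.

```python
import os


def missing_device_nodes(bdf,
                         sysfs_root="/sys/bus/pci/devices",
                         dev_root="/dev/infiniband"):
    """Return device names that sysfs lists for `bdf` but that have no
    corresponding node under `dev_root`.

    Hypothetical helper for illustration only; not an lshca API.
    """
    missing = []
    for subdir in ("infiniband_mad", "infiniband_verbs"):
        sys_dir = os.path.join(sysfs_root, bdf, subdir)
        if not os.path.isdir(sys_dir):
            # The device does not expose this class; nothing to compare.
            continue
        for name in sorted(os.listdir(sys_dir)):
            # e.g. umad4 in sysfs should have a /dev/infiniband/umad4 node
            if not os.path.exists(os.path.join(dev_root, name)):
                missing.append(name)
    return missing
```

Calling `missing_device_nodes("0000:82:00.0")` inside the container would then return a non-empty list when /dev/infiniband was not mounted, and lshca could use that to flag the HCA as inaccessible instead of showing it as OK.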
MrBr-github commented 1 year ago

https://www.kernel.org/doc/Documentation/infiniband/user_verbs.txt