When running inside an enroot-based container, lshca showed the HCA ports as if they were OK
In fact, the /dev/infiniband directory, and probably some other bits, were not mounted into the container
Thus the HCAs were inaccessible
It is worth considering an explicit check for such conditions
The described behaviour can be triggered by setting MELLANOX_VISIBLE_DEVICES=none when invoking srun
See the NVIDIA devices documentation
The contents of /dev/infiniband can be related to a BDF via the infiniband_mad or infiniband_verbs directories in /sys/bus/pci/devices/<bdf>/
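
A minimal sketch of such a check, in Python, is given here for illustration only. It assumes that the sub-directories of infiniband_verbs and infiniband_mad (for example uverbs0, umad0, issm0) carry the same names as the device nodes expected under /dev/infiniband. The function names hca_device_nodes and inaccessible_hcas are hypothetical and are not part of lshca or enroot.

```python
import os


def hca_device_nodes(bdf, sysfs_root="/sys/bus/pci/devices",
                     dev_root="/dev/infiniband"):
    """Map each node name found for one BDF to True if it exists in dev_root.

    Assumption: entries under <bdf>/infiniband_verbs and <bdf>/infiniband_mad
    are named like the corresponding nodes in /dev/infiniband.
    """
    nodes = {}
    for subdir in ("infiniband_verbs", "infiniband_mad"):
        path = os.path.join(sysfs_root, bdf, subdir)
        if not os.path.isdir(path):
            continue  # this PCI function exposes no such interface
        for name in sorted(os.listdir(path)):
            # Keep only sub-directories (e.g. uverbs0, umad0, issm0)
            if not os.path.isdir(os.path.join(path, name)):
                continue
            nodes[name] = os.path.exists(os.path.join(dev_root, name))
    return nodes


def inaccessible_hcas(sysfs_root="/sys/bus/pci/devices",
                      dev_root="/dev/infiniband"):
    """Return {bdf: [missing node names]} for devices lacking device nodes."""
    result = {}
    if not os.path.isdir(sysfs_root):
        return result
    for bdf in sorted(os.listdir(sysfs_root)):
        status = hca_device_nodes(bdf, sysfs_root, dev_root)
        missing = [name for name, present in status.items() if not present]
        if missing:
            result[bdf] = missing
    return result


if __name__ == "__main__":
    for bdf, missing in inaccessible_hcas().items():
        print(f"{bdf}: missing in /dev/infiniband: {', '.join(missing)}")
```

Run inside the container, the script prints one line per BDF whose nodes are visible in sysfs but absent from /dev/infiniband, and prints nothing when every node is present. Both root paths are parameters so the logic can be exercised against a mock directory tree.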