Azure / azhpc-images

Azure HPC/AI VM Images
MIT License

CentOS 7.7 HPC on Standard NC24rs_v3 - GPUs/IB devices missing #6

Closed tbugfinder closed 4 years ago

tbugfinder commented 4 years ago

Hello,

I'm re-posting my initial question from https://github.com/openlogic/AzureBuildCentOS/issues/92:

I'm using the CentOS 7.7 HPC image as the source for a packer build. One of the first build steps runs lspci. Unfortunately, the output shows only one of the 4 NVIDIA GPUs and no Mellanox IB device. The VM type is Standard NC24rs_v3 (24 vCPUs, 448 GiB memory) in region westeurope; my understanding is that it should support SR-IOV.

    azure-arm: + lspci
    azure-arm: 0000:00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (AGP disabled) (rev 03)
    azure-arm: 0000:00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 01)
    azure-arm: 0000:00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
    azure-arm: 0000:00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
    azure-arm: 0000:00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
    azure-arm: 3130:00:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
    azure-arm: + source /etc/os-release

Running the CentOS 7.7 HPC image on HB60rs does list the Mellanox IB device:

    azure-arm: + lspci
    azure-arm: 0000:00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (AGP disabled) (rev 03)
    azure-arm: 0000:00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 01)
    azure-arm: 0000:00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
    azure-arm: 0000:00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
    azure-arm: 0000:00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
    azure-arm: be1a:00:02.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]

Thanks.
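For anyone who wants a packer build to fail early on this symptom, here is a minimal sketch that counts GPU and IB entries in lspci output like the listings above. The function names, the `azure-arm:` prefix handling, and the expected counts (4 GPUs and 1 IB device for NC24rs_v3, taken from the description in this post) are assumptions for illustration, not part of the HPC image or of packer.

```python
import re

# Excerpt of the NC24rs_v3 listing from this issue (only one GPU, no IB device).
NC24_SAMPLE = """\
azure-arm: 0000:00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
azure-arm: 3130:00:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
"""


def count_devices(lspci_output: str) -> dict:
    """Count NVIDIA GPU and Mellanox InfiniBand entries in lspci text."""
    counts = {"nvidia_gpu": 0, "mellanox_ib": 0}
    for line in lspci_output.splitlines():
        # Strip the packer log prefix if present (assumed to be "azure-arm:").
        line = re.sub(r"^\s*azure-arm:\s*", "", line)
        if "NVIDIA" in line and "controller" in line:
            counts["nvidia_gpu"] += 1
        elif "Mellanox" in line and "Infiniband controller" in line:
            counts["mellanox_ib"] += 1
    return counts


def missing_devices(lspci_output: str, expected_gpus: int, expected_ib: int) -> list:
    """Return human-readable problems; an empty list means all devices were found."""
    counts = count_devices(lspci_output)
    problems = []
    if counts["nvidia_gpu"] < expected_gpus:
        problems.append(
            f"expected {expected_gpus} NVIDIA GPU(s), found {counts['nvidia_gpu']}"
        )
    if counts["mellanox_ib"] < expected_ib:
        problems.append(
            f"expected {expected_ib} Mellanox IB device(s), found {counts['mellanox_ib']}"
        )
    return problems
```

In a provisioning script, the text could come from `subprocess.run(["lspci"], capture_output=True, text=True).stdout`, and a non-empty result from `missing_devices(text, 4, 1)` could be used to abort the build.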

tbugfinder commented 4 years ago

After running yum update, all GPUs were listed properly (a kernel update was required).

Updating the drivers for the Mellanox (MLNX) device fixed the IB issue.

jithinjosepkl commented 4 years ago

@tbugfinder - As Aman indicated in the other thread, this is due to a kernel bug. The fix is already in the latest kernel for CentOS 7.7 (which you got with yum update). CentOS 7.6, unfortunately, will not have this fix (included only in RHEL 7.6).
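Since the fix depends on the kernel version, a build could also check that the running kernel is at least some known-good release. The sketch below compares RHEL/CentOS-style kernel release strings. This thread does not state which kernel release contains the fix, so the minimum passed in is a placeholder you would have to fill in yourself, and the helper names are hypothetical.

```python
import re


def kernel_tuple(release: str) -> tuple:
    """Turn a release such as '3.10.0-1062.el7.x86_64' into (3, 10, 0, 1062)."""
    # Keep only the part before the distro tag (".el7..."); this assumes
    # RHEL/CentOS naming and is not a general RPM version comparison.
    numeric_part = release.split(".el", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", numeric_part))


def kernel_at_least(running: str, minimum: str) -> bool:
    """True if the running kernel release is the same as or newer than minimum."""
    return kernel_tuple(running) >= kernel_tuple(minimum)
```

The running release is available from `platform.release()` (the same string `uname -r` prints), so a check might read `kernel_at_least(platform.release(), MINIMUM_FIXED_KERNEL)`, where `MINIMUM_FIXED_KERNEL` is the placeholder value mentioned above.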