ROCm / ROCK-Kernel-Driver

AMDGPU Driver with KFD used by the ROCm project. Also contains the current Linux Kernel that matches this base driver

KVM Support on proxmox #100

Open ljmc-github opened 3 years ago

ljmc-github commented 3 years ago

Hi,

TL;DR: I have a small Proxmox server at home and planned to use ROCm for some deep learning tasks. I have followed all the information I could find, but I still cannot use ROCm in a VM (KVM).


The server contains a Supermicro X10SDV-6C-TLN4F (Intel Xeon D-1528) and a Radeon Pro Duo Polaris.

If I understood the documentation correctly, a Broadwell Xeon v4 CPU should work and so should Polaris 10 cards.

The PCI tree from lspci -tv is:

 \-[0000:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
             +-03.0-[04-07]----00.0-[05-07]--+-08.0-[06]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon Pro WX 7100]
                                             |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
                                             \-10.0-[07]----00.0  Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon Pro WX 7100]

The intermediate levels in the tree are Broadcom PLX switches, which should also support PCIe atomics. I believe they sit directly on the GPU board and might actually be a single chip; I am not exactly sure how to interpret deep trees in lspci.

# lspci -s 04:00
04:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
# lspci -s 05:08
05:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
# lspci -s 05:10
05:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)

I first did some research and found the Virtualisation & Containers page in the documentation and the now-closed issue #26, both of which explain how to set up KVM. I followed the Ubuntu 16.04 instructions, given that I am using Proxmox, which is Debian-based.

Having had no success with ROCm I tested OpenCL (AMDGPU-PRO) and games (mesa) in other VMs and both work as expected, so the issue should not be with PCIe passthrough setup or IOMMU, but with PCIe atomics.

Following the steps in the issue, I set the necessary bits (setpci -v -d *:67c4 80.b=40) after starting the VM and then loaded amdgpu, but I cannot get ROCm to run. The steps in the documentation gave the same result. I always get:

[  169.707831] kfd kfd: skipped device 1002:67c4, PCI rejects atomics
[  170.061947] kfd kfd: skipped device 1002:67c4, PCI rejects atomics
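
For readers wondering where the `80.b=40` in the setpci call comes from, here is a minimal sketch. It assumes the GPU's PCI Express capability sits at config offset 0x58 (the `Capabilities: [58] Express` line in the lspci output posted further down this thread shows that for an Ellesmere card; other cards may differ) and uses the standard register layout: Device Control 2 at offset 0x28 inside the Express capability, with AtomicOp Requester Enable at bit 6. The helper name is made up for illustration.

```python
# Standard PCI Express register layout (same values as Linux's pci_regs.h).
PCI_EXP_DEVCTL2 = 0x28             # Device Control 2, relative to the Express capability
PCI_EXP_DEVCTL2_ATOMIC_REQ = 0x40  # AtomicOp Requester Enable (bit 6)


def atomic_req_setpci_arg(express_cap_offset: int) -> str:
    """Build a setpci register argument that sets only the AtomicOp
    Requester Enable bit, using setpci's value:mask syntax."""
    reg = express_cap_offset + PCI_EXP_DEVCTL2
    bit = PCI_EXP_DEVCTL2_ATOMIC_REQ
    return f"{reg:x}.b={bit:x}:{bit:x}"


# Assumption: Express capability at 0x58, as in the Ellesmere dump in this thread.
print(atomic_req_setpci_arg(0x58))  # prints 80.b=40:40
```

Note that the plain `80.b=40` form writes the whole byte, which also clears the other Device Control 2 bits in that byte (completion timeout settings, for example). The `value:mask` form changes only bit 6.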

Is there anything I have missed about KVM support?

Thanks in advance for your help.

fxkamd commented 3 years ago

Looks like your system does not support PCIe atomics. This is part of the PCIe 3 standard and is required for KFD to work on some GPUs, including Polaris.

ljmc-github commented 3 years ago

I checked on a fresh Ubuntu 18.04.4: ROCm installs properly and the cards are detected by rocm-smi and clinfo. So I still think this is a KVM issue...

$ /opt/rocm/bin/rocm-smi 

========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr   SCLK    MCLK    Fan     Perf  PwrCap  VRAM%  GPU%  
1    35.0c  20.243W  567Mhz  300Mhz  18.82%  auto  98.0W     0%   0%    
2    33.0c  22.133W  567Mhz  300Mhz  18.82%  auto  98.0W     0%   0%    
================================================================================
==============================End of ROCm SMI Log ==============================

$ /opt/rocm/opencl/bin/clinfo 
Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 2.0 AMD-APP (3137.0)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback 

  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:               2
  Device Type:                   CL_DEVICE_TYPE_GPU
  Vendor ID:                     1002h
  Board name:                    Ellesmere [Radeon Pro WX 7100]
  Device Topology:               PCI[ B#6, D#0, F#0 ]
  Max compute units:                 36
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               1024
  Max work group size:               256
  Preferred vector width char:           4
  Preferred vector width short:          2
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          4
  Native vector width short:             2
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               1243Mhz
  Address bits:                  64
  Max memory allocation:             14602888806
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      8
  Max image 2D width:                16384
  Max image 2D height:               16384
  Max image 3D width:                2048
  Max image 3D height:               2048
  Max image 3D depth:                2048
  Max samplers within kernel:            26564
  Max size of kernel argument:           1024
  Alignment (bits) of base address:      1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     No
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               64
  Cache size:                    16384
  Global memory size:                17179869184
  Constant buffer size:              14602888806
  Max number of constant args:           8
  Local memory type:                 Scratchpad
  Local memory size:                 65536
  Max pipe arguments:                16
  Max pipe active reservations:          16
  Max pipe packet size:              1717986918
  Max global variable size:          14602888806
  Max global variable preferred total size:  17179869184
  Max read/write image args:             64
  Max on device events:              1024
  Queue on device max size:          8388608
  Max on device queues:              1
  Queue on device preferred size:        262144
  SVM capabilities:              
    Coarse grain buffer:             Yes
    Fine grain buffer:               Yes
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     64
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue on Host properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Queue on Device properties:                
    Out-of-Order:                Yes
    Profiling :                  Yes
  Platform ID:                   0x7f64f6827cf0
  Name:                      gfx803
  Vendor:                    Advanced Micro Devices, Inc.
  Device OpenCL C version:           OpenCL C 2.0 
  Driver version:                3137.0 (HSA1.1,LC)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 1.2 
  Extensions:                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 

  Device Type:                   CL_DEVICE_TYPE_GPU
  Vendor ID:                     1002h
  Board name:                    Ellesmere [Radeon Pro WX 7100]
  Device Topology:               PCI[ B#7, D#0, F#0 ]
  Max compute units:                 36
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               1024
  Max work group size:               256
  Preferred vector width char:           4
  Preferred vector width short:          2
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          4
  Native vector width short:             2
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               1243Mhz
  Address bits:                  64
  Max memory allocation:             14602888806
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      8
  Max image 2D width:                16384
  Max image 2D height:               16384
  Max image 3D width:                2048
  Max image 3D height:               2048
  Max image 3D depth:                2048
  Max samplers within kernel:            26564
  Max size of kernel argument:           1024
  Alignment (bits) of base address:      1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     No
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               64
  Cache size:                    16384
  Global memory size:                17179869184
  Constant buffer size:              14602888806
  Max number of constant args:           8
  Local memory type:                 Scratchpad
  Local memory size:                 65536
  Max pipe arguments:                16
  Max pipe active reservations:          16
  Max pipe packet size:              1717986918
  Max global variable size:          14602888806
  Max global variable preferred total size:  17179869184
  Max read/write image args:             64
  Max on device events:              1024
  Queue on device max size:          8388608
  Max on device queues:              1
  Queue on device preferred size:        262144
  SVM capabilities:              
    Coarse grain buffer:             Yes
    Fine grain buffer:               Yes
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     64
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue on Host properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Queue on Device properties:                
    Out-of-Order:                Yes
    Profiling :                  Yes
  Platform ID:                   0x7f64f6827cf0
  Name:                      gfx803
  Vendor:                    Advanced Micro Devices, Inc.
  Device OpenCL C version:           OpenCL C 2.0 
  Driver version:                3137.0 (HSA1.1,LC)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 1.2 
  Extensions:                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 
GongYiLiao commented 3 years ago

Regarding your reference to ROCm issue #26: the kfd code has changed since then, so the modification to pci_enable_atomic_ops_to_root no longer works. A workaround that works for me is to comment out the following code block in /usr/src/amdgpu-3.8-30/amd/amdkfd/kfd_device.c (around line 567):

        /* Allow BIF to recode atomics to PCIe 3.0 AtomicOps.
         * 32 and 64-bit requests are possible and must be
         * supported.
         */
        kfd->pci_atomic_requested = amdgpu_amdkfd_have_atomics_support(kgd);
        /* Workaround: check disabled so that KFD does not skip the device.
         * if (device_info->needs_pci_atomics &&
         *     !kfd->pci_atomic_requested) {
         *         dev_info(kfd_device,
         *                  "skipped device %x:%x, PCI rejects atomics\n",
         *                  pdev->vendor, pdev->device);
         *         kfree(kfd);
         *         return NULL;
         * }
         */

The rationale is similar to ROCm issue #26: the KVM guest does not correctly recognize PCIe devices and always treats them as ordinary PCI devices, so kfd's requirement for PCIe 3.0 atomic operation capability is never fulfilled.
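
To make the rationale concrete, here is a simplified Python model of the kind of check `pci_enable_atomic_ops_to_root` performs before an endpoint may issue AtomicOps. It is a sketch, not the kernel's code. The assumed rules are that every switch port on the way up must advertise AtomicOp routing, that no upstream port may block AtomicOps on egress, and that the root port must be able to complete both the 32-bit and the 64-bit requests the driver asks for. `Port` and `atomics_reach_root` are illustrative names, not kernel APIs. The root-port values are read from the AtomicOpsCap lines in the lspci dumps posted in this thread, and the switch in the last example is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Port:
    """One bridge on the path from the GPU up to the root complex."""
    kind: str                   # "root", "upstream" or "downstream"
    routing: bool = False       # AtomicOpsCap: Routing+
    comp32: bool = False        # AtomicOpsCap: 32bit+
    comp64: bool = False        # AtomicOpsCap: 64bit+
    egress_block: bool = False  # AtomicOpsCtl: EgressBlck+


def atomics_reach_root(path: List[Port]) -> bool:
    """Return True if 32- and 64-bit AtomicOps can travel along `path`,
    listed from the GPU's parent bridge up to the root port."""
    for port in path:
        if port.kind in ("upstream", "downstream") and not port.routing:
            return False  # a switch port that cannot route AtomicOps
        if port.kind == "upstream" and port.egress_block:
            return False  # the switch blocks AtomicOps on the way out
        if port.kind == "root" and not (port.comp32 and port.comp64):
            return False  # the root port cannot complete the requests
    return True


# Host root port from this thread: "Routing- 32bit+ 64bit+ 128bitCAS+"
host = [Port("root", comp32=True, comp64=True)]
# QEMU root port seen in the guest: "Routing- 32bit- 64bit- 128bitCAS-"
guest = [Port("root")]
# Hypothetical topology: a switch without AtomicOp routing below a capable root port.
behind_switch = [Port("downstream"), Port("upstream"),
                 Port("root", comp32=True, comp64=True)]

print(atomics_reach_root(host), atomics_reach_root(guest),
      atomics_reach_root(behind_switch))
```

Under this model the guest fails because the emulated root port advertises no completer support, regardless of what the physical root port on the host can do.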

After you build amdgpu.ko with this workaround (a hack that does not solve the root cause), follow the instructions in #26 to set the PCIe bit from the host machine. Do not load amdgpu.ko at boot time; load it manually after the boot process completes. You should then be able to see your GPU in rocminfo.

The above steps work for me, but the amdgpu reset problem remains. If you start and shut down your VM a couple of times, you occasionally need to put the host machine to sleep (suspend to RAM) and wake it up so that the PCI devices are re-scanned. This PITA is something I have tried to resolve, with no luck so far.

bd4 commented 2 years ago

I am also seeing this, with KVM 5.2 running on Debian bullseye, kernel 5.10, guest Ubuntu kernel 5.8. ROCm and atomics work fine on the host system. In the guest with vfio-pci passthrough, I have tried the in-tree amdgpu module and the dkms package with ROCm 4.1 and 4.2, and they all fail with the kfd "PCI rejects atomics" error. Is there an older kernel version without this issue that I should try? Is someone working on a fix?

bd4 commented 2 years ago

Here are more details. On the host, both the PCIe root port and the AMD VGA device have AtomicOpsCap:

00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 07) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 122
        IOMMU group: 1
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 0000e000-0000efff [size=4K]
        Memory behind bridge: de000000-df0fffff [size=17M]
        Prefetchable memory behind bridge: 00000000a0000000-00000000b1ffffff [size=288M]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA+ VGA16+ MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [88] Subsystem: Micro-Star International Co., Ltd. [MSI] 6th-9th Gen Core Processor PCIe Controller (x16)
        Capabilities: [80] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee00218  Data: 0000
        Capabilities: [a0] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #2, Speed 8GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <256ns, L1 <8us
                        ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (downgraded), Width x1 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt+ ABWMgmt+
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
                        Slot #1, PowerLimit 75.000W; Interlock- NoCompl+
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
                        Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
                        Changed: MRL- PresDet+ LinkState-
                RootCap: CRSVisible-
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd-
                         AtomicOpsCap: Routing- 32bit+ 64bit+ 128bitCAS+
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Via WAKE#, ARIFwd-
                         AtomicOpsCtl: ReqEn- EgressBlck-

...

02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef) (prog-if 00 [VGA controller])
        Subsystem: Micro-Star International Co., Ltd. [MSI] Radeon RX 570 Armor 8G OC
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 154
        IOMMU group: 1
        Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Region 2: Memory at d0000000 (64-bit, prefetchable) [size=2M]
        Region 4: I/O ports at d000 [size=256]
        Region 5: Memory at df300000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at df340000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
                Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #1, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (downgraded), Width x8 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-

and on the guest, only AMD VGA has AtomicOpsCap:

00:02.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 22
        Region 0: Memory at cbc4b000 (32-bit, non-prefetchable) [size=4K]
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 0000d000-0000dfff [size=4K]
        Memory behind bridge: cba00000-cbbfffff [size=2M]
        Prefetchable memory behind bridge: 0000000840200000-00000008402fffff [size=1M]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [54] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #16, Speed 16GT/s, Width x32, ASPM L0s, Exit Latency L0s <64ns
                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (downgraded), Width x1 (downgraded)
                        TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
                SltCap: AttnBtn+ PwrCtrl+ MRL- AttnInd+ PwrInd+ HotPlug+ Surprise+
                        Slot #0, PowerLimit 0.000W; Interlock+ NoCompl-
                SltCtl: Enable: AttnBtn+ PwrFlt- MRL- PresDet- CmdCplt+ HPIrq+ LinkChg-
                        Control: AttnInd Off, PwrInd On, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
                        Changed: MRL- PresDet- LinkState-
                RootCap: CRSVisible-
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-, NROPrPrP-, LTR-
                         10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt+, EETLPPrefix+, MaxEETLPPrefixes 4
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, LN System CLS Not Supported, TPHComp-, ExtTPHComp-, ARIFwd+
                         AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
                         AtomicOpsCtl: ReqEn- EgressBlck-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-

...

08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef) (prog-if 00 [VGA controller])
        Subsystem: Micro-Star International Co., Ltd. [MSI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
        Physical Slot: 0-7
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 49
        Region 0: Memory at 830000000 (64-bit, prefetchable) [size=256M]
        Region 2: Memory at 840000000 (64-bit, prefetchable) [size=2M]
        Region 4: I/O ports at 6000 [size=256]
        Region 5: Memory at cb000000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at cb060000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
                Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #1, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (downgraded), Width x8 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-, NROPrPrP-, LTR+
                         10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt+, EETLPPrefix+, MaxEETLPPrefixes 1
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
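
Comparing long `lspci -vv` dumps like the ones above by eye is error-prone, so here is a small sketch that pulls out only the AtomicOpsCap flags per device. It assumes the output format shown in this thread: device lines starting in column 0 with a `bb:dd.f` address and no PCI domain prefix. `atomic_caps` and `SAMPLE` are illustrative names, and `SAMPLE` is an excerpt of the guest dump above.

```python
import re

# A device header such as "08:00.0 VGA compatible controller: ..."
DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ")
# The capability line, e.g. "AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-"
ATOMIC_RE = re.compile(r"AtomicOpsCap:(.*)$")


def atomic_caps(lspci_vv_text):
    """Map each PCI address to the set of AtomicOpsCap flags marked '+'."""
    caps = {}
    current = None
    for line in lspci_vv_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            continue
        found = ATOMIC_RE.search(line)
        if found and current is not None:
            tokens = found.group(1).split()
            caps[current] = {tok[:-1] for tok in tokens if tok.endswith("+")}
    return caps


# Excerpt of the guest output above (root port and passed-through GPU).
SAMPLE = """\
00:02.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port (prog-if 00 [Normal decode])
                         AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
                         AtomicOpsCtl: ReqEn- EgressBlck-
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere
                         AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
                         AtomicOpsCtl: ReqEn-
"""

for address, flags in atomic_caps(SAMPLE).items():
    print(address, sorted(flags) or "no AtomicOps support advertised")
```

On the guest excerpt this reports an empty set for the emulated root port at 00:02.0 and the 32-bit and 64-bit flags for the GPU at 08:00.0, which matches the observation that only the AMD VGA device advertises AtomicOpsCap inside the VM.
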
bd4 commented 2 years ago

If I try the workaround suggested by @GongYiLiao from #26 (patch applied to the dkms source, setpci run on the host after VM boot, then load amdgpu), I get a firmware load error:

[   64.575282] [drm] amdgpu kernel modesetting enabled.
[   64.575283] [drm] amdgpu version: 5.9.15
[   64.575405] amdgpu: CRAT table not found
[   64.575411] amdgpu: Virtual CRAT table created for CPU
[   64.575418] amdgpu: Topology: Add CPU node
[   64.578450] amdgpu 0000:08:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[   64.578522] amdgpu 0000:08:00.0: amdgpu: Fetched VBIOS from VFCT
[   64.578524] amdgpu: ATOM BIOS: MS-V34114-F3
[   64.588167] amdgpu 0000:08:00.0: BAR 2: releasing [mem 0x840000000-0x8401fffff 64bit pref]
[   64.588170] amdgpu 0000:08:00.0: BAR 0: releasing [mem 0x830000000-0x83fffffff 64bit pref]
[   64.588233] amdgpu 0000:08:00.0: BAR 0: assigned [mem 0x830000000-0x83fffffff 64bit pref]
[   64.588285] amdgpu 0000:08:00.0: BAR 2: assigned [mem 0x840000000-0x8401fffff 64bit pref]
[   64.596544] amdgpu 0000:08:00.0: amdgpu: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
[   64.596547] amdgpu 0000:08:00.0: amdgpu: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[   64.597031] [drm] amdgpu: 8192M of VRAM memory ready
[   64.597034] [drm] amdgpu: 16007M of GTT memory ready.
[   64.620261] amdgpu: [powerplay] hwmgr_sw_init smu backed is polaris10_smu
[   67.949534] amdgpu: [powerplay] 
[   69.624963] amdgpu: [powerplay] 
[   71.336870] amdgpu: [powerplay] SMU Firmware start failed!
[   71.336870] amdgpu: [powerplay] Failed to load SMU ucode.
[   71.336873] amdgpu: [powerplay] fw load failed
[   71.337620] amdgpu: smu firmware loading failed
[   71.338112] amdgpu 0000:08:00.0: amdgpu: amdgpu_device_ip_init failed
[   71.338821] amdgpu 0000:08:00.0: amdgpu: Fatal error during GPU init
[   71.341105] amdgpu: probe of 0000:08:00.0 failed with error -22

This is with the rock-dkms package from ROCm 4.1 and HWE kernel 5.8 in ubuntu guest. I will try with ROCm 4.2 dkms as well.

bd4 commented 2 years ago

I was able to get the module to load with kfd atomics support after a reboot, with the patch applied but WITHOUT running setpci on the host. However, when I try to run any ROCm code, it hangs in sched_yield.

From strace:

get_mempolicy(NULL, NULL, 0, NULL, 0)   = 0
madvise(0x2021000, 4096, MADV_DONTFORK) = 0
ioctl(3, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU, 0x7ffff95d9210) = 0
ioctl(3, AMDKFD_IOC_MAP_MEMORY_TO_GPU, 0x7ffff95d9240) = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0

So at this point I have no way to use an AMD GPU from KVM; the proposed workarounds also fail to work. Is this related to the PCIe bridge inside my KVM guest not supporting atomics? @GongYiLiao, when you got it working, did lspci show atomics supported by the PCIe bridge?

bd4 commented 2 years ago

Sorry for the noise - I now have it working completely, using setpci on the host. I guess when I was trying different things, the GPU got in a bad state, and rebooting the host fixed it. It would be great if this worked out of the box, without a hack involving patching the kernel module and running a command on host between boot and module load.

rico666 commented 2 years ago

How to cope with the ROCm crippleware

from https://en.wikipedia.org/wiki/Crippleware

While crippleware allows consumers to see the software before they buy, they are unable to test its complete functionality because of the disabled functions.

If you want to use an AMD device in a KVM VM, you need to dodge the traps AMD devs deliberately built in. You need to:

  1. do the regular PCI passthrough for your GPU device(s), like -device vfio-pci,host=0000:81:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on
  2. blacklist the loading of amdgpu in your KVM VM
  3. either in the host or the VM, set the PCIe atomics bit via setpci -v -d *:67e3 80.b=40 (where the 67e3 is in this case a Polaris 11 chip, could be different with yours)
  4. Most importantly: patch the AMD GPU driver as suggested by @GongYiLiao in https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/issues/100#issuecomment-703012241 - thus removing the crippleware-ness from the AMD code. Then compile and install (make modules && make modules_install). For the current amdgpu in linux-5.13.12, the code to disable is in lines 626-633
  5. modprobe your patched amdgpu
    [   64.630541] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
    [   64.630832] amdgpu: Virtual CRAT table created for GPU
    [   64.631094] amdgpu: Topology: Add dGPU node [0x67e3:0x1002]
    [   64.631104] kfd kfd: amdgpu: added device 1002:67e3

=> profit.

rocminfo:

*******                  
Agent 3                  
*******                  
  Name:                    gfx803                             
  Uuid:                    GPU-XX                             
  Marketing Name:          Baffin [Radeon Pro WX 4100]
...  

Of course this is just a dirty quickhack, a workaround. It nevertheless works and transforms the hardware you bought from AMD for good money from a piece of worthless junk into something usable at last. This comment is here to spare others from spending 3 days full-time working this out, and from the unbearable pain of reading comments like "It's not the fault of ROCm".
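
For anyone wondering what the 80.b=40 in step 3 actually writes: assuming config offset 0x80 is the low byte of the PCIe Device Control 2 register on this card (nothing in this thread confirms that), 0x40 is bit 6, which the PCIe spec names AtomicOp Requester Enable. A plain byte write also clears the other seven bits of that byte. The sketch below only does the bit arithmetic, and the function names are made up for illustration.

```shell
#!/bin/sh
# Bit 6 (mask 0x40) of Device Control 2 is "AtomicOp Requester Enable".
ATOMIC_REQ_EN=0x40

# atomic_req_enabled BYTE
# BYTE is two hex digits without a 0x prefix, which is how setpci prints
# a byte it has read. Succeeds (exit status 0) if bit 6 is set.
atomic_req_enabled() {
    [ $(( 0x$1 & $ATOMIC_REQ_EN )) -ne 0 ]
}

# with_atomic_req BYTE
# Prints BYTE with bit 6 set and every other bit left as it was.
with_atomic_req() {
    printf '%02x\n' $(( 0x$1 | $ATOMIC_REQ_EN ))
}
```

setpci also accepts a value:mask form and capability-relative addresses, so something like CAP_EXP+28.w=0040:0040 should change only that one bit, though nobody in this thread has tested that variant.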

fxkamd commented 2 years ago

@rico666 : I don't think the term "crippleware" applies to open-source software. If there are traps, there is nothing deliberate about them. If we had infinite time, we'd support every feature anyone could ever wish for. We don't have infinite time. But we certainly don't waste the limited time we do have to lay elaborate traps to frustrate our users.

rico666 commented 2 years ago

@rico666 : I don't think the term "crippleware" applies to open-source software.

You might want to read the linked WP article in its entirety.

From an Open Source software providers perspective, there is the model of open core which includes a feature-limited version of the product and an open-core version. The feature-limited version can be used widely; this approach is used by products like MySQL and Eucalyptus.

However, I admit that my comment must be seen in light of 3 lost days full-bridge-rectifying a situation that was messed up by the devs of amdgpu without any need. You see - ROCm in the year 2017 never had this PCIe atomics problem, at least not with this hardware. Whoever coded that stuff in made it worse. I call that crippleware. The position of "oh my all hardware is bad, we do not support hardware that has quirks (in our opinion, where it even may not have quirks, but we have adopted that view so we assume otherwise)" is a very bad one. Unfortunately it seems the responsible lead developer(s)/manager(s) are unwilling or unable to teach their dev team correct R&D values.

I could now assemble the statements from several issues here on GitHub - some of them closed already - about how this can't be solved, how the AMDGPU can't set this or that bit (yeah - but root can?) and how it would hurt performance otherwise.

If I were the responsible lead developer (and team: pray I won't take that job), then the design question would be "What is more performance? 0% or 20% of the potential 100%?"

=> Make it work first and always, tweak later.

You know, for someone who sees and treats OpenCL as the unwanted child (Nvidia), their stuff "just works". If what AMD delivers in the OpenCL area is what an "OpenCL protagonist" should deliver, then good night.

GongYiLiao commented 2 years ago

I personally don't have any evidence suggesting that AMD intentionally "cripples" the GFX8 customers to force them to buy a newer generation of Radeons. Most frustrated AMD customers may just vote with their feet when the GPU market eventually moves back to normal, assuming cryptocurrency mining eventually becomes unprofitable unless using customized FPGAs. When that time comes, AMD will take the hit.

To me, the more severe problems of the ROCm platform, regardless of whether it runs on bare metal or in a virtual machine, are

  1. AMD hasn't given a clear roadmap for when a specific generation of Radeon will lose support (search gfx803 on ROCm Issues)
  2. AMD does not communicate well with the ROCm user community, at least on GitHub. Many issues are promptly closed by "ROCMSupport" (I am not sure if this account belongs to an AMD employee, but it looks likely to me) with comments such as "We don't support it due to limited resources" and no actionable plan or meaningful explanation whatsoever.
  3. AMD's ROCm puts its major focus and resources on HIP, which seems to be a second-class re-implementation or compatibility layer of CUDA, in order to gain some ground in various market-dominating packages such as TensorFlow or PyTorch. I am not sure if that will work out eventually, but in pursuing this route, AMD has seemingly abandoned OpenCL already. However, my personal impression is that most computation-oriented customers buy Radeon for OpenCL, not for a CUDA copycat.
  4. Dropping support for products that are not old. I have been hit by this twice: I bought a HD 7950 (Tahiti) in 2013 and the OpenCL support via fglrx was dropped in 2015. Now, in early 2021, ROCm drops support for the RX 580 that I bought in 2019.

Above are my own rants on how AMD's approach to Linux customers leads to some understandable anger.

Let's circle back to the issue of using Radeon for computation on a KVM guest. I don't think AMD will solve the PCIe atomics issue (if AMD deems this an issue at all) any time soon, as gfx8 is a semi-deprecated product already. The only solution I can come up with in the near future is just the dirty workaround I found by running `dmesg | grep kfd` and searching for where the error message appears in the kernel module source code. Therefore, the following approach may be suitable for the long term:

  1. Debian/Ubuntu users can build their own kernel package by putting this atomic-check-disabling patch under debian/patches of a kernel source package, and prepare a script to set the PCIe status after the KVM guest starts but before loading amdgpu.ko
  2. Don't buy AMD again, seriously.
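
The "script to set PCIe status" from point 1 could look roughly like this. It is only a sketch: it reuses the setpci command posted earlier in this thread, GPU_ID and RUN are hypothetical variable names, and it assumes amdgpu is blacklisted (for example via a blacklist line under /etc/modprobe.d/) so that nothing loads the module before the script runs.

```shell
#!/bin/sh
# PCI device ID of the GPU; 67e3 is the example from this thread, adjust
# it to your own card.
GPU_ID=${GPU_ID:-67e3}
# Set RUN=echo for a dry run that only prints the commands.
RUN=${RUN:-}

# Set the atomics bit first, then load the (patched) driver.
# Stops before modprobe if setpci fails.
enable_atomics_then_load() {
    $RUN setpci -v -d "*:$GPU_ID" 80.b=40 || return 1
    $RUN modprobe amdgpu
}

# Usage (as root), after the guest has started:
#   enable_atomics_then_load
```

Whether this runs on the host or in the guest depends on which variant of the workaround worked for you; both have been reported above.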

pikakolendo02 commented 2 years ago

2. Don't buy AMD again, seriously.

I have spent days finding out why some OpenCL programs fail in a Linux VM. The answer is, ROCm sucks. Ironically, when I pass the GPU through to a Windows VM with the official drivers (including OpenCL), they work perfectly.

kotee4ko commented 8 months ago

The last reason I'm still trying to run ML on Radeon is that I believe Radeon hardware has more performance for these tasks than Nvidia.

But the totally unusable software ecosystem makes it quite useless...

briansp2020 commented 2 months ago

@ROCmSupport, does AMD plan to support running ROCm in a VM? It would be useful to be able to do this.