ROCm / ROCK-Kernel-Driver

AMDGPU driver with KFD used by the ROCm project. Also contains the current Linux kernel that matches this base driver.

Can RX-550 be used for ROCm with ATS #141

Closed: jack-chen1688 closed this issue 1 month ago

jack-chen1688 commented 2 years ago

Hello,

I want to study the effect of ATS on compute performance. In drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c, I saw the following:

    int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm)
    {
            bool pte_support_ats = (adev->asic_type == CHIP_RAVEN);

Does this mean that only GPU of CHIP_RAVEN can use ATS in guest OS?

I have a Lexa PRO GPU card and can run some HIP code in a VM now. From the lspci info, it has the ATS capability. See the info below.

    2d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7) (prog-if 00 [VGA controller])
        Subsystem: Hewlett-Packard Company Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]
        Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 103
        Region 0: Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Region 2: Memory at f0000000 (64-bit, prefetchable) [size=2M]
        Region 4: I/O ports at f000 [size=256]
        Region 5: Memory at fce00000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
            Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
            Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
            DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
            DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                MaxPayload 256 bytes, MaxReadReq 512 bytes
            DevSta: CorrErr+ NonFatalErr+ FatalErr- UnsupReq+ AuxPwr- TransPend-
            LnkCap: Port #3, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <1us
                ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
            LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
            LnkSta: Speed 2.5GT/s (downgraded), Width x1 (downgraded)
                TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
            DevCap2: Completion Timeout: Not Supported, TimeoutDis-, NROPrPrP-, LTR+
                10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt+, EETLPPrefix+, MaxEETLPPrefixes 1
                EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                FRS-
                AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
            DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                AtomicOpsCtl: ReqEn+
            LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                Compliance De-emphasis: -6dB
            LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
                EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
            Address: 0000000000000000  Data: 0000
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150 v2] Advanced Error Reporting
            UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
            UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
            UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
            CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
            CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
            AERCap: First Error Pointer: 14, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
            HeaderLog: 40001010 000000ff e0000040 00000000
        Capabilities: [200 v1] Resizable BAR <?>
        Capabilities: [270 v1] Secondary PCI Express
            LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
            LaneErrStat: 0
        Capabilities: [2b0 v1] Address Translation Service (ATS)
            ATSCap: Invalidate Queue Depth: 00
            ATSCtl: Enable+, Smallest Translation Unit: 00
        Capabilities: [2c0 v1] Page Request Interface (PRI)
            PRICtl: Enable- Reset-
            PRISta: RF- UPRGI- Stopped+
            Page Request Capacity: 00000020, Page Request Allocation: 00000000
        Capabilities: [2d0 v1] Process Address Space ID (PASID)
            PASIDCap: Exec+ Priv+, Max PASID Width: 10
            PASIDCtl: Enable- Exec- Priv-
        Capabilities: [320 v1] Latency Tolerance Reporting
            Max snoop latency: 1048576ns
            Max no snoop latency: 1048576ns
        Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
            ARICap: MFVC- ACS-, Next Function: 1
            ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [370 v1] L1 PM Substates
            L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                PortCommonModeRestoreTime=0us PortTPowerOnTime=170us
            L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- T_CommonMode=0us LTR1.2_Threshold=32768ns
            L1SubCtl2: T_PwrOn=170us
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu

Is there a way to enable this card to use ATS in the guest OS? If not, could you suggest which cards I can use to try out the PCIe ATS feature? Thanks.

fxkamd commented 2 years ago

Do you mean ATS/PRI or ATC?

ATC is an address translation cache, which allows the GPU to cache IOMMU address translations. This can sometimes improve performance under virtualization and is mostly transparent to the driver.

ATS/PRI is a system where the GPU uses the IOMMUv2 to translate virtual addresses to physical addresses using CPU page tables. We use the latter on our Raven and older APUs on bare-metal. I'm not sure if this also works under virtualization. ATS/PRI also only works with AMD IOMMUv2, so you'd need an AMD CPU.

We have not pursued ATS/PRI with AMD discrete GPUs. Most of our discrete GPUs don't support this mode in the first place. We achieve a similar programming model on recent GPUs with HMM. Unfortunately this doesn't work on Lexa because of its limited virtual address space, which doesn't match the size of the CPU virtual address space. You'd need a Vega10 (GFXv9) or later GPU for HMM to be useful.
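As a side note, a quick way to see which of these features a device even advertises is to look for the corresponding PCIe extended capabilities. A minimal kernel-side sketch (report_ats_caps is a hypothetical helper, not code from this driver; pci_find_ext_capability and the PCI_EXT_CAP_ID_* constants are the standard kernel API):

    /* Hypothetical helper: report which of the PCIe extended capabilities
     * mentioned above a device advertises. ATC caching needs the ATS
     * capability; the ATS/PRI programming model additionally needs PRI
     * and PASID. */
    #include <linux/pci.h>

    static void report_ats_caps(struct pci_dev *pdev)
    {
            bool ats   = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ATS);
            bool pri   = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PRI);
            bool pasid = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_PASID);

            pci_info(pdev, "ATS:%d PRI:%d PASID:%d\n", ats, pri, pasid);
    }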

jack-chen1688 commented 2 years ago

Hi fxkamd,

In the ATS spec, the ATC is used by ATS. Below is from the ATS spec, chapter 2, describing what ATS does: "A TA does translations. An ATC can cache those translations. If an ATC is separated from the TA by PCIe, the memory request from an ATC will need to be able to indicate if the address in the transaction is translated or not. The modifications to the memory transactions are described in this section, as are the transactions that are used to communicate translations between a remote ATC and a central TA."

So it seems to me the ATC needs to use ATS to be able to cache IOMMU address translations, right? Or is there another way to cache IOMMU address translations?

My PC has an AMD CPU. When I load amdgpu.ko with modprobe amdgpu, I saw the following kernel message: AMD-Vi: AMD IOMMUv2 loaded and initialized.

fxkamd commented 2 years ago

I guess my question is, which of these two aspects of ATS and ATC are you interested in:

  1. Performance and caching of GPA (guest physical address) to SPA (system physical address) translations under virtualization
  2. GVA (guest virtual address) to SPA translation using PASIDs and CPU page tables in bare metal

(1) is mostly transparent to the driver and to applications. It's purely a performance optimization for IO virtualization with hardware passed into a guest VM. (2) affects the programming model for ROCm applications because it makes any memory mapping in the application virtual address space implicitly accessible by the GPU.

The Raven-specific code you pointed to is for (2). It will not work on Lexa because its GPU virtual addressing works fundamentally differently.

jack-chen1688 commented 2 years ago

Can (1) work on Lexa? I have the GPU passed through to the guest OS and can run HIP programs in the guest OS. The ATS enable bit for the device (ATSCtl: Enable+ in the lspci info above) was set during IOMMU init on the host OS, in attach_device of drivers/iommu/amd/iommu.c. The following code enables the ATS bit in PCI config space for Lexa:

    else if (amd_iommu_iotlb_sup && pci_enable_ats(pdev, PAGE_SHIFT) == 0) {
            dev_data->ats.enabled = true;
            dev_data->ats.qdep    = pci_ats_queue_depth(pdev);
    }

I thought this would be enough, but when I connect a PCIe analyzer to trace ATS translation requests, I cannot see the device sending any address translation requests to the host side. So it seems the ATC on Lexa still does not take effect, even after the IOMMU enabled ATS in the PCI configuration register.
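To double-check the configuration-space side, here is a minimal userspace sketch I can run (assuming the BDF 0000:2d:00.0 and the ATS capability offset 0x2b0 from the lspci dump above, a little-endian host, and root privileges) that reads the ATS Control register directly and checks its Enable bit:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* ATS extended capability at 0x2b0 (from lspci); the ATS Control
             * register sits at offset +0x06, Enable is bit 15, STU is bits 4:0. */
            const char *cfg = "/sys/bus/pci/devices/0000:2d:00.0/config";
            const off_t ats_ctrl = 0x2b0 + 0x06;
            uint16_t ctrl;
            int fd = open(cfg, O_RDONLY);

            if (fd < 0 || pread(fd, &ctrl, sizeof(ctrl), ats_ctrl) != sizeof(ctrl)) {
                    perror("read ATS control");
                    return 1;
            }
            printf("ATS Control = 0x%04x, Enable = %s, STU = %u\n",
                   ctrl, (ctrl & 0x8000) ? "yes" : "no", ctrl & 0x1f);
            close(fd);
            return 0;
    }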

I wonder whether I missed some steps to enable ATS for Lexa. Any suggestions?
Is there any discrete GPU that can be used for ROCm with ATS, for either aspect (1) or (2) that you mentioned, or even without a guest VM? Thanks.

fxkamd commented 2 years ago

I got some more information from a colleague who works on our virtualization support:

For the whole ATS flow to work properly, the root complex must also support ATS, not only the PCIe device (endpoint) side; the TA (translation agent) is on the RC side. It is more complicated on a virtualized platform. Some hosts have internal configuration to enable/disable ATS (e.g., ESXi), so he needs to check the hypervisor settings as well. I suggest he verify on a bare-metal configuration first.

jack-chen1688 commented 2 years ago

Thanks a lot for the information you provided.

I got the following from the AMD IOMMU spec: "Translation Agent is a PCI-SIG term to refer to the IOMMU table walker." It seems that AMD's IOMMU is the TA.

I debugged and checked the following initialization code for the IOMMU:

    pci_read_config_dword(iommu->dev, cap_ptr + MMIO_CAP_HDR_OFFSET,
                          &iommu->cap);
    if (!(iommu->cap & (1 << IOMMU_CAP_IOTLB)))
            amd_iommu_iotlb_sup = false;

IOMMU_CAP_IOTLB is bit 24 of the IOMMU's capability header, and it is set on my setup. Its meaning in the IOMMU spec is: "24 IotlbSup: IOTLB Support. RO. Reset Xb. Indicates the IOMMU will support ATS translation request messages as defined in PCI ATS 1.0 or later."

So it seems to me the IOMMU of my setup supports ATS.

The kernel also sets the enable bit of the ATS control register for the Lexa in attach_device, at the place below:

    else if (amd_iommu_iotlb_sup && pci_enable_ats(pdev, PAGE_SHIFT) == 0) {
            dev_data->ats.enabled = true;
            dev_data->ats.qdep    = pci_ats_queue_depth(pdev);
    }

All of the above is in the bare-metal configuration. Is there anything else to configure on Lexa to enable its ATS functionality?

> I suggest he verify on a bare-metal configuration first.

Any suggestions on how to verify the bare-metal configuration? For example, could running some programs make the GPU send ATS requests for address translation caching?

Thanks a lot for your help!

fxkamd commented 2 years ago

You could test this under bare metal by booting with the IOMMU in device-isolation mode. In this mode each device gets isolated in its own DMA address space, so it can only access system memory that was explicitly DMA-mapped by the driver. It uses the IOMMU to translate device DMA addresses to system physical addresses. It should use the ATC if it's supported by the IOMMU and the device. This combination of kernel parameters should work regardless of how your kernel was configured:

    iommu=nopt amd_iommu=force_isolation
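To illustrate what force_isolation means for the addresses the device sees, here is a minimal sketch of a hypothetical out-of-tree test module (iova_demo is not part of this driver): with iommu=nopt amd_iommu=force_isolation, the dma_addr_t returned by the DMA API is an IOVA allocated by the IOMMU driver, so it should typically differ from the buffer's physical address.

    /* Hypothetical out-of-tree test module: show that the DMA address handed
     * to the device is an IOMMU-translated IOVA, not the physical address. */
    #include <linux/module.h>
    #include <linux/pci.h>
    #include <linux/slab.h>
    #include <linux/io.h>
    #include <linux/dma-mapping.h>

    static struct pci_dev *pdev;
    static void *buf;
    static dma_addr_t dma;

    static int __init iova_demo_init(void)
    {
            phys_addr_t phys;

            /* Grab the first AMD/ATI device; a real test would match the GPU's BDF. */
            pdev = pci_get_device(PCI_VENDOR_ID_ATI, PCI_ANY_ID, NULL);
            if (!pdev)
                    return -ENODEV;

            buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
            if (!buf) {
                    pci_dev_put(pdev);
                    return -ENOMEM;
            }

            dma = dma_map_single(&pdev->dev, buf, PAGE_SIZE, DMA_BIDIRECTIONAL);
            if (dma_mapping_error(&pdev->dev, dma)) {
                    kfree(buf);
                    pci_dev_put(pdev);
                    return -EIO;
            }

            phys = virt_to_phys(buf);
            /* In force_isolation mode these two values should not match. */
            dev_info(&pdev->dev, "dma (IOVA) = %pad, phys = %pa\n", &dma, &phys);
            return 0;
    }

    static void __exit iova_demo_exit(void)
    {
            dma_unmap_single(&pdev->dev, dma, PAGE_SIZE, DMA_BIDIRECTIONAL);
            kfree(buf);
            pci_dev_put(pdev);
    }

    module_init(iova_demo_init);
    module_exit(iova_demo_exit);
    MODULE_LICENSE("GPL");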

jack-chen1688 commented 2 years ago

Thanks a lot for your suggestion. I have tried it out on my setup with the Lexa PRO GPU, and here is some info.

  1. Kernel parameters from /proc/cmdline: BOOT_IMAGE=/boot/vmlinuz-5.15.77+ root=UUID=af2eabd5-ad67-4db2-8342-13def92d087c ro 3 quiet splash iommu=nopt amd_iommu=force_isolation vt.handoff=7

  2. The host sends out an ATS Invalidate Request during the device_flush_iotlb call, and the GPU responds with an Invalidate Completion. These PCIe packets were observed on the PCIe protocol analyzer; both request and completion packets follow the PCIe spec nicely.

From PCIe spec 5.0, section 10.3 (ATS invalidation): functions that do not support ATS will treat an Invalidate Request as a UR (Unsupported Request); functions supporting ATS are required to send an Invalidate Completion in response to an Invalidate Request, independent of whether the Bus Master Enable bit is set.

So based on this, the GPU supports ATS; otherwise it would treat the request as unsupported and would not send out an Invalidate Completion. Also, the host can send out ATS Invalidate Requests, which seems to further confirm that the IOMMU on the host supports ATS.

  3. Ran VectorAdd from https://github.com/ROCm-Developer-Tools/HIP-Examples successfully, without problems. I can see lots of memory read/write operations on the PCIe analyzer while the example program executes, but there are no address translation requests (address type field of the PCIe packet header set to 01b) sent from the GPU to the host side in the PCIe protocol analyzer trace.

From the info above, both the IOMMU and the device support ATS, but somehow the ATC was not utilized by the Lexa PRO GPU for the VectorAdd program.

Could it be that the VectorAdd program won't trigger ATC? Or maybe the GPU requires some extra configuration to turn on ATC?

fxkamd commented 2 years ago

You should see address translation requests for system memory accesses from the GPU. ATS should result in fewer such requests because the GPU is caching the translations. If you're not seeing any translation requests, then we're missing something. Without a translation from device DMA address to system physical address, system memory accesses would use completely wrong addresses.

jack-chen1688 commented 2 years ago

Here is more info I gathered.

  1. Traced the IOVA -> physical address mapping created for the GPU device in iommu_v1_map_page, based on the domain ID:

    [90630.774835] AMD-Vi: AMD-Vi: iommu_v1_map_page domain id 28 iova df520000 pdaddr 1142e0000 size 0x20000
    [90630.777518] AMD-Vi: AMD-Vi: iommu_v1_map_page domain id 28 iova df4f0000 pdaddr 124590000 size 0x8000
    [90630.777674] AMD-Vi: AMD-Vi: iommu_v1_map_page domain id 28 iova df4ef000 pdaddr 17bb2b000 size 0x1000
    ...
    [90630.946663] AMD-Vi: AMD-Vi: iommu_v1_map_page domain id 28 iova de6e0000 pdaddr 19234a000 size 0x2000
    [90630.946668] AMD-Vi: AMD-Vi: iommu_v1_map_page domain id 28 iova de6e2000 pdaddr 19234c000 size 0x2000
    [90630.946669] AMD-Vi: AMD-Vi: iommu_v1_map_page domain id 28 iova de6e4000 pdaddr 19234e000 size 0x2000

In the above, the IOVA addresses are 8 hex digits and the physical addresses are 9 hex digits. I think the physical addresses correspond to the system physical addresses you mentioned above.

  2. Captured the PCIe trace and found that all the DMA writes from the GPU go to IOVA addresses, for example df520000, de6e2000, etc. These addresses are all untranslated addresses.

From the ATS spec, if the address translation cache is used, address translation requests will be sent with IOVA addresses. The TA will return the translation of the address as a read completion, and the ATC will cache the IOVA -> physical address mapping on the GPU side. The GPU will then send its DMA requests using the translated address, i.e. the physical address, with the address type field set to translated, so the IOMMU does not need to do the translation using the page table on the host side.

On my setup, from the PCIe trace, the DMA requests all use the untranslated IOVA addresses df520000, de6e2000, etc., which the IOMMU has to translate using the page table on the host side. So it seems that the ATC on the GPU does not take effect at all.
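For reference, this is the check I am applying to the captured packets: the address type (AT) field is bits 11:10 of DWORD 0 of a memory-request TLP (00b untranslated, 01b translation request, 10b translated). Below is a minimal standalone sketch of that decoding; tlp_at is just an illustrative helper and the sample DWORD 0 values are made up, not from my capture:

    #include <stdint.h>
    #include <stdio.h>

    /* Decode the Address Type (AT) field, bits 11:10 of DWORD 0 of a
     * PCIe memory-request TLP header. */
    static const char *tlp_at(uint32_t dw0)
    {
            switch ((dw0 >> 10) & 0x3) {
            case 0:  return "untranslated";
            case 1:  return "translation request";
            case 2:  return "translated";
            default: return "reserved";
            }
    }

    int main(void)
    {
            /* Made-up header DWORDs for illustration; real values come
             * from the protocol analyzer capture. */
            uint32_t samples[] = { 0x60000020, 0x60000420, 0x60000820 };
            unsigned int i;

            for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
                    printf("0x%08x -> AT = %s\n", samples[i], tlp_at(samples[i]));
            return 0;
    }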

What do you think? If you need me to gather more info about my setup, please let me know. Thanks.

ppanchad-amd commented 3 months ago

@jack-chen1688 Do you still need assistance with this ticket? If not, please close the ticket. Thanks!

ppanchad-amd commented 1 month ago

@jack-chen1688 Closing ticket. Please feel free to re-open ticket if you still need assistance. Thanks!