Do you mean ATS/PRI or ATC?
ATC is an address translation cache, which allows the GPU to cache IOMMU address translations. This can sometimes improve performance under virtualization and is mostly transparent to the driver.
ATS/PRI is a system where the GPU uses the IOMMUv2 to translate virtual addresses to physical addresses using CPU page tables. We use the latter on our Raven and older APUs on bare-metal. I'm not sure if this also works under virtualization. ATS/PRI also only works with AMD IOMMUv2, so you'd need an AMD CPU.
We have not pursued ATS/PRI with AMD discrete GPUs. Most of our discrete GPUs don't support this mode in the first place. We achieve a similar programming model on recent GPUs with HMM. Unfortunately this doesn't work on Lexa because of its limited virtual address space, which doesn't match the size of the CPU virtual address space. You'd need a Vega10 (GFXv9) or later GPU for HMM to be useful.
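For reference, the ATS/PRI path is gated on the ASIC type in the driver; a simplified sketch of that check (the helper name is made up, the condition is the same one as in the amdgpu_vm.c snippet quoted in the original post):

/* Illustrative only: the helper name is hypothetical, but the condition is the
 * same one used in amdgpu_vm_make_compute() (quoted later in this thread).
 * GFXv9 and newer parts are expected to use HMM for this programming model
 * instead of ATS/PRI. */
static bool amdgpu_vm_wants_ats(struct amdgpu_device *adev)
{
        return adev->asic_type == CHIP_RAVEN;
}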
Hi fxkamd,
In the ATS spec, the ATC is used by ATS. Below is from chapter 2 of the ATS spec; this is what ATS does: A TA does translations. An ATC can cache those translations. If an ATC is separated from the TA by PCIe, the memory request from an ATC will need to be able to indicate if the address in the transaction is translated or not. The modifications to the memory transactions are described in this section, as are the transactions that are used to communicate translations between a remote ATC and a central TA.
So it seems to me that the ATC needs to use ATS to be able to cache IOMMU address translations, right? Or is there another way to cache IOMMU address translations?
My PC has an AMD CPU. When I load amdgpu.ko with modprobe amdgpu, I see the following kernel message: AMD-Vi: AMD IOMMUv2 loaded and initialized.
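As a side note, my understanding is that with IOMMUv2 loaded, a kernel driver can query per-device ATS/PRI/PASID support through the amd-iommu interface; a rough sketch (the helper name is made up; I believe this is roughly what amdkfd does on APUs):

#include <linux/amd-iommu.h>
#include <linux/pci.h>

/* Hypothetical helper (the name is made up): ask the AMD IOMMUv2 layer whether
 * this PCI device has the ATS/PRI/PASID features needed for the ATS/PRI mode. */
static bool gpu_has_ats_pri_pasid(struct pci_dev *pdev)
{
        struct amd_iommu_device_info info;
        const u32 required = AMD_IOMMU_DEVICE_FLAG_ATS_SUP |
                             AMD_IOMMU_DEVICE_FLAG_PRI_SUP |
                             AMD_IOMMU_DEVICE_FLAG_PASID_SUP;

        if (amd_iommu_device_info(pdev, &info))
                return false;

        return (info.flags & required) == required;
}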
I guess my question is, which of these two aspects of ATS and ATC are you interested in:
(1) is mostly transparent to the driver and to applications. It's purely a performance optimization for IO virtualization with hardware passed into a guest VM.
(2) affects the programming model for ROCm applications because it makes any memory mapping in the application virtual address space implicitly accessible by the GPU.
The Raven-specific code you pointed to is for (2). It will not work on Lexa because its GPU virtual addressing works fundamentally differently.
Can (1) work on Lexa? I have the GPU passed through to a guest OS and can run HIP programs in the guest OS. The ATS enable bit (ATSCtl: Enable+ in the lspci info above) for the device was set during IOMMU init on the host OS, in attach_device of drivers/iommu/amd/iommu.c. The following code enables the ATS bit in PCI config space for Lexa:

else if (amd_iommu_iotlb_sup && pci_enable_ats(pdev, PAGE_SHIFT) == 0) {
        dev_data->ats.enabled = true;
        dev_data->ats.qdep    = pci_ats_queue_depth(pdev);
}
I thought this would be enough, but when I connect a PCIe analyzer to trace ATS translation requests, I cannot see the device sending any address translation requests to the host side. So it seems to me that the ATC on Lexa still does not take effect even after the IOMMU enabled ATS in the PCI configuration register.
I wonder whether I missed some steps to enable ATS for Lexa. Any suggestions?
Is there any discrete GPU that can be used with ROCm and ATS, for either aspect (1) or (2) that you mentioned, or even without a guest VM? Thanks.
I got some more information from a colleague who works on our virtualization support:
For the whole ATS flow to work properly, the root complex also has to support ATS, not only the PCIe device (endpoint) side. The TA (translation agent) is on the RC side. It's more complicated on a virtualized platform; some hosts have internal configuration to enable/disable ATS (e.g. ESXi), so he needs to check the hypervisor settings as well. I suggest he verify on the bare-metal configuration first.
Thanks a lot for the information you provided.
I got the following from the AMD IOMMU spec: Translation Agent is a PCI-SIG term that refers to the IOMMU table walker. So it seems that AMD's IOMMU is the TA.
I debugged and checked the following IOMMU initialization code:

pci_read_config_dword(iommu->dev, cap_ptr + MMIO_CAP_HDR_OFFSET,
                      &iommu->cap);

if (!(iommu->cap & (1 << IOMMU_CAP_IOTLB)))
        amd_iommu_iotlb_sup = false;
IOMMU_CAP_IOTLB is bit 24 of the IOMMU's capability header, and it is set for my setup. Its meaning in the IOMMU spec is below: 24 IotlbSup: IOTLB Support. RO. Reset Xb. Indicates the IOMMU will support ATS translation request messages as defined in PCI ATS 1.0 or later.
So it seems that the IOMMU in my setup supports ATS.
The kernel also sets the enable bit of the ATS control register for the Lexa in attach_device at the place below:

else if (amd_iommu_iotlb_sup && pci_enable_ats(pdev, PAGE_SHIFT) == 0) {
        dev_data->ats.enabled = true;
        dev_data->ats.qdep    = pci_ats_queue_depth(pdev);
}
All of the above is in the bare-metal configuration. Is there anything else needed to configure Lexa to enable its ATS functionality?
I suggest he verify on the bare-metal configuration first.
Any suggestions on how to verify the bare-metal configuration? For example, could running some program make the GPU send ATS requests for address translation caching?
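For completeness, a minimal userspace sketch that reads the ATS capability directly from config space, independent of lspci (assuming the GPU is at 0000:2d:00.0 and it runs as root so the full 4 KB of config space is readable):

#include <stdio.h>
#include <stdint.h>

#define CFG_PATH "/sys/bus/pci/devices/0000:2d:00.0/config"
#define EXT_CAP_START 0x100
#define EXT_CAP_ID_ATS 0x000f /* PCIe extended capability ID for ATS */

int main(void)
{
        uint8_t cfg[4096];
        size_t len;
        unsigned int pos = EXT_CAP_START;
        FILE *f = fopen(CFG_PATH, "rb");

        if (!f) {
                perror("fopen");
                return 1;
        }
        len = fread(cfg, 1, sizeof(cfg), f);
        fclose(f);
        if (len <= EXT_CAP_START) {
                fprintf(stderr, "only %zu bytes readable; run as root\n", len);
                return 1;
        }

        /* Walk the extended capability list: each header is a 32-bit word with
         * the capability ID in bits 15:0 and the next offset in bits 31:20. */
        while (pos && pos + 8 <= len) {
                uint32_t hdr = cfg[pos] | cfg[pos + 1] << 8 |
                               cfg[pos + 2] << 16 | (uint32_t)cfg[pos + 3] << 24;

                if ((hdr & 0xffff) == EXT_CAP_ID_ATS) {
                        uint16_t cap  = cfg[pos + 4] | cfg[pos + 5] << 8;
                        uint16_t ctrl = cfg[pos + 6] | cfg[pos + 7] << 8;

                        printf("ATS capability at 0x%03x: qdep=%u enable=%u stu=%u\n",
                               pos, cap & 0x1f, !!(ctrl & 0x8000), ctrl & 0x1f);
                        return 0;
                }
                pos = hdr >> 20;
        }
        fprintf(stderr, "no ATS capability found\n");
        return 1;
}

On this card it should report the capability at 0x2b0 with enable=1, matching the lspci output above.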
Thanks a lot for your help!
You could test this under bare metal by booting with the IOMMU in device-isolation mode. In this mode each device gets isolated in its own DMA address space, so it can only access system memory that was explicitly DMA-mapped by the driver. It uses the IOMMU to translate device DMA addresses to system physical addresses. It should use the ATC if it's supported by the IOMMU and the device. This combination of kernel parameters should work regardless of how your kernel was configured:

iommu=nopt amd_iommu=force_isolation
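One way to confirm the device really ended up in a translated (non-passthrough) domain after rebooting, assuming your kernel exposes the iommu group type attribute and the GPU is still at 0000:2d:00.0, is a small check like this:

#include <stdio.h>

int main(void)
{
        char buf[64];
        FILE *f = fopen("/sys/bus/pci/devices/0000:2d:00.0/iommu_group/type", "r");

        if (!f) {
                perror("open iommu_group/type");
                return 1;
        }
        if (!fgets(buf, sizeof(buf), f)) {
                perror("read iommu_group/type");
                fclose(f);
                return 1;
        }
        fclose(f);
        /* Expect "DMA" or "DMA-FQ" with force_isolation; "identity" would mean
         * the device is in a passthrough domain and no translation happens. */
        printf("default domain type: %s", buf);
        return 0;
}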
Thanks a lot for your suggestion. I have tried it out on my setup with the Lexa PRO GPU, and here is some info.
Kernel parameters from /proc/cmdline:

BOOT_IMAGE=/boot/vmlinuz-5.15.77+ root=UUID=af2eabd5-ad67-4db2-8342-13def92d087c ro 3 quiet splash iommu=nopt amd_iommu=force_isolation vt.handoff=7
The host sends out an ATS invalidate request during the device_flush_iotlb call, and the GPU responds with an invalidate completion. The PCIe packets are observed on the PCIe protocol analyzer. Both the request and completion packets follow the PCIe spec nicely.
From the PCIe 5.0 spec, section 10.3 (ATS Invalidation): Functions that do not support ATS will treat an Invalidate Request as a UR (Unsupported Request). Functions supporting ATS are required to send an Invalidate Completion in response to an Invalidate Request, independent of whether the Bus Master Enable bit is Set or not.
So based on this info, the GPU supports ATS; otherwise it would treat the request as unsupported and would not send out an invalidate completion. Also, the host can send out ATS invalidate requests, which seems to further confirm that the IOMMU on the host supports ATS.
From the info above, both the IOMMU and the device support ATS. But somehow the ATC was not utilized by the Lexa PRO GPU for the VectorAdd program.
Could it be that the VectorAdd program won't trigger ATC? Or maybe the GPU requires some extra configuration to turn on ATC?
You should see address translation requests for system memory accesses from the GPU. ATS should result in fewer such requests because the GPU is caching the translations. If you're not seeing any translation requests, then we're missing something. Without a translation from device DMA address to system physical address, system memory accesses would use completely wrong addresses.
Here is more info I gathered.
In the trace above, IOVA addresses are 8 hex digits and physical addresses are 9 hex digits. I think the physical addresses correspond to the system physical addresses you mentioned above.
From the ATS spec, if the address translation cache is used, address translation requests will be sent with IOVA addresses. The TA will return the translation of the address as a read completion, and the ATC will cache the IOVA->physical mapping on the GPU side. The GPU will then send its DMA requests using the translated address, i.e. the physical address, with the address type set to Translated, so the IOMMU does not need to translate it using the page table on the host side.
For my setup, from the PCIe trace, the DMA requests all use the untranslated IOVA addresses df520000 and de6e2000, which the IOMMU has to translate using the page table on the host side. So it seems that the ATC on the GPU does not take effect at all.
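To make this concrete, what I am keying on in the analyzer trace is the 2-bit AT (Address Type) field of each memory request TLP; a small illustrative decoder with the encodings from the PCIe/ATS spec:

/* Illustrative only: classify the 2-bit AT (Address Type) field of a PCIe
 * memory request TLP, per the PCIe/ATS spec. In my trace every DMA request
 * from the GPU shows AT=00 (Untranslated), and I never see AT=01
 * (Translation Request), which is why I think the ATC is not being used. */
#include <stdio.h>

static const char *at_field_name(unsigned int at)
{
        switch (at & 0x3) {
        case 0x0: return "Untranslated";        /* IOVA; the IOMMU must translate it */
        case 0x1: return "Translation Request"; /* ATC asking the TA for a translation */
        case 0x2: return "Translated";          /* address already translated by the ATC */
        default:  return "Reserved";
        }
}

int main(void)
{
        for (unsigned int at = 0; at < 4; at++)
                printf("AT=%u%u: %s\n", (at >> 1) & 1, at & 1, at_field_name(at));
        return 0;
}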
What do you think? If you need me to gather some info of my setup, please let me know. Thanks.
@jack-chen1688 Do you still need assistance with this ticket? If not, please close the ticket. Thanks!
@jack-chen1688 Closing ticket. Please feel free to re-open ticket if you still need assistance. Thanks!
Hello,
I want to study the effect of ATS on compute performance. In drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c, I saw the following:

int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm)
{
        bool pte_support_ats = (adev->asic_type == CHIP_RAVEN);
        ...
Does this mean that only CHIP_RAVEN GPUs can use ATS in a guest OS?
I have a Lexa PRO GPU card and can run some HIP code in a VM now. From the lspci info, it has the ATS capability. See the info below:
2d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7) (prog-if 00 [VGA controller])
Subsystem: Hewlett-Packard Company Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 103
Region 0: Memory at e0000000 (64-bit, prefetchable) [size=256M]
Region 2: Memory at f0000000 (64-bit, prefetchable) [size=2M]
Region 4: I/O ports at f000 [size=256]
Region 5: Memory at fce00000 (32-bit, non-prefetchable) [size=256K]
Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr+ FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #3, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <1us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s (downgraded), Width x1 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-, NROPrPrP-, LTR+
10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt+, EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-
AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn+
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 14, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 40001010 000000ff e0000040 00000000
Capabilities: [200 v1] Resizable BAR <?>
Capabilities: [270 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Capabilities: [2b0 v1] Address Translation Service (ATS)
ATSCap: Invalidate Queue Depth: 00
ATSCtl: Enable+, Smallest Translation Unit: 00
Capabilities: [2c0 v1] Page Request Interface (PRI)
PRICtl: Enable- Reset-
PRISta: RF- UPRGI- Stopped+
Page Request Capacity: 00000020, Page Request Allocation: 00000000
Capabilities: [2d0 v1] Process Address Space ID (PASID)
PASIDCap: Exec+ Priv+, Max PASID Width: 10
PASIDCtl: Enable- Exec- Priv-
Capabilities: [320 v1] Latency Tolerance Reporting
Max snoop latency: 1048576ns
Max no snoop latency: 1048576ns
Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [370 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=0us PortTPowerOnTime=170us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=32768ns
L1SubCtl2: T_PwrOn=170us
Kernel driver in use: vfio-pci
Kernel modules: amdgpu
Is there a way to enable this card to use ATS in a guest OS? If not, could you suggest which cards I could use to try out the PCIe ATS feature? Thanks.
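For context, in the lspci dump above ATS is enabled but PRI and PASID are not. My (possibly wrong) understanding is that the full ATS/PRI path would also need the kernel to enable PASID and PRI on the endpoint, along these lines (a rough sketch using the generic PCI helpers; the function name is made up and the real flow in the IOMMU/amdkfd drivers may differ):

#include <linux/mm.h>
#include <linux/pci.h>
#include <linux/pci-ats.h>

/* Hypothetical sketch (the function name is made up): enable PASID, PRI and
 * ATS on the endpoint in the order the AMD IOMMU driver uses. The "32" matches
 * the Page Request Capacity of 0x20 shown in the lspci dump above. */
static int sketch_enable_ats_pri_pasid(struct pci_dev *pdev)
{
        int ret;

        ret = pci_enable_pasid(pdev, 0);        /* sets PASIDCtl: Enable+ */
        if (ret)
                return ret;

        ret = pci_enable_pri(pdev, 32);         /* sets PRICtl: Enable+ */
        if (ret)
                goto out_pasid;

        ret = pci_enable_ats(pdev, PAGE_SHIFT); /* sets ATSCtl: Enable+ */
        if (ret)
                goto out_pri;

        return 0;

out_pri:
        pci_disable_pri(pdev);
out_pasid:
        pci_disable_pasid(pdev);
        return ret;
}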