geerlingguy / raspberry-pi-pcie-devices

Raspberry Pi PCI Express device compatibility database
http://pipci.jeffgeerling.com
GNU General Public License v3.0
1.53k stars 137 forks

Test GPU (AMD Radeon RX 6700 XT) #222

Open geerlingguy opened 2 years ago

geerlingguy commented 2 years ago

Working branch: https://github.com/geerlingguy/linux/pull/1

Just received an OEM AMD Radeon RX 6700 XT in the mail. I was able to get it at MSRP+Shipping, which is something of a miracle these days:

[Images: DSC02333, DSC02363]

I will be interested in seeing what, if anything, the card does when powered up and plugged into the Compute Module 4 IO Board!

The following issues are closely related:

Latest recap: https://github.com/geerlingguy/raspberry-pi-pcie-devices/issues/222#issuecomment-919530424

geerlingguy commented 2 years ago

A few notes on drivers from the Twitterverse:

@linux4kix mentioned:

@geerlingguy You will need to use a pre 5.10 kernel for basic Navi on Aarch64. A driver rework needs to be done to fix amdgpu dcn support which was reverted for 5.10. https://lists.freedesktop.org/archives/dri-devel/2021-January/292867.html

@ric96 said:

@geerlingguy Don't forget to use upstream linux-firmware for the correct blob

So yeah... this one could be interesting, and I think my first attempts will be a bit faltering. We'll see.

geerlingguy commented 2 years ago

pi@cm4:~ $ lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 20)
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1478 (rev c1)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1479
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73df (rev c1)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device ab28

pi@cm4:~ $ sudo lspci -vvvv
...
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1478 (rev c1) (prog-if 00 [Normal decode])
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 255
    Region 0: Memory at 618200000 (32-bit, non-prefetchable) [disabled] [size=16K]
    Bus: primary=01, secondary=02, subordinate=03, sec-latency=0
    I/O behind bridge: 0000f000-00000fff
    Memory behind bridge: d8000000-d81fffff
    Prefetchable memory behind bridge: 00000000c0000000-00000000d7ffffff
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
    BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v2) Upstream Port, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ SlotPowerLimit 0.000W
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed unknown, Width x16, ASPM L1, Exit Latency L0s unlimited, L1 <64us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
        LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
             EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
    Capabilities: [270 v1] #19
    Capabilities: [320 v1] Latency Tolerance Reporting
        Max snoop latency: 0ns
        Max no snoop latency: 0ns
    Capabilities: [400 v1] #25
    Capabilities: [410 v1] #26
    Capabilities: [440 v1] #27

02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1479 (prog-if 00 [Normal decode])
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 255
    Bus: primary=02, secondary=03, subordinate=03, sec-latency=0
    I/O behind bridge: 0000f000-00000fff
    Memory behind bridge: d8000000-d81fffff
    Prefetchable memory behind bridge: 00000000c0000000-00000000d7ffffff
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
    BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v2) Downstream Port (Slot-), MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0
            ExtTag+ RBE+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed unknown, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
            ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
        LnkCtl: ASPM Disabled; Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported ARIFwd-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
        LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
             EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
    Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 1479
    Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
    Capabilities: [270 v1] #19
    Capabilities: [2a0 v1] Access Control Services
        ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Capabilities: [400 v1] #25
    Capabilities: [410 v1] #26
    Capabilities: [440 v1] #27

03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73df (rev c1) (prog-if 00 [VGA controller])
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0e36
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 255
    Region 0: Memory at 600000000 (64-bit, prefetchable) [disabled] [size=256M]
    Region 2: Memory at 610000000 (64-bit, prefetchable) [disabled] [size=2M]
    Region 4: I/O ports at <unassigned> [disabled]
    Region 5: Memory at 618000000 (32-bit, non-prefetchable) [disabled] [size=1M]
    [virtual] Expansion ROM at 618100000 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed unknown, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
        LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
             EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
    Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
    Capabilities: [200 v1] #15
    Capabilities: [240 v1] Power Budgeting <?>
    Capabilities: [270 v1] #19
    Capabilities: [2a0 v1] Access Control Services
        ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Capabilities: [2d0 v1] Process Address Space ID (PASID)
        PASIDCap: Exec+ Priv+, Max PASID Width: 10
        PASIDCtl: Enable- Exec- Priv-
    Capabilities: [320 v1] Latency Tolerance Reporting
        Max snoop latency: 0ns
        Max no snoop latency: 0ns
    Capabilities: [410 v1] #26
    Capabilities: [440 v1] #27

03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device ab28
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device ab28
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin B routed to IRQ 255
    Region 0: Memory at 618120000 (32-bit, non-prefetchable) [disabled] [size=16K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed unknown, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
             EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
    Capabilities: [2a0 v1] Access Control Services
        ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

geerlingguy commented 2 years ago

pi@cm4:~ $ dmesg | grep pci
[    1.261278] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    1.261305] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    1.261373] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x063fffffff -> 0x00c0000000
[    1.261447] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0400000000
[    1.308507] brcm-pcie fd500000.pcie: link up, 5.0 GT/s PCIe x1 (SSC)
[    1.308896] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[    1.308914] pci_bus 0000:00: root bus resource [bus 00-ff]
[    1.308940] pci_bus 0000:00: root bus resource [mem 0x600000000-0x63fffffff] (bus address [0xc0000000-0xffffffff])
[    1.309028] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    1.309262] pci 0000:00:00.0: PME# supported from D0 D3hot
[    1.313103] pci 0000:00:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[    1.313417] pci 0000:01:00.0: [1002:1478] type 01 class 0x060400
[    1.313474] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00003fff]
[    1.313873] pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
[    1.313969] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x1 link at 0000:00:00.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[    1.317679] pci 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    1.318042] pci 0000:02:00.0: [1002:1479] type 01 class 0x060400
[    1.318515] pci 0000:02:00.0: PME# supported from D0 D3hot D3cold
[    1.322211] pci 0000:02:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    1.322530] pci 0000:03:00.0: [1002:73df] type 00 class 0x030000
[    1.322595] pci 0000:03:00.0: reg 0x10: [mem 0x00000000-0x0fffffff 64bit pref]
[    1.322637] pci 0000:03:00.0: reg 0x18: [mem 0x00000000-0x001fffff 64bit pref]
[    1.322667] pci 0000:03:00.0: reg 0x20: [io  0x0000-0x00ff]
[    1.322695] pci 0000:03:00.0: reg 0x24: [mem 0x00000000-0x000fffff]
[    1.322724] pci 0000:03:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
[    1.323058] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold
[    1.323147] pci 0000:03:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x1 link at 0000:00:00.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[    1.323306] pci 0000:03:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    1.323421] pci 0000:03:00.1: [1002:ab28] type 00 class 0x040300
[    1.323470] pci 0000:03:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
[    1.323795] pci 0000:03:00.1: PME# supported from D1 D2 D3hot D3cold
[    1.327530] pci_bus 0000:03: busn_res: [bus 03-ff] end is updated to 03
[    1.327555] pci_bus 0000:02: busn_res: [bus 02-ff] end is updated to 03
[    1.327576] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 03
[    1.327628] pci 0000:00:00.0: BAR 9: assigned [mem 0x600000000-0x617ffffff 64bit pref]
[    1.327644] pci 0000:00:00.0: BAR 8: assigned [mem 0x618000000-0x6182fffff]
[    1.327665] pci 0000:01:00.0: BAR 9: assigned [mem 0x600000000-0x617ffffff 64bit pref]
[    1.327680] pci 0000:01:00.0: BAR 8: assigned [mem 0x618000000-0x6181fffff]
[    1.327696] pci 0000:01:00.0: BAR 0: assigned [mem 0x618200000-0x618203fff]
[    1.327716] pci 0000:01:00.0: BAR 7: no space for [io  size 0x1000]
[    1.327729] pci 0000:01:00.0: BAR 7: failed to assign [io  size 0x1000]
[    1.327747] pci 0000:02:00.0: BAR 9: assigned [mem 0x600000000-0x617ffffff 64bit pref]
[    1.327761] pci 0000:02:00.0: BAR 8: assigned [mem 0x618000000-0x6181fffff]
[    1.327774] pci 0000:02:00.0: BAR 7: no space for [io  size 0x1000]
[    1.327786] pci 0000:02:00.0: BAR 7: failed to assign [io  size 0x1000]
[    1.327805] pci 0000:03:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[    1.327844] pci 0000:03:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[    1.327880] pci 0000:03:00.0: BAR 5: assigned [mem 0x618000000-0x6180fffff]
[    1.327902] pci 0000:03:00.0: BAR 6: assigned [mem 0x618100000-0x61811ffff pref]
[    1.327917] pci 0000:03:00.1: BAR 0: assigned [mem 0x618120000-0x618123fff]
[    1.327936] pci 0000:03:00.0: BAR 4: no space for [io  size 0x0100]
[    1.327949] pci 0000:03:00.0: BAR 4: failed to assign [io  size 0x0100]
[    1.327964] pci 0000:02:00.0: PCI bridge to [bus 03]
[    1.327987] pci 0000:02:00.0:   bridge window [mem 0x618000000-0x6181fffff]
[    1.328007] pci 0000:02:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[    1.328032] pci 0000:01:00.0: PCI bridge to [bus 02-03]
[    1.328053] pci 0000:01:00.0:   bridge window [mem 0x618000000-0x6181fffff]
[    1.328072] pci 0000:01:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[    1.328096] pci 0000:00:00.0: PCI bridge to [bus 01-03]
[    1.328115] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6182fffff]
[    1.328131] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[    1.328349] pci 0000:03:00.1: D0 power state depends on 0000:03:00.0
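
The bandwidth figures in the log above can be sanity-checked with a little shell arithmetic. This is only a sketch: `link_mbps` is a hypothetical helper, and it assumes the usual line encodings (8b/10b at 5 GT/s, 128b/130b at 16 GT/s). It happens to reproduce the numbers the kernel printed, but it is not taken from the kernel source.

```shell
#!/bin/sh
# Usable bandwidth in Mb/s for a PCIe link (hypothetical helper).
# $1: per-lane rate in MT/s
# $2, $3: encoding ratio (payload bits / line bits)
# $4: number of lanes
link_mbps() {
    # Integer math, evaluated left to right: per-lane rate first, then lanes.
    echo $(( $1 * $2 / $3 * $4 ))
}

link_mbps 5000 8 10 1        # CM4 link: 5 GT/s x1, 8b/10b        -> 4000
link_mbps 16000 128 130 16   # card's best: 16 GT/s x16, 128b/130b -> 252048
```

So the card is running at roughly 1/63 of the bandwidth it could use in a full x16 slot.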

geerlingguy commented 2 years ago

While compiling on kernel version 5.10 from the raspberrypi/linux tree, I noticed an error:

  AR      drivers/ptp/built-in.a
  CC [M]  drivers/i2c/busses/i2c-brcmstb.o
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c: In function 'amdgpu_dm_atomic_commit_tail':
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7757:4: error: implicit declaration of function 'is_hdr_metadata_different'; did you mean 'is_scaling_state_different'? [-Werror=implicit-function-declaration]
    is_hdr_metadata_different(old_con_state, new_con_state);
    ^~~~~~~~~~~~~~~~~~~~~~~~~
    is_scaling_state_different
  CC [M]  drivers/media/i2c/cx25840/cx25840-firmware.o
  CC [M]  drivers/media/i2c/cx25840/cx25840-vbi.o
  AR      drivers/i2c/muxes/built-in.a
...
  LD [M]  drivers/media/dvb-frontends/drxd.o
  LD [M]  drivers/media/dvb-frontends/stv0900.o
  LD [M]  drivers/media/dvb-frontends/cxd2820r.o
  LD [M]  drivers/media/dvb-frontends/drxk.o
make: *** [Makefile:1825: drivers] Error 2

6by9 commented 2 years ago

Looks like that call site was missed in https://github.com/raspberrypi/linux/commit/6bd46342fadfdfb0a40d674f9161104f2e691873, which removed is_hdr_metadata_different in favor of the generic helper function drm_connector_atomic_hdr_metadata_equal.
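
A quick way to look for any other leftover call sites is to grep the AMD DRM tree for the removed name. This is a sketch only: `find_stale_hdr_calls` is a hypothetical helper, and the path argument is assumed to be the root of a kernel checkout.

```shell
#!/bin/sh
# List source lines that still reference the removed helper.
# $1: root of the kernel source tree (assumed; defaults to the current directory)
find_stale_hdr_calls() {
    grep -rn 'is_hdr_metadata_different' "${1:-.}/drivers/gpu/drm/amd"
}
```

Running `find_stale_hdr_calls ~/linux` prints `file:line:text` for each hit and returns non-zero when nothing is left. Note that the replacement helper tests for equality rather than difference, so any remaining call would presumably need its result negated when it is swapped over.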

geerlingguy commented 2 years ago

2nd Attempt:

  1. Recompiled kernel on rpi-5.14.y branch with AMDGPU selected. Seemed to work.
  2. Copied over to Pi.
  3. Installed the AMD firmware package: sudo apt install -y firmware-amd-graphics
  4. Blacklisted amdgpu via /etc/modprobe.d/blacklist-amdgpu.conf
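
For reference, the blacklist step can be sketched like this. The file contents are an assumption (the usual single `blacklist` line), and `blacklist_amdgpu` with its directory argument is a hypothetical helper so the sketch can be tried in a scratch directory first.

```shell
#!/bin/sh
# Write a modprobe blacklist entry for amdgpu.
# $1: config directory (assumed default: /etc/modprobe.d, which needs root)
blacklist_amdgpu() {
    conf_dir="${1:-/etc/modprobe.d}"
    # Assumed file contents: one blacklist line for the module.
    printf 'blacklist amdgpu\n' > "$conf_dir/blacklist-amdgpu.conf"
}
```

The point of blacklisting is that the module no longer loads automatically at boot when the card is detected, so a bad driver cannot hang the Pi before there is a shell. An explicit `sudo modprobe amdgpu` still loads it by name.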

Rebooting...

geerlingguy commented 2 years ago

Without the card plugged in, a sudo modprobe amdgpu gets me:

[  431.751110] [drm] amdgpu kernel modesetting enabled.

Now trying with the card plugged in...

geerlingguy commented 2 years ago

Good news! The Pi doesn't completely lock up and halt now... it errors out then goes back to letting me debug. Makes test cycles oh-so-much-simpler:

In one terminal:

pi@cm4:~ $ sudo modprobe amdgpu

And in the other:

pi@cm4:~ $ dmesg --follow
...
[   83.281692] [drm] amdgpu kernel modesetting enabled.
[   83.282319] pci 0000:00:00.0: enabling device (0000 -> 0002)
[   83.282361] pci 0000:01:00.0: enabling device (0000 -> 0002)
[   83.282398] pci 0000:02:00.0: enabling device (0000 -> 0002)
[   83.282430] amdgpu 0000:03:00.0: enabling device (0000 -> 0002)
[   83.282453] [drm] initializing kernel modesetting (NAVY_FLOUNDER 0x1002:0x73DF 0x1002:0x0E36 0xC1).
[   83.282474] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[   83.282543] [drm] register mmio base: 0x18000000
[   83.282554] [drm] register mmio size: 1048576
[   83.282578] [drm] PCIE atomic ops is not supported
[   83.284144] [drm] add ip block number 0 <nv_common>
[   83.284150] [drm] add ip block number 1 <gmc_v10_0>
[   83.284373] [drm] add ip block number 2 <navi10_ih>
[   83.284395] [drm] add ip block number 3 <psp>
[   83.284401] [drm] add ip block number 4 <smu>
[   83.284419] [drm] add ip block number 5 <gfx_v10_0>
[   83.284425] [drm] add ip block number 6 <sdma_v5_2>
[   83.284431] [drm] add ip block number 7 <vcn_v3_0>
[   83.284435] [drm] add ip block number 8 <jpeg_v3_0>
[   83.319061] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM
[   83.319078] amdgpu: ATOM BIOS: 113-D5121100-101
[   83.319115] [drm] VCN(0) decode is enabled in VM mode
[   83.319121] [drm] VCN(0) encode is enabled in VM mode
[   83.319127] [drm] JPEG decode is enabled in VM mode
[   83.319148] [drm] GPU posting now...
[   83.319230] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[   83.319265] amdgpu 0000:03:00.0: BAR 2: releasing [mem 0x610000000-0x6101fffff 64bit pref]
[   83.319275] amdgpu 0000:03:00.0: BAR 0: releasing [mem 0x600000000-0x60fffffff 64bit pref]
[   83.319324] pci 0000:02:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[   83.319332] pci 0000:01:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[   83.319343] pci 0000:00:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[   83.319362] pci 0000:00:00.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
[   83.319369] pci 0000:00:00.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
[   83.319378] pci 0000:01:00.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
[   83.319383] pci 0000:01:00.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
[   83.319391] pci 0000:02:00.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
[   83.319397] pci 0000:02:00.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
[   83.319406] amdgpu 0000:03:00.0: BAR 0: no space for [mem size 0x400000000 64bit pref]
[   83.319411] amdgpu 0000:03:00.0: BAR 0: failed to assign [mem size 0x400000000 64bit pref]
[   83.319419] amdgpu 0000:03:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
[   83.319424] amdgpu 0000:03:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
[   83.319431] pci 0000:00:00.0: PCI bridge to [bus 01-03]
[   83.319442] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6182fffff]
[   83.319456] pci 0000:00:00.0: PCI bridge to [bus 01-03]
[   83.319465] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6182fffff]
[   83.319473] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[   83.319483] pci 0000:01:00.0: PCI bridge to [bus 02-03]
[   83.319494] pci 0000:01:00.0:   bridge window [mem 0x618000000-0x6181fffff]
[   83.319504] pci 0000:01:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[   83.319517] pci 0000:02:00.0: PCI bridge to [bus 03]
[   83.319529] pci 0000:02:00.0:   bridge window [mem 0x618000000-0x6181fffff]
[   83.319538] pci 0000:02:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[   83.319566] [drm] Not enough PCI address space for a large BAR.
[   83.319573] amdgpu 0000:03:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[   83.319595] amdgpu 0000:03:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[   83.319625] amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
[   83.319633] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   83.319641] amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[   83.319649] [drm] Detected VRAM RAM=12272M, BAR=256M
[   83.319654] [drm] RAM width 192bits GDDR6
[   83.319767] [drm] amdgpu: 12272M of VRAM memory ready
[   83.319775] [drm] amdgpu: 2845M of GTT memory ready.
[   83.319794] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   83.319943] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[   83.322016] amdgpu 0000:03:00.0: Direct firmware load for amdgpu/navy_flounder_sos.bin failed with error -2
[   83.322037] amdgpu 0000:03:00.0: amdgpu: failed to init sos firmware
[   83.322044] [drm:psp_sw_init [amdgpu]] *ERROR* Failed to load psp firmware!
[   83.322472] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <psp> failed -2
[   83.322795] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[   83.322802] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
[   83.322808] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[   83.323187] amdgpu: probe of 0000:03:00.0 failed with error -2
[   83.323329] [drm] amdgpu: ttm finalized

geerlingguy commented 2 years ago

Hmm... firmware-amd-graphics might not include firmware for the RX 6700 XT (see https://github.com/NixOS/nixpkgs/issues/122776), since the card is new enough to not have been packaged in whatever build that package is based on :(
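
The `error -2` in that log is -ENOENT, meaning the kernel could not find the file at all. A small filter can pull the missing file names out of the kernel log. This is a sketch: `missing_firmware` is a hypothetical helper that only matches the "Direct firmware load ... failed with error -2" wording seen above.

```shell
#!/bin/sh
# Print firmware paths the kernel failed to find (error -2 is -ENOENT).
# Reads kernel log text on stdin.
missing_firmware() {
    sed -n 's/.*Direct firmware load for \([^ ]*\) failed with error -2.*/\1/p'
}
```

Used as `dmesg | missing_firmware`, it would print `amdgpu/navy_flounder_sos.bin` for the log above. The path is relative to the firmware directory, /lib/firmware on this setup.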

See more: Radeon RX 6700 XT "Navy Flounder" Microcode Lands In Linux-Firmware.Git, and the commit where firmware was added. (Good ol' Phoronix)

geerlingguy commented 2 years ago

First time doing this (grabbing newer firmware from the linux-firmware repo):

  1. git clone git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
  2. sudo cp linux-firmware/amdgpu/navy_flounder* /lib/firmware/amdgpu
  3. sudo reboot
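
Before rebooting, it may be worth checking that every navy_flounder blob from the checkout actually landed in the firmware directory. A minimal sketch, with `missing_navy_flounder` as a hypothetical helper and both paths taken from the steps above:

```shell
#!/bin/sh
# Print navy_flounder firmware files present in the checkout but not installed.
# $1: amdgpu directory of the linux-firmware checkout
# $2: installed firmware directory (defaults to /lib/firmware/amdgpu)
missing_navy_flounder() {
    src="$1"
    dst="${2:-/lib/firmware/amdgpu}"
    for f in "$src"/navy_flounder*; do
        [ -e "$f" ] || continue              # glob matched nothing
        name=$(basename "$f")
        [ -e "$dst/$name" ] || printf '%s\n' "$name"
    done
}
```

`missing_navy_flounder linux-firmware/amdgpu` prints nothing when the copy is complete. Since only the latest blobs are needed, `git clone --depth 1` would also make the clone a good deal smaller.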

And now trying again...

geerlingguy commented 2 years ago

Okay, earlier firmware bug gave me false hope. We're still crashing and burning:

[   85.221462] [drm] amdgpu kernel modesetting enabled.
[   85.221843] pci 0000:00:00.0: enabling device (0000 -> 0002)
[   85.221866] pci 0000:01:00.0: enabling device (0000 -> 0002)
[   85.221886] pci 0000:02:00.0: enabling device (0000 -> 0002)
[   85.221904] amdgpu 0000:03:00.0: enabling device (0000 -> 0002)
[   85.221916] [drm] initializing kernel modesetting (NAVY_FLOUNDER 0x1002:0x73DF 0x1002:0x0E36 0xC1).
[   85.221929] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[   85.221965] [drm] register mmio base: 0x18000000
[   85.221970] [drm] register mmio size: 1048576
[   85.221984] [drm] PCIE atomic ops is not supported
[   85.223501] [drm] add ip block number 0 <nv_common>
[   85.223508] [drm] add ip block number 1 <gmc_v10_0>
[   85.223513] [drm] add ip block number 2 <navi10_ih>
[   85.223518] [drm] add ip block number 3 <psp>
[   85.223524] [drm] add ip block number 4 <smu>
[   85.223530] [drm] add ip block number 5 <gfx_v10_0>
[   85.223535] [drm] add ip block number 6 <sdma_v5_2>
[   85.223540] [drm] add ip block number 7 <vcn_v3_0>
[   85.223545] [drm] add ip block number 8 <jpeg_v3_0>
[   85.258238] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM
[   85.258256] amdgpu: ATOM BIOS: 113-D5121100-101
[   85.258293] [drm] VCN(0) decode is enabled in VM mode
[   85.258298] [drm] VCN(0) encode is enabled in VM mode
[   85.258304] [drm] JPEG decode is enabled in VM mode
[   85.258324] [drm] GPU posting now...
[   85.258413] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[   85.258451] amdgpu 0000:03:00.0: BAR 2: releasing [mem 0x610000000-0x6101fffff 64bit pref]
[   85.258461] amdgpu 0000:03:00.0: BAR 0: releasing [mem 0x600000000-0x60fffffff 64bit pref]
[   85.258510] pci 0000:02:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[   85.258517] pci 0000:01:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[   85.258524] pci 0000:00:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[   85.258545] pci 0000:00:00.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
[   85.258551] pci 0000:00:00.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
[   85.258560] pci 0000:01:00.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
[   85.258566] pci 0000:01:00.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
[   85.258574] pci 0000:02:00.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
[   85.258580] pci 0000:02:00.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
[   85.258588] amdgpu 0000:03:00.0: BAR 0: no space for [mem size 0x400000000 64bit pref]
[   85.258594] amdgpu 0000:03:00.0: BAR 0: failed to assign [mem size 0x400000000 64bit pref]
[   85.258601] amdgpu 0000:03:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
[   85.258607] amdgpu 0000:03:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
[   85.258614] pci 0000:00:00.0: PCI bridge to [bus 01-03]
[   85.258624] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6182fffff]
[   85.258638] pci 0000:00:00.0: PCI bridge to [bus 01-03]
[   85.258647] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6182fffff]
[   85.258655] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[   85.258665] pci 0000:01:00.0: PCI bridge to [bus 02-03]
[   85.258676] pci 0000:01:00.0:   bridge window [mem 0x618000000-0x6181fffff]
[   85.258686] pci 0000:01:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[   85.258699] pci 0000:02:00.0: PCI bridge to [bus 03]
[   85.258710] pci 0000:02:00.0:   bridge window [mem 0x618000000-0x6181fffff]
[   85.258720] pci 0000:02:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[   85.258747] [drm] Not enough PCI address space for a large BAR.
[   85.258754] amdgpu 0000:03:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[   85.258775] amdgpu 0000:03:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[   85.258804] amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
[   85.258813] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   85.258820] amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[   85.258828] [drm] Detected VRAM RAM=12272M, BAR=256M
[   85.258834] [drm] RAM width 192bits GDDR6
[   85.258945] [drm] amdgpu: 12272M of VRAM memory ready
[   85.258953] [drm] amdgpu: 2845M of GTT memory ready.
[   85.258971] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   85.259113] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
geerlingguy commented 2 years ago

It does seem like it's running out of address space for a large BAR:

[   85.258747] [drm] Not enough PCI address space for a large BAR.
[   85.258828] [drm] Detected VRAM RAM=12272M, BAR=256M

But that doesn't seem to be the issue here.
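
For reference, the sizes behind those messages can be worked out from the addresses in the log. This is just a small userspace sketch of the arithmetic (the function names are made up for illustration, they are not from the kernel); my assumption is that the 0x400000000 request is the driver trying to resize BAR 0 to the next power of two above the 12272M of VRAM.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Size in bytes of an inclusive [start, end] range, as printed in dmesg. */
static uint64_t range_size(uint64_t start, uint64_t end)
{
    return end - start + 1;
}

/* Bytes to MiB (1 MiB = 2^20 bytes). */
static uint64_t to_mib(uint64_t bytes)
{
    return bytes >> 20;
}

/* Print the sizes involved in the "large BAR" attempt from the log above. */
void print_bar_summary(void)
{
    /* Prefetchable bridge window: 0x600000000-0x617ffffff -> 384 MiB */
    uint64_t window = range_size(0x600000000ULL, 0x617ffffffULL);
    /* BAR 0 as finally assigned: 0x600000000-0x60fffffff -> 256 MiB */
    uint64_t bar0_assigned = range_size(0x600000000ULL, 0x60fffffffULL);
    /* BAR 0 as requested during the resize attempt -> 16384 MiB */
    uint64_t bar0_requested = 0x400000000ULL;

    printf("bridge window:  %" PRIu64 " MiB\n", to_mib(window));
    printf("BAR 0 assigned: %" PRIu64 " MiB\n", to_mib(bar0_assigned));
    printf("BAR 0 request:  %" PRIu64 " MiB\n", to_mib(bar0_requested));
    printf("request fits in window: %s\n",
           bar0_requested <= window ? "yes" : "no");
}
```

So the 16 GiB request can't fit in a 384 MiB window, and the driver falls back to the 256M BAR it had before, which matches the BAR=256M line.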

geerlingguy commented 2 years ago

Added a few debug lines, and things were a little different!

[  115.560635] [drm] amdgpu: 12272M of VRAM memory ready
[  115.560677] [drm] amdgpu: 2845M of GTT memory ready.
[  115.560718] [drm] GART: num cpu pages 131072, num gpu pages 131072
[  115.560755] DEBUG: Passed gmc_v10_0_hw_init 1069 
[  115.560973] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[  115.560984] DEBUG: Passed gmc_v10_0_hw_init 1078 
[  115.587372] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[  116.615220] ------------[ cut here ]------------
[  116.615231] Firmware transaction timeout
[  116.615282] WARNING: CPU: 3 PID: 37 at drivers/firmware/raspberrypi.c:67 rpi_firmware_transaction+0xdc/0x108
[  116.615301] Modules linked in: amdgpu(+) drm_ttm_helper ttm i2c_algo_bit rfcomm bnep hci_uart btbcm bluetooth ecdh_generic ecc fuse 8021q garp stp llc snd_soc_hdmi_codec brcmfmac brcmutil v3d vc4 cec cfg80211 bcm2835_codec(C) drm_kms_helper gpu_sched rfkill snd_soc_core drm raspberrypi_hwmon v4l2_mem2mem snd_compress snd_bcm2835(C) bcm2835_v4l2(C) drm_panel_orientation_quirks bcm2835_isp(C) videobuf2_vmalloc snd_pcm_dmaengine bcm2835_mmal_vchiq(C) videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 videobuf2_common i2c_brcmstb snd_pcm videodev snd_timer dwc2 mc vc_sm_cma(C) snd syscopyarea sysfillrect sysimgblt roles fb_sys_fops backlight rpivid_mem uio_pdrv_genirq uio nvmem_rmem i2c_dev aes_neon_bs sha256_generic aes_neon_blk crypto_simd cryptd ip_tables x_tables ipv6
[  116.615461] CPU: 3 PID: 37 Comm: kworker/3:1 Tainted: G         C        5.14.2-v8+ #1
[  116.615467] Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
[  116.615472] Workqueue: events dbs_work_handler
[  116.615485] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[  116.615490] pc : rpi_firmware_transaction+0xdc/0x108
[  116.615495] lr : rpi_firmware_transaction+0xdc/0x108
[  116.615499] sp : ffffffc0117639c0
[  116.615502] x29: ffffffc0117639c0 x28: ffffffc011763d20 x27: 0000000000000000
[  116.615512] x26: ffffff8042fddd00 x25: ffffff80409cdd00 x24: ffffffc011a7e008

Not sure what PSP runtime database doesn't exist means, but the Firmware transaction timeout seems related to the Pi's own firmware?

geerlingguy commented 2 years ago

Tried: sudo SKIP_KERNEL=1 rpi-update, then rebooted. Now it's just hanging at:

[  115.560984] DEBUG: Passed gmc_v10_0_hw_init 1078 

And the green ACT light on the IO board just stays lit green.

geerlingguy commented 2 years ago

Trying a few more times, with various debug statements. I can definitely get to gmc_v10_0_hw_init but I'm trying to dig around and see where the code is calling that through the amd_ip_funcs struct.

Anyways, sometimes I get back to:

[   96.885394] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
geerlingguy commented 2 years ago

Another run with some more debugging:

[   59.061056] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   59.061084] DEBUG: Passed gmc_v10_0_hw_init 1075 
[   59.061216] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[   59.061222] DEBUG: Passed gmc_v10_0_hw_init 1084 
[   59.061784] DEBUG: Passed psp_sw_init 250 
[   59.083186] DEBUG: Passed psp_sw_init 266 
[   59.083216] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[   59.083223] DEBUG: Passed psp_sw_init 289 
[   61.088295] ------------[ cut here ]------------
[   61.088317] Firmware transaction timeout
[   61.088366] WARNING: CPU: 3 PID: 98 at drivers/firmware/raspberrypi.c:67 rpi_firmware_transaction+0xdc/0x108
[   61.088392] Modules linked in: amdgpu(+) drm_ttm_helper ttm i2c_algo_bit rfcomm bnep hci_uart btbcm bluetooth ecdh_generic ecc fuse 8021q garp stp llc snd_soc_hdmi_codec brcmfmac vc4 brcmutil cec v3d drm_kms_helper gpu_sched drm cfg80211 rfkill drm_panel_orientation_quirks bcm2835_codec(C) bcm2835_v4l2(C) bcm2835_isp(C) bcm2835_mmal_vchiq(C) v4l2_mem2mem videobuf2_vmalloc videobuf2_dma_contig raspberrypi_hwmon videobuf2_memops videobuf2_v4l2 snd_soc_core i2c_brcmstb videobuf2_common dwc2 roles videodev snd_compress snd_bcm2835(C) mc snd_pcm_dmaengine vc_sm_cma(C) snd_pcm snd_timer snd syscopyarea sysfillrect sysimgblt fb_sys_fops rpivid_mem backlight uio_pdrv_genirq uio nvmem_rmem i2c_dev aes_neon_bs sha256_generic aes_neon_blk crypto_simd cryptd ip_tables x_tables ipv6
[   61.088679] CPU: 3 PID: 98 Comm: kworker/3:2 Tainted: G         C        5.14.2-v8+ #1
[   61.088690] Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
[   61.088698] Workqueue: events dbs_work_handler
[   61.088718] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[   61.088727] pc : rpi_firmware_transaction+0xdc/0x108
[   61.088736] lr : rpi_firmware_transaction+0xdc/0x108
[   61.088744] sp : ffffffc011be39c0
[   61.088749] x29: ffffffc011be39c0 x28: ffffffc011be3d20 x27: 0000000000000000
[   61.088768] x26: ffffff8058594d80 x25: ffffff80409cdd00 x24: ffffffc011a7d008
[   61.088785] x23: 0000000000001000 x22: ffffff80409cdd00 x21: 00000000ffffff92
[   61.088802] x20: ffffffc01146f520 x19: ffffffc0112f8948 x18: 0000000000000000
[   61.088818] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[   61.088833] x14: 0000000000000000 x13: 74756f656d697420 x12: ffffffc0113862c8
[   61.088849] x11: 0000000000000003 x10: ffffffc01136e288 x9 : ffffffc0100e6f00
[   61.088866] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : ffffffc011be3650
[   61.088882] x5 : ffffffc0ea7b0000 x4 : 0000000000000000 x3 : 0000000000000001
[   61.088897] x2 : 0000000000000000 x1 : 20ef52a5bc805600 x0 : 0000000000000000
[   61.088913] Call trace:
[   61.088918]  rpi_firmware_transaction+0xdc/0x108
[   61.088926]  rpi_firmware_property_list+0xc0/0x180
[   61.088935]  rpi_firmware_property+0x78/0x110
[   61.088942]  raspberrypi_fw_set_rate+0x5c/0xd8
[   61.088953]  clk_change_rate+0xdc/0x4e8
[   61.088965]  clk_core_set_rate_nolock+0x1e4/0x238
[   61.088975]  clk_set_rate+0x44/0xb8
[   61.088984]  _set_opp+0x230/0x4f8
[   61.088996]  dev_pm_opp_set_rate+0x128/0x190
[   61.089007]  set_target+0x38/0x48

(Hit that same Pi firmware issue, but system is still hard locked up.)

Looks like it might be failing somewhere in here:

static int psp_sw_init(void *handle)
...
    if (mem_training_ctx->enable_mem_training) {
        ret = psp_memory_training_init(psp);
        if (ret) {
            DRM_ERROR("Failed to initialize memory training!\n");
            return ret;
        }

        ret = psp_mem_training(psp, PSP_MEM_TRAIN_COLD_BOOT);
        if (ret) {
            DRM_ERROR("Failed to process memory training!\n");
            return ret;
        }
    }
geerlingguy commented 2 years ago

Opened an issue on the 'official' tracker: Freedesktop GitLab - Can't get RX 6700 XT running on Raspberry Pi CM4.

elmeyer commented 2 years ago

The way I read this log is that the actual panic occurs when the Raspberry Pi itself is setting some clockspeed (PCIE bus? its own CPU? But why would that fail…) through a firmware call that times out. I think that’s why we’re not seeing that DRM error about failed memory training being printed, which leads me to believe we’re seeing the crashes occur at random points again? Smells familiar…

geerlingguy commented 2 years ago

Which leads me to believe we’re seeing the crashes occur at random points again? Smells familiar…

Indeed, I'm running through a few more tests just to see if I can get consistent results (with a ton of .5s delays mixed in).

I just checked before I was going to load amdgpu again, and saw these two errors too (completely random, a few minutes after booting the Pi, hadn't touched it):

[  610.888425] ------------[ cut here ]------------
[  610.888447] fw-clk-m2mc already disabled
[  610.888492] WARNING: CPU: 3 PID: 86 at drivers/clk/clk.c:960 clk_core_disable+0x258/0x290
...
[  610.889440] fw-clk-m2mc already unprepared
[  610.889474] WARNING: CPU: 3 PID: 86 at drivers/clk/clk.c:819 clk_core_unprepare+0x23c/0x260

And looking back, those same two errors occurred 10 seconds into the boot cycle. PCIe bus seems to not be up either on this boot:

[    1.228140] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    1.228179] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    1.228265] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x063fffffff -> 0x00c0000000
[    1.228355] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0400000000
[    1.545482] brcm-pcie fd500000.pcie: link down

But a reboot brings it right back.

geerlingguy commented 2 years ago

I'm also adding .5s delays with two lines like the following:

    printk(KERN_ALERT "DEBUG: Passed %s %d \n",__FUNCTION__,__LINE__);
    msleep(500);

And it looks like I can very consistently reach:

[   76.507503] [drm] Detected VRAM RAM=12272M, BAR=256M
[   76.507508] [drm] RAM width 192bits GDDR6
[   76.507617] [drm] amdgpu: 12272M of VRAM memory ready
[   76.507625] [drm] amdgpu: 2845M of GTT memory ready.
[   76.507643] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   76.507672] DEBUG: Passed gmc_v10_0_hw_init 1075 
[   76.507796] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[   76.507803] DEBUG: Passed gmc_v10_0_hw_init 1084 
[   76.508260] DEBUG: Passed psp_sw_init 262 
[   77.046534] DEBUG: Passed psp_sw_init 279 
[   77.564552] DEBUG: Passed psp_get_runtime_db_entry 201 
[   78.076551] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[   78.076566] DEBUG: Passed psp_sw_init 303 
[   78.588509] DEBUG: Passed psp_sw_init 308 
[   79.100496] DEBUG: Passed psp_sw_init 317

The next block of code, which does not run, is:

        ret = psp_mem_training(psp, PSP_MEM_TRAIN_COLD_BOOT);
        if (ret) {
            DRM_ERROR("Failed to process memory training!\n");
            return ret;
        }
geerlingguy commented 2 years ago

Debugging psp_v11_0_memory_training now:

[   26.845578] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[   26.845590] DEBUG: Passed psp_sw_init 303 
[   27.357576] DEBUG: Passed psp_sw_init 308 
[   27.869578] DEBUG: Passed psp_sw_init 317 
[   28.381584] DEBUG: Passed psp_v11_0_memory_training 612 
[   28.893586] DEBUG: Passed psp_v11_0_memory_training 623 
[   29.405609] DEBUG: Passed psp_v11_0_memory_training 634 
[   29.917580] DEBUG: Passed psp_v11_0_memory_training 642 
[   30.429593] DEBUG: Passed psp_v11_0_memory_training 650 
[   30.941605] DEBUG: Passed psp_v11_0_memory_training 658 
[   31.453586] DEBUG: Passed psp_v11_0_memory_training 667 
[   31.965598] DEBUG: Passed psp_v11_0_memory_training 677 
[   32.477579] DEBUG: Passed psp_v11_0_memory_training 686 
[   32.989579] DEBUG: Passed psp_v11_0_memory_training 694 
[   33.501583] DEBUG: Passed psp_v11_0_memory_training 708 
[   34.013581] DEBUG: Passed psp_v11_0_memory_training 718 
[   34.526817] DEBUG: Passed psp_v11_0_memory_training 727 

It looks like it's hitting this portion of code:

static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
...
    if (drm_dev_enter(&adev->ddev, &idx)) {
            memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
            ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
            if (ret) {
                DRM_ERROR("Send long training msg failed.\n");
                vfree(buf);
                drm_dev_exit(idx);
                return ret;
            }

memcpy_fromio() seems the likely culprit?

Edit: It seems like every time with debug statements around it, the system halts on the line:

memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
geerlingguy commented 2 years ago

Maybe it's time for me to read through the entire Linux Device Drivers book on PCIe memory access?

geerlingguy commented 2 years ago

Trimming down the debug to just before the memcpy_fromio() line:

static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
...
        if (drm_dev_enter(&adev->ddev, &idx)) {
            printk(KERN_ALERT "DEBUG: Passed %s %d \n",__FUNCTION__,__LINE__);
            printk(KERN_ALERT "DEBUG: addr %p, value %u, count %d \n",buf,adev->mman.aper_base_kaddr,sz);
            msleep(500);

            memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);

I see:

[   48.987688] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[   48.988976] DEBUG: Passed psp_v11_0_memory_training 692 
[   48.988991] DEBUG: addr 0000000022ac6957, value 536870912, count 33554432 
[   51.837474] ------------[ cut here ]------------
[   51.837490] Firmware transaction timeout
[   51.837532] WARNING: CPU: 1 PID: 177 at drivers/firmware/raspberrypi.c:67 rpi_firmware_transaction+0xdc/0x108
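
A note on reading those two debug values, as a hedged userspace sketch rather than anything from the driver: printk hashes pointers printed with %p by default (which would explain why addr changes on every run), and passing a pointer to %u most likely prints only its low 32 bits. The kernel address used below is HYPOTHETICAL, picked only so that its low 32 bits match the printed value; %px would be the conversion that prints a real, unhashed pointer.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* What a 32-bit conversion such as %u keeps of a 64-bit value. */
static uint32_t low_32_bits(uint64_t value)
{
    return (uint32_t)(value & 0xffffffffu);
}

/* Re-express the "value" and "count" numbers from the debug line. */
void explain_debug_values(void)
{
    uint32_t printed_value = 536870912u;  /* from "value 536870912" */
    uint32_t printed_count = 33554432u;   /* from "count 33554432" */
    /* HYPOTHETICAL kernel virtual address, for illustration only. */
    uint64_t example_kaddr = 0xffffffc020000000ULL;

    printf("value = 0x%08" PRIx32 "\n", printed_value);
    printf("count = 0x%08" PRIx32 " = %" PRIu32 " MiB\n",
           printed_count, printed_count >> 20);
    printf("low 32 bits of 0x%016" PRIx64 " = 0x%08" PRIx32 "\n",
           example_kaddr, low_32_bits(example_kaddr));
}
```

In other words, the copy is trying to pull 32 MiB out of the aperture in a single call.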
geerlingguy commented 2 years ago

Added an issue on the Raspberry Pi Forums too: Having trouble with AMD Radeon RX 6700 XT on CM4.

elmeyer commented 2 years ago

This smells so familiar that you may have to start writing single-byte loops.

geerlingguy commented 2 years ago

@elmeyer - Seems like a different issue.

I replaced:

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
index bc133db2d538..3c34949222a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
@@ -689,7 +689,19 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
        }

        if (drm_dev_enter(&adev->ddev, &idx)) {
-           memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
+           printk(KERN_ALERT "DEBUG: Passed %s %d \n",__FUNCTION__,__LINE__);
+           printk(KERN_ALERT "DEBUG: addr %p, value %u, count %d \n",buf,adev->mman.aper_base_kaddr,sz);
+           msleep(500);
+
+           int pos;
+           for(pos = 0;pos < sz; pos++){
+               memcpy_fromio(buf+pos,adev->mman.aper_base_kaddr+pos,1);
+           }
+           // memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
+
+           printk(KERN_ALERT "DEBUG: Passed %s %d \n",__FUNCTION__,__LINE__);
+           msleep(500);
+
            ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
            if (ret) {
                DRM_ERROR("Send long training msg failed.\n");

And it output:

[   70.605375] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[   70.632286] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[   70.633553] DEBUG: Passed psp_v11_0_memory_training 692 
[   70.633569] DEBUG: addr 00000000a6e09a10, value 536870912, count 33554432

But didn't get any further AFAICT.

geerlingguy commented 2 years ago

Just noting someone else who was working with memcpy_fromio/toio and experiencing hard crashes: https://stackoverflow.com/questions/28518336/how-do-i-use-memcpy-toio-fromio#comment45366546_28518336

Coreforge commented 2 years ago

To see where it gets to more easily, without recompiling the kernel each time, you could use a JTAG adapter, or a second Pi acting as one with OpenOCD. You'll have to set the maximum number of cores used to 1 because, at least with the configuration I used, I could only access the first one. You might also want to suppress RCU stalls, as the kernel complains about them when single stepping.

Crashing at memcpy_*io seems right to me as it's the same with the radeon driver.

geerlingguy commented 2 years ago

Over on the Pi forums, got the following response from jdb:

"Firmware transaction timeout" usually means the VPU has crashed.

From your linked issue on freedesktop.org, the memcpy_fromio boils down to this: https://elixir.bootlin.com/linux/latest/source/arch/arm/lib/copy_template.S

Which uses optimised loads and stores to access the PCIe outbound window.

This won't work on a CM4. At best you get garbage in the read data, at worst you trash the internal bus between CPU and PCIe - which is what seems to be happening because the VPU sometimes fails while the CPU trundles on.

You need to use dword-sized transfers only, readl()/writel().

It looks like @Coreforge did something similar here for the radeon driver.
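
To make "dword-sized transfers only" concrete, here is a minimal userspace sketch of a copy that only ever issues aligned 32-bit loads. It is not Coreforge's patch and not kernel code: copy_fromio_32() and load32() are hypothetical names, and in a driver the load would be a readl() on the mapped BAR instead of a plain volatile read. It also assumes a little-endian CPU, like the Pi.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for readl(): one aligned 32-bit load from the source region. */
static uint32_t load32(const volatile void *addr)
{
    return *(const volatile uint32_t *)addr;
}

/*
 * Copy n bytes from src to dst using only aligned 32-bit loads.
 * Unaligned heads and tails are handled by loading the aligned dword
 * that contains them and keeping just the bytes that were asked for.
 */
void copy_fromio_32(void *dst, const volatile void *src, size_t n)
{
    uint8_t *d = dst;
    uintptr_t s = (uintptr_t)src;

    while (n > 0) {
        uintptr_t word_addr = s & ~(uintptr_t)3; /* containing aligned dword */
        size_t offset = s - word_addr;           /* first wanted byte, 0..3 */
        size_t chunk = 4 - offset;               /* wanted bytes in this dword */
        uint32_t word;

        if (chunk > n)
            chunk = n;

        word = load32((const volatile void *)word_addr);
        memcpy(d, (const uint8_t *)&word + offset, chunk);

        d += chunk;
        s += chunk;
        n -= chunk;
    }
}
```

The trade-off is that a head or tail read touches up to three bytes outside the requested range, which is harmless for plain memory but would need checking for registers with read side effects.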

Coreforge commented 2 years ago

There are a few memcpy_toio and memcpy_fromio in the driver that likely will cause issues. The best way is probably to put a 32-bit version of the two functions and replace the calls with calls to the 32-bit versions (instead of doing what I did and having multiple functions that do exactly the same in multiple places)

Coreforge commented 2 years ago

I haven't tried it out yet (it compiles fine though), but this patch should replace all memcpy_toio, memcpy_fromio and memset_io calls with calls to a version that doesn't do 64-bit accesses. There might still be issues with BOs (might be very similar to how it is with the radeon driver). If there aren't any surprise problems, this might get a framebuffer though. It also looks like the amdgpu driver doesn't check the atombios signature like the radeon driver does, which means that the driver might have used a corrupted bios, but it just didn't complain.

geerlingguy commented 2 years ago

(instead of doing what I did and having multiple functions that do exactly the same in multiple places)

Hehe, as I was manually copying your patch to my branch, I was like "aaargh not this again!" But you do what you do to get past a bump, optimization comes later ;)

I'll test your patch soon, and report back with results.

Edit: Attaching a patch manually applied to the rpi-5.14.y branch, since the patch you had seemed to apply only to 5.10.y: coreforge-amdgpu-mem-to-32bits.txt

geerlingguy commented 2 years ago

Hmm... even with that patch, it hangs at:

[   94.881047] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[   94.881061] DEBUG: Passed psp_sw_init 296 
[   95.390206] DEBUG: Passed psp_v11_0_memory_training 692 
[   95.390258] DEBUG: addr 00000000903bc443, value 536870912, count 33554432 

Here's the debug code I injected (only the first two alerts get hit still):

        if (drm_dev_enter(&adev->ddev, &idx)) {
            printk(KERN_ALERT "DEBUG: Passed %s %d \n",__FUNCTION__,__LINE__);
            printk(KERN_ALERT "DEBUG: addr %p, value %u, count %d \n",buf,adev->mman.aper_base_kaddr,sz);
            msleep(500);

            memcpy_fromio_pcie(buf, adev->mman.aper_base_kaddr, sz);

            printk(KERN_ALERT "DEBUG: Passed %s %d \n",__FUNCTION__,__LINE__);
            msleep(500);

Also switching memcpy_fromio_pcie() to readb and readl made no difference. Maybe a bug in that translation code?

geerlingguy commented 2 years ago

It looks like the memcpy_fromio() function was not always the way it is... https://patchwork.kernel.org/project/linux-arm-kernel/patch/1406701706-12808-1-git-send-email-joonwoop@codeaurora.org/

Coreforge commented 2 years ago

The regular memcpy_io functions don't have memory barriers, but maybe adding some might help. It should at least get past the memcpy if the functions work correctly.

DanielMazurkiewicz commented 2 years ago

https://youtu.be/LO7Ip9VbOLY?t=697 <- look here

My comment might be completely unrelated (I don't have an RPi and I'm not a kernel dev), but expecting to communicate over IO that isn't working properly sounds to me like an issue in the first place. :-)

geerlingguy commented 2 years ago

Over on the AMD driver issue:

Make sure the PCIe support on the rpi is spec compliant. The spec and the GPU driver require cache coherence with the host CPU. A lot of ARM platforms tend to leave out the relevant IP required for this.

geerlingguy commented 2 years ago

Currently watching each byte get copied to see where it fails:

byte-count

I've decided to try reverting the patch linked a few comments above, back to the old un-optimized memcpy_fromio() behavior, just to see what will happen. It's gonna be a while, but I figure I can do some other things while it's sitting there copying bits and bytes :D

Coreforge commented 2 years ago

Cache coherence is definitely an issue, as the pi doesn't have it. It should be possible though (hopefully) to do it like with the radeon driver and set a flag to mark all buffers as uncached.

geerlingguy commented 2 years ago

@Coreforge - Would the following be enough to disable write combining, you think (in amdgpu_object.c)?

bool amdgpu_bo_support_uswc(u64 bo_flags)
{

// Raspberry Pi doesn't like write combining.
#if defined(CONFIG_ARM64) && defined(CONFIG_ARCH_BCM2835)
    return false;
#endif

Also, after a few attempts (and only printing the count on multiples of 10,000 so it's a lot faster), it seems like every time it tries reading the first byte on that call inside psp_v11_0_memory_training(), it fails, no matter how small a range I try.

Edit: Tried above code and it doesn't seem to make a difference. At this point I'm just going to disable memory training and see what happens next :D

geerlingguy commented 2 years ago

Disabling memory training gets me to:

[   71.321423] [drm] use_doorbell being set to: [true]
[   71.321574] [drm] use_doorbell being set to: [true]
[   71.335710] [drm] Found VCN firmware Version ENC: 1.13 DEC: 2 VEP: 0 Revision: 42
[   71.335771] [drm] PSP loading VCN firmware

(That's hit within vcn_v3_0_sw_init. Not sure where it dies next, will have to pause debugging for now.)

Coreforge commented 2 years ago

The amdgpu driver doesn't pass flags to the bo_create function like radeon does, so it's more difficult to disable write combining and set them as uncached. It might be possible to do it within amdgpu_ttm though, I haven't looked at that yet.

Coreforge commented 2 years ago

Adding caching = ttm_uncached; here should do the same as always forcing the RADEON_GEM_GTT_UC flag in the radeon driver. Write combining would also be disabled that way, if I understand it correctly.

geerlingguy commented 2 years ago

Interesting; with that inserted, I'm getting:

[   85.056663] [drm] Detected VRAM RAM=12272M, BAR=256M
[   85.056668] [drm] RAM width 192bits GDDR6
[   85.056776] [drm] amdgpu: 12272M of VRAM memory ready
[   85.056784] [drm] amdgpu: 2845M of GTT memory ready.
[   85.056804] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   85.056944] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[   85.056987] Unable to handle kernel paging request at virtual address ffffffc011c0a000
[   85.057004] Mem abort info:
[   85.057010]   ESR = 0x96000061
[   85.057018]   EC = 0x25: DABT (current EL), IL = 32 bits
[   85.057027]   SET = 0, FnV = 0
[   85.057034]   EA = 0, S1PTW = 0
[   85.057040]   FSC = 0x21: alignment fault
[   85.057048] Data abort info:
[   85.057054]   ISV = 0, ISS = 0x00000061
[   85.057061]   CM = 0, WnR = 1
[   85.057068] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000001135000
[   85.057078] [ffffffc011c0a000] pgd=10000000fbfff003, p4d=10000000fbfff003, pud=10000000fbfff003, pmd=1000000040a35003, pte=006800004732470f
[   85.057110] Internal error: Oops: 96000061 [#1] PREEMPT SMP
[   85.057118] Modules linked in: amdgpu(+) drm_ttm_helper ttm i2c_algo_bit bnep hci_uart btbcm bluetooth ecdh_generic ecc 8021q garp stp llc snd_soc_hdmi_codec brcmfmac brcmutil vc4 v3d cec cfg80211 gpu_sched drm_kms_helper bcm2835_codec(C) bcm2835_isp(C) rfkill bcm2835_v4l2(C) v4l2_mem2mem bcm2835_mmal_vchiq(C) raspberrypi_hwmon snd_soc_core videobuf2_vmalloc videobuf2_dma_contig snd_bcm2835(C) videobuf2_memops snd_compress videobuf2_v4l2 drm snd_pcm_dmaengine snd_pcm drm_panel_orientation_quirks i2c_brcmstb videobuf2_common dwc2 syscopyarea snd_timer sysfillrect sysimgblt fb_sys_fops roles snd backlight videodev mc vc_sm_cma(C) rpivid_mem nvmem_rmem uio_pdrv_genirq uio i2c_dev aes_neon_bs sha256_generic aes_neon_blk crypto_simd cryptd ip_tables x_tables ipv6
[   85.057296] CPU: 1 PID: 674 Comm: modprobe Tainted: G         C        5.14.2-v8+ #1
[   85.057307] Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
[   85.057314] pstate: 40000005 (nZcv daif -PAN -UAO -TCO BTYPE=--)
[   85.057323] pc : __memset+0x16c/0x188
[   85.057337] lr : amdgpu_device_init+0x143c/0x1ba0 [amdgpu]
[   85.057786] sp : ffffffc011ed37f0
[   85.057791] x29: ffffffc011ed37f0 x28: 0000000000000001 x27: ffffff804c715518
[   85.057805] x26: ffffff804c700000 x25: ffffff804c708000 x24: ffffff804c704000
[   85.057818] x23: ffffff804c714000 x22: ffffffc0096303a0 x21: ffffffc0112f8948
[   85.057830] x20: ffffff804c710000 x19: ffffff80411c8800 x18: 0000000000000010
[   85.057842] x17: 00000000000015d4 x16: 00000000000015dc x15: ffffffffffffffff
[   85.057855] x14: 0000000000000001 x13: 2e29303030303030 x12: ffffff8041abe880
[   85.057867] x11: ffffff8042f08510 x10: fffffffe00000000 x9 : 0000000000000000
[   85.057879] x8 : ffffffc011c0a000 x7 : 0000000000000000 x6 : 000000000000003f
[   85.057890] x5 : 0000000000000040 x4 : 0000000000000000 x3 : 0000000000000004
[   85.057902] x2 : 0000000000001fc0 x1 : 0000000000000000 x0 : ffffffc011c0a000
[   85.057915] Call trace:
[   85.057920]  __memset+0x16c/0x188
[   85.057929]  amdgpu_driver_load_kms+0x30/0x2b8 [amdgpu]
[   85.058216]  amdgpu_pci_probe+0xe4/0x1b0 [amdgpu]
[   85.058500]  pci_device_probe+0xc0/0x190
[   85.058514]  really_probe+0xb8/0x318
[   85.058524]  __driver_probe_device+0x80/0xe8
[   85.058531]  driver_probe_device+0x88/0x118
[   85.058539]  __driver_attach+0x78/0x110
[   85.058547]  bus_for_each_dev+0x7c/0xd0
[   85.058554]  driver_attach+0x2c/0x38
[   85.058562]  bus_add_driver+0x194/0x1f8
[   85.058569]  driver_register+0x6c/0x128
[   85.058577]  __pci_register_driver+0x4c/0x58
[   85.058585]  amdgpu_init+0x64/0x1000 [amdgpu]
[   85.058870]  do_one_initcall+0x54/0x2c0
[   85.058880]  do_init_module+0x60/0x248
[   85.058889]  load_module+0x2208/0x2758
[   85.058896]  __do_sys_finit_module+0xbc/0xf8
[   85.058904]  __arm64_sys_finit_module+0x28/0x38
[   85.058912]  invoke_syscall+0x4c/0x110
[   85.058921]  el0_svc_common+0x100/0x128
[   85.058928]  do_el0_svc+0x30/0x98
[   85.058936]  el0_svc+0x24/0x38
[   85.058944]  el0t_64_sync_handler+0x90/0xb8
[   85.058951]  el0t_64_sync+0x178/0x17c
[   85.058961] Code: 91010108 54ffff4a 8b040108 cb050042 (d50b7428) 
[   85.058971] ---[ end trace 9be26052834e9772 ]---

I've tried three times, getting the exact same stack trace each time, just with different memory addresses.
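
For anyone following along, the fields in that "Mem abort info" block all come out of the single ESR value. Here is a small userspace sketch of the decoding, using the bit positions for a data abort on AArch64 (the struct and function names are just for illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Fields of ESR_ELx that matter for a data abort. */
struct esr_fields {
    unsigned ec;   /* exception class, bits [31:26] */
    unsigned il;   /* instruction length, bit 25 (1 = 32-bit instruction) */
    unsigned iss;  /* instruction specific syndrome, bits [24:0] */
    unsigned wnr;  /* write-not-read, ISS bit 6 (1 = the access was a write) */
    unsigned dfsc; /* data fault status code, ISS bits [5:0] */
};

struct esr_fields decode_esr(uint32_t esr)
{
    struct esr_fields f;

    f.ec = (esr >> 26) & 0x3f;
    f.il = (esr >> 25) & 0x1;
    f.iss = esr & 0x1ffffff;
    f.wnr = (f.iss >> 6) & 0x1;
    f.dfsc = f.iss & 0x3f;
    return f;
}

void print_esr(uint32_t esr)
{
    struct esr_fields f = decode_esr(esr);

    printf("EC = 0x%02x, IL = %u, ISS = 0x%08x, WnR = %u, FSC = 0x%02x\n",
           f.ec, f.il, f.iss, f.wnr, f.dfsc);
}
```

Feeding it 0x96000061 gives EC 0x25, WnR 1 and FSC 0x21, i.e. a write that took an alignment fault, matching the log. The bracketed word in the Code: line, d50b7428, also looks like the encoding of dc zva, x8, and x8 holds the faulting address, so this would fit the dc zva behaviour of memset that has come up before.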

geerlingguy commented 2 years ago

Looks like we're failing inside:

            r = amdgpu_device_wb_init(adev);
            if (r) {
                DRM_ERROR("amdgpu_device_wb_init failed %d\n", r);
                goto init_failed;
            }
            adev->ip_blocks[i].status.hw = true;
Coreforge commented 2 years ago

Try changing this https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L1083 memset to a memset_io_pcie (is this the same one elFarto found with the 550?). I'm just guessing it could be this one, but it's a memset, and it's after a hw_init which can be gart init, which would produce the gart message. It's also something I had to do on radeon (didn't change it earlier because I just looked for memset_io and not memset, which is used quite often).

geerlingguy commented 2 years ago

Testing that now.

Coreforge commented 2 years ago

It should account for unaligned access which I'm not sure if the default memset does (I can't read arm assembler). There would still be the dc zva issue though with memset as elFarto noted in the other issue.
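
As a rough illustration of what "accounting for unaligned access" can look like, here is a userspace sketch of a fill that only issues aligned 32-bit loads and stores. It is not the memset_io_pcie from the patch: fill_io_32() is a hypothetical name, and in a driver the accesses would go through readl()/writel() rather than plain volatile pointers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill n bytes at dst with value using only aligned 32-bit accesses.
 * Whole dwords are written directly; a partial dword at the start or
 * end is read, patched in a local copy, and written back.
 */
void fill_io_32(volatile void *dst, uint8_t value, size_t n)
{
    uintptr_t p = (uintptr_t)dst;
    uint32_t pattern = 0x01010101u * value; /* value repeated in all 4 bytes */

    while (n > 0) {
        uintptr_t word_addr = p & ~(uintptr_t)3; /* containing aligned dword */
        size_t offset = p - word_addr;           /* first byte to fill, 0..3 */
        size_t chunk = 4 - offset;
        volatile uint32_t *w = (volatile uint32_t *)word_addr;

        if (chunk > n)
            chunk = n;

        if (chunk == 4) {
            *w = pattern;                        /* full dword store */
        } else {
            uint32_t word = *w;                  /* read-modify-write */
            memset((uint8_t *)&word + offset, value, chunk);
            *w = word;
        }

        p += chunk;
        n -= chunk;
    }
}
```

Unlike the arm64 memset, nothing here can turn into a dc zva, so it sidesteps that issue at the cost of being slower.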

geerlingguy commented 2 years ago

@Coreforge - That gets me back to psp_v11_0_memory_training(), but let me disable memory training again and see what's next now.

Coreforge commented 2 years ago

Probably won't help much with that, but it's good to know disabling caching seems to work. I'll have to look at the memory training stuff.