HeyMeco / Rockchip-pcie-devices

Public effort to document PCI-E Device support for Rockchip based (Single Board) Computers

Getting an Nvidia GPU working on ARM (aarch64) Rockchip RK3588 #2

Open HeyMeco opened 3 months ago

HeyMeco commented 3 months ago

Current Status

Kernel: 6.8.2-edge-rockchip-rk3588
Image: Armbian_community_24.5.0-trunk.306_Rock-5b_jammy_edge_6.8.2_gnome_desktop

Issues that need to be resolved:

Here are some of the first findings:

lspci -vvvv

0000:01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: NVIDIA Corporation GA102 [GeForce RTX 3090]
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 0
    Region 1: Memory at <unassigned> (64-bit, prefetchable) [disabled]
    Region 3: Memory at <unassigned> (64-bit, prefetchable) [disabled]
    Region 5: I/O ports at 100000 [virtual] [size=128]
    Capabilities: [60] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
             10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
             EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [b4] Vendor Specific Information: Len=14 <?>
    Capabilities: [100 v1] Virtual Channel
        Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb:    Fixed- WRR32- WRR64- WRR128-
        Ctrl:   ArbSelect=Fixed
        Status: InProgress-
        VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
            Status: NegoPending- InProgress-
    Capabilities: [250 v1] Latency Tolerance Reporting
        Max snoop latency: 0ns
        Max no snoop latency: 0ns
    Capabilities: [258 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
              PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
               T_CommonMode=0us LTR1.2_Threshold=0ns
        L1SubCtl2: T_PwrOn=10us
    Capabilities: [128 v1] Power Budgeting <?>
    Capabilities: [420 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: LaneErr at lane: 0
    Capabilities: [bb0 v1] Physical Resizable BAR
        BAR 0: current size: 16MB, supported: 16MB
        BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB
        BAR 3: current size: 32MB, supported: 32MB
    Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00 v1] Lane Margining at the Receiver <?>
    Capabilities: [e00 v1] Data Link Feature <?>

0000:01:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
    Subsystem: NVIDIA Corporation GA102 High Definition Audio Controller
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin B routed to IRQ 127
    Capabilities: [60] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [78] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
             10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
             EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [100 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [160 v1] Data Link Feature <?>
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel
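
To compare what the card advertises with what the link actually trained to, the LnkCap and LnkSta lines can be pulled out of the lspci text. This is a minimal sketch: the function name and regular expressions are my own, and they assume the `lspci -vv` line layout shown above.

```python
import re

def link_summary(lspci_text):
    """Return (cap_speed, cap_width, sta_speed, sta_width) for one device's lspci -vv text.

    Speeds are in GT/s, widths in lanes. Returns None if either line is missing.
    """
    cap = re.search(r"LnkCap:.*?Speed ([\d.]+)GT/s, Width x(\d+)", lspci_text)
    # LnkSta may carry a "(downgraded)" note after the speed.
    sta = re.search(r"LnkSta:\s*Speed ([\d.]+)GT/s(?: \(\w+\))?, Width x(\d+)", lspci_text)
    if not cap or not sta:
        return None
    return (float(cap.group(1)), int(cap.group(2)),
            float(sta.group(1)), int(sta.group(2)))

# Two lines copied from the 0000:01:00.0 output above.
sample = """\
        LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
        LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)
"""
summary = link_summary(sample)
print(summary)
```

For the sample this reports a 16 GT/s x16 capability against a 2.5 GT/s x4 link state, which is what lspci flags as "downgraded".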

iomem

0010f000-0010f0ff : 10f000.sram sram@10f000
00200000-efffffff : System RAM
  00410000-01c4ffff : Kernel code
  01c50000-0213ffff : reserved
  02140000-024cffff : Kernel data
  08300000-08328fff : reserved
  0a200000-0bd8bfff : reserved
  d4000000-efffffff : reserved
f0000000-f00fffff : a40000000.pcie config
f0200000-f0ffffff : pcie@fe150000
  f0200000-f020ffff : 0000:00:00.0
f2000000-f20fffff : a40800000.pcie config
f2200000-f2ffffff : pcie@fe170000
  f2200000-f220ffff : 0002:20:00.0
f4000000-f40fffff : a41000000.pcie config
f4200000-f4ffffff : pcie@fe190000
  f4200000-f42fffff : PCI Bus 0004:41
    f4200000-f420ffff : 0004:41:00.0
      f4200000-f420ffff : r8169
    f4210000-f4213fff : 0004:41:00.0
  f4300000-f430ffff : 0004:40:00.0
fc400000-fc407fff : usb@fc400000
  fc400000-fc407fff : xhci-hcd.1.auto usb@fc400000
fc40c100-fc7fffff : fc400000.usb usb@fc400000
fc800000-fc83ffff : fc800000.usb usb@fc800000
fc840000-fc87ffff : fc840000.usb usb@fc840000
fc880000-fc8bffff : fc880000.usb usb@fc880000
fc8c0000-fc8fffff : fc8c0000.usb usb@fc8c0000
fcd00000-fcd07fff : usb@fcd00000
  fcd00000-fcd07fff : xhci-hcd.0.auto usb@fcd00000
fcd0c100-fd0fffff : fcd00000.usb usb@fcd00000
fd600000-fd6fffff : fd600000.sram sram@fd600000
fd880000-fd880fff : fd880000.i2c i2c@fd880000
fd8a0000-fd8a00ff : fd8a0000.gpio gpio@fd8a0000
fd8b0010-fd8b001f : fd8b0010.pwm pwm@fd8b0010
fdc70000-fdc707ff : fdc70000.video-codec video-codec@fdc70000
fdd90000-fdd941ff : fdd90000.vop vop
fdd95000-fdd95fff : fdd90000.vop gamma-lut
fdd97e00-fdd97eff : fdd97e00.iommu iommu@fdd97e00
fdd97f00-fdd97fff : fdd97e00.iommu iommu@fdd97e00
fde80000-fde9ffff : fde80000.hdmi hdmi@fde80000
fe060000-fe06ffff : fe060000.dfi dfi@fe060000
fe150000-fe15ffff : a40000000.pcie apb
fe170000-fe17ffff : a40800000.pcie apb
fe190000-fe19ffff : a41000000.pcie apb
fe2b0000-fe2b3fff : fe2b0000.spi spi@fe2b0000
fe2c0000-fe2c3fff : fe2c0000.mmc mmc@fe2c0000
fe2d0000-fe2d3fff : fe2d0000.mmc mmc@fe2d0000
fe2e0000-fe2effff : fe2e0000.mmc mmc@fe2e0000
fe370000-fe371fff : fe370000.crypto crypto@fe370000
fe378000-fe3781ff : fe378000.rng rng@fe378000
fe470000-fe470fff : fe470000.i2s i2s@fe470000
fe600000-fe60ffff : GICD
fe680000-fe77ffff : GICR
fea10000-fea13fff : dma-controller@fea10000
  fea10000-fea13fff : fea10000.dma-controller dma-controller@fea10000
fea30000-fea33fff : dma-controller@fea30000
  fea30000-fea33fff : fea30000.dma-controller dma-controller@fea30000
feaf0000-feaf00ff : feaf0000.watchdog watchdog@feaf0000
feb20000-feb20fff : feb20000.spi spi@feb20000
feb50000-feb5001f : serial
feb90000-feb9001f : serial
fec00000-fec003ff : fec00000.tsadc tsadc@fec00000
fec10000-fec1ffff : fec10000.adc adc@fec10000
fec20000-fec200ff : fec20000.gpio gpio@fec20000
fec30000-fec300ff : fec30000.gpio gpio@fec30000
fec40000-fec400ff : fec40000.gpio gpio@fec40000
fec50000-fec500ff : fec50000.gpio gpio@fec50000
fec80000-fec80fff : fec80000.i2c i2c@fec80000
fec90000-fec90fff : fec90000.i2c i2c@fec90000
fecc0000-fecc03ff : fecc0000.efuse efuse@fecc0000
fed10000-fed13fff : dma-controller@fed10000
  fed10000-fed13fff : fed10000.dma-controller dma-controller@fed10000
fed60000-fed61fff : fed60000.phy phy@fed60000
fed90000-fed9ffff : fed90000.phy phy@fed90000
fee00000-fee000ff : fee00000.phy phy@fee00000
fee10000-fee100ff : fee10000.phy phy@fee10000
fee20000-fee200ff : fee20000.phy phy@fee20000
fee80000-fee9ffff : fee80000.phy phy@fee80000
ff001000-ff0effff : ff001000.sram sram@ff001000
100000000-1ffffffff : System RAM
2f0000000-2ffffffff : System RAM
  2f6140000-2fedfffff : reserved
  2fee07000-2fee07fff : reserved
  2fee08000-2feefffff : reserved
  2fef02000-2fef03fff : reserved
  2fef04000-2fef04fff : reserved
  2fef05000-2fef19fff : reserved
  2fef1a000-2fef1afff : reserved
  2fef1b000-2ffffffff : reserved
900000000-93fffffff : pcie@fe150000
  900000000-93fffffff : 0000:00:00.0
980000000-9bfffffff : pcie@fe170000
a00000000-a3fffffff : pcie@fe190000
a40000000-a403fffff : a40000000.pcie dbi
a40800000-a40bfffff : a40800000.pcie dbi
a41000000-a413fffff : a41000000.pcie dbi
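
The sizes of the PCIe windows are easier to judge in MiB than as hex ranges. Here is a rough sketch that converts a few of the lines above; the helper names are made up for illustration, and the only assumption is that /proc/iomem ranges are inclusive at both ends.

```python
def parse_iomem_line(line):
    """Split '<start>-<end> : <name>' into (start, end, name). Addresses are hex."""
    span, name = line.strip().split(" : ", 1)
    start, end = (int(part, 16) for part in span.split("-"))
    return start, end, name

def size_mib(start, end):
    # Both ends are inclusive, hence the +1.
    return (end - start + 1) / (1024 * 1024)

# Lines copied from the iomem listing above (controller at fe150000).
lines = [
    "f0200000-f0ffffff : pcie@fe150000",
    "900000000-93fffffff : pcie@fe150000",
    "  900000000-93fffffff : 0000:00:00.0",
]
windows = [(name, size_mib(start, end))
           for start, end, name in map(parse_iomem_line, lines)]
for name, mib in windows:
    print(f"{name}: {mib:g} MiB")
```

The window below 4 GiB works out to 14 MiB, which is smaller than the 16 MiB that lspci reports for BAR 0, and the 1024 MiB window at 0x900000000 is covered entirely by the child entry for 0000:00:00.0. Whether either of these explains the allocation problem is not something this listing alone can confirm.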

HeyMeco commented 3 months ago

ubuntu-drivers devices

== /sys/devices/platform/a40000000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0 ==
modalias : pci:v000010DEd00002204sv000010DEsd0000147Dbc03sc00i00
vendor   : NVIDIA Corporation
model    : GA102 [GeForce RTX 3090]
driver   : nvidia-driver-525-server - distro non-free
driver   : nvidia-driver-535-server-open - distro non-free
driver   : nvidia-driver-525 - distro non-free
driver   : nvidia-driver-535-open - distro non-free
driver   : nvidia-driver-535 - distro non-free recommended
driver   : nvidia-driver-550-server-open - distro non-free
driver   : nvidia-driver-545-open - distro non-free
driver   : nvidia-driver-545 - distro non-free
driver   : nvidia-driver-550-open - third-party non-free
driver   : nvidia-driver-525-open - distro non-free
driver   : nvidia-driver-550-server - distro non-free
driver   : nvidia-driver-535-server - distro non-free
driver   : nvidia-driver-550 - third-party non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
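
The modalias string packs the PCI IDs into fixed-width hex fields, so it can be decoded to cross-check against lspci and dmesg. This is a small sketch; the regular expression and function name are my own, and it only handles the `pci:` form shown above.

```python
import re

# Field layout of a PCI modalias: vendor, device, subsystem vendor/device,
# base class, subclass, programming interface (all uppercase hex).
MODALIAS_RE = re.compile(
    r"pci:v(?P<vendor>[0-9A-F]{8})d(?P<device>[0-9A-F]{8})"
    r"sv(?P<subvendor>[0-9A-F]{8})sd(?P<subdevice>[0-9A-F]{8})"
    r"bc(?P<base_class>[0-9A-F]{2})sc(?P<sub_class>[0-9A-F]{2})"
    r"i(?P<prog_if>[0-9A-F]{2})"
)

def decode_modalias(modalias):
    """Return the modalias fields as a dict of integers."""
    match = MODALIAS_RE.fullmatch(modalias)
    if match is None:
        raise ValueError(f"not a PCI modalias: {modalias!r}")
    return {key: int(value, 16) for key, value in match.groupdict().items()}

ids = decode_modalias("pci:v000010DEd00002204sv000010DEsd0000147Dbc03sc00i00")
print({key: hex(value) for key, value in ids.items()})
```

The decoded vendor and device (0x10de, 0x2204) and the display class (base class 0x03, subclass 0x00) agree with the lspci header for 0000:01:00.0.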

HeyMeco commented 3 months ago

dmesg

dmesg | grep 0000:01:00.0
[    2.819204] pci 0000:01:00.0: [10de:2204] type 00 class 0x030000 PCIe Legacy Endpoint
[    2.819246] pci 0000:01:00.0: BAR 0 [mem 0x00000000-0x00ffffff]
[    2.819280] pci 0000:01:00.0: BAR 1 [mem 0x00000000-0x0fffffff 64bit pref]
[    2.819313] pci 0000:01:00.0: BAR 3 [mem 0x00000000-0x01ffffff 64bit pref]
[    2.819334] pci 0000:01:00.0: BAR 5 [io  0x0000-0x007f]
[    2.819355] pci 0000:01:00.0: ROM [mem 0x00000000-0x0007ffff pref]
[    2.819610] pci 0000:01:00.0: PME# supported from D0 D3hot
[    2.819960] pci 0000:01:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:00:00.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[    2.820325] pci 0000:01:00.0: vgaarb: setting as boot VGA device
[    2.820330] pci 0000:01:00.0: vgaarb: bridge control possible
[    2.820334] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    2.831115] pci 0000:01:00.0: BAR 1 [mem size 0x10000000 64bit pref]: can't assign; no space
[    2.831120] pci 0000:01:00.0: BAR 1 [mem size 0x10000000 64bit pref]: failed to assign
[    2.831126] pci 0000:01:00.0: BAR 3 [mem size 0x02000000 64bit pref]: can't assign; no space
[    2.831131] pci 0000:01:00.0: BAR 3 [mem size 0x02000000 64bit pref]: failed to assign
[    2.831136] pci 0000:01:00.0: BAR 0 [mem size 0x01000000]: can't assign; no space
[    2.831141] pci 0000:01:00.0: BAR 0 [mem size 0x01000000]: failed to assign
[    2.831145] pci 0000:01:00.0: ROM [mem size 0x00080000 pref]: can't assign; no space
[    2.831150] pci 0000:01:00.0: ROM [mem size 0x00080000 pref]: failed to assign
[    2.831165] pci 0000:01:00.0: BAR 5 [io  0x100000-0x10007f]: assigned
[    2.833528] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
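
The bandwidth figures and BAR sizes in this log can be reproduced with a bit of arithmetic. The sketch below assumes 128b/130b line encoding at 8 GT/s and above, 8b/10b below that, and that each lane is rounded down to a whole Mb/s before multiplying by the width. The rounding step is my assumption; it is what makes the numbers match the log. All names are illustrative.

```python
def lane_mbps(gt_per_s):
    """Usable Mb/s on one lane after line encoding, rounded down."""
    raw = int(gt_per_s * 1000)
    if gt_per_s >= 8:
        return raw * 128 // 130   # 128b/130b encoding
    return raw * 8 // 10          # 8b/10b encoding

def link_gbps(gt_per_s, width):
    return lane_mbps(gt_per_s) * width / 1000

available = link_gbps(8, 4)    # the x4 link at 0000:00:00.0
capable = link_gbps(16, 16)    # what the card could do on a x16 Gen4 slot
print(available, capable)

# BAR and ROM sizes from the "can't assign" lines, converted to MiB.
bar_sizes = {"BAR 0": 0x01000000, "BAR 1": 0x10000000,
             "BAR 3": 0x02000000, "ROM": 0x00080000}
bar_mib = {name: size / (1024 * 1024) for name, size in bar_sizes.items()}
total_mib = sum(bar_mib.values())
print(bar_mib, total_mib)
```

This gives 31.504 and 252.048 Gb/s, matching the dmesg line, and a total of 304.5 MiB of memory space that the kernel tried and failed to place.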

HeyMeco commented 3 months ago

After a patch from @mariobalanica, the memory addresses are now assigned, but the driver isn't quite working yet. I do think that's fixable.

dmesg | grep -i 0000:01:00.0
[    2.808753] pci 0000:01:00.0: [10de:2204] type 00 class 0x030000 PCIe Legacy Endpoint
[    2.808795] pci 0000:01:00.0: BAR 0 [mem 0x00000000-0x00ffffff]
[    2.808829] pci 0000:01:00.0: BAR 1 [mem 0x00000000-0x0fffffff 64bit pref]
[    2.808863] pci 0000:01:00.0: BAR 3 [mem 0x00000000-0x01ffffff 64bit pref]
[    2.808884] pci 0000:01:00.0: BAR 5 [io  0x0000-0x007f]
[    2.808904] pci 0000:01:00.0: ROM [mem 0x00000000-0x0007ffff pref]
[    2.809169] pci 0000:01:00.0: PME# supported from D0 D3hot
[    2.809520] pci 0000:01:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:00:00.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[    2.809868] pci 0000:01:00.0: vgaarb: setting as boot VGA device
[    2.809873] pci 0000:01:00.0: vgaarb: bridge control possible
[    2.809877] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    2.817735] pci 0000:01:00.0: BAR 1 [mem 0x900000000-0x90fffffff 64bit pref]: assigned
[    2.817763] pci 0000:01:00.0: BAR 3 [mem 0x910000000-0x911ffffff 64bit pref]: assigned
[    2.817791] pci 0000:01:00.0: BAR 0 [mem 0x918000000-0x918ffffff]: assigned
[    2.817803] pci 0000:01:00.0: ROM [mem 0x919000000-0x91907ffff pref]: assigned
[    2.817821] pci 0000:01:00.0: BAR 5 [io  0x100000-0x10007f]: assigned
[    2.820089] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
[    6.207738] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
[    6.207790] nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[    6.322937] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[   18.164815] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0x65:1589)
[   18.164992] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   22.491615] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0x65:1589)
[   22.491797] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   33.703174] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[   33.703413] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   34.034308] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[   34.034559] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   34.725057] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[   34.725143] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   35.042806] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[   35.043036] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   35.473607] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[   35.473766] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   35.805467] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[   35.805559] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   36.358124] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[   36.358360] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   36.680259] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[   36.680533] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   37.183943] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[   37.184094] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   37.503979] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[   37.504145] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   92.174852] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[   92.175095] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
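
Since the RmInitAdapter line repeats with slightly different codes, it helps to tally them instead of reading the log line by line. This is a rough sketch with a regular expression of my own. It treats the three values in parentheses as opaque fields, because their meaning isn't documented anywhere in this thread.

```python
import re
from collections import Counter

RM_FAIL_RE = re.compile(
    r"NVRM: GPU (\S+): RmInitAdapter failed! "
    r"\((0x[0-9a-f]+):(0x[0-9a-f]+):(\d+)\)"
)

def count_rm_failures(log_text):
    """Count RmInitAdapter failures per (field1, field2, field3) tuple."""
    return Counter(m.group(2, 3, 4) for m in RM_FAIL_RE.finditer(log_text))

# Three lines copied from the log above.
sample = """\
[   18.164815] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0x65:1589)
[   33.703174] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
[   34.034308] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
"""
counts = count_rm_failures(sample)
for code, n in counts.most_common():
    print(":".join(code), n)
```

Run over the full log above, this would show two failures with 0x25:0x65:1589 followed by eleven with 0x25:0xffff:1589.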

HeyMeco commented 3 months ago

Nouveau is looking better:

0000:01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: NVIDIA Corporation GA102 [GeForce RTX 3090]
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 108
    Region 0: Memory at 918000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at 900000000 (64-bit, prefetchable) [size=256M]
    Region 3: Memory at 910000000 (64-bit, prefetchable) [size=32M]
    Region 5: I/O ports at 100000 [size=128]
    Expansion ROM at 919000000 [virtual] [disabled] [size=512K]
    Capabilities: [60] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fe670040  Data: 0000
    Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s (downgraded), Width x4 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
             10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
             EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [b4] Vendor Specific Information: Len=14 <?>
    Capabilities: [100 v1] Virtual Channel
        Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb:    Fixed- WRR32- WRR64- WRR128-
        Ctrl:   ArbSelect=Fixed
        Status: InProgress-
        VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
            Status: NegoPending- InProgress-
    Capabilities: [250 v1] Latency Tolerance Reporting
        Max snoop latency: 0ns
        Max no snoop latency: 0ns
    Capabilities: [258 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
              PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
               T_CommonMode=0us LTR1.2_Threshold=271360ns
        L1SubCtl2: T_PwrOn=10us
    Capabilities: [128 v1] Power Budgeting <?>
    Capabilities: [420 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [bb0 v1] Physical Resizable BAR
        BAR 0: current size: 16MB, supported: 16MB
        BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB
        BAR 3: current size: 32MB, supported: 32MB
    Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00 v1] Lane Margining at the Receiver <?>
    Capabilities: [e00 v1] Data Link Feature <?>
    Kernel driver in use: nouveau
    Kernel modules: nouveau

dmesg | grep -i "pci 0000:01:00.0"

[    2.931919] pci 0000:01:00.0: [10de:2204] type 00 class 0x030000 PCIe Legacy Endpoint
[    2.931961] pci 0000:01:00.0: BAR 0 [mem 0x00000000-0x00ffffff]
[    2.931996] pci 0000:01:00.0: BAR 1 [mem 0x00000000-0x0fffffff 64bit pref]
[    2.932030] pci 0000:01:00.0: BAR 3 [mem 0x00000000-0x01ffffff 64bit pref]
[    2.932050] pci 0000:01:00.0: BAR 5 [io  0x0000-0x007f]
[    2.932071] pci 0000:01:00.0: ROM [mem 0x00000000-0x0007ffff pref]
[    2.932328] pci 0000:01:00.0: PME# supported from D0 D3hot
[    2.932681] pci 0000:01:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:00:00.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[    2.933039] pci 0000:01:00.0: vgaarb: setting as boot VGA device
[    2.933044] pci 0000:01:00.0: vgaarb: bridge control possible
[    2.933048] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    2.940877] pci 0000:01:00.0: BAR 1 [mem 0x900000000-0x90fffffff 64bit pref]: assigned
[    2.940906] pci 0000:01:00.0: BAR 3 [mem 0x910000000-0x911ffffff 64bit pref]: assigned
[    2.940934] pci 0000:01:00.0: BAR 0 [mem 0x918000000-0x918ffffff]: assigned
[    2.940947] pci 0000:01:00.0: ROM [mem 0x919000000-0x91907ffff pref]: assigned
[    2.940964] pci 0000:01:00.0: BAR 5 [io  0x100000-0x10007f]: assigned

That is, until we also run into DRM errors:

dmesg | grep -i nouveau

[    2.955213] nouveau 0000:01:00.0: enabling device (0000 -> 0003)
[    2.955312] nouveau 0000:01:00.0: NVIDIA GA102 (b72000a1)
[    3.317492] nouveau 0000:01:00.0: bios: version 94.02.4b.00.0b
[    3.606146] nouveau 0000:01:00.0: bios: M0203E type 0a
[    3.606160] nouveau 0000:01:00.0: fb: 24576 MiB of unknown memory type
[    4.468637] nouveau 0000:01:00.0: DRM: VRAM: 24576 MiB
[    4.468672] nouveau 0000:01:00.0: DRM: GART: 536870912 MiB
[    4.468680] nouveau 0000:01:00.0: DRM: BIT table 'A' not found
[    4.468686] nouveau 0000:01:00.0: DRM: BIT table 'L' not found
[    4.468690] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
[    4.469967] nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
[    4.473861] [drm] Initialized nouveau 1.4.0 20120801 for 0000:01:00.0 on minor 1
[    6.610314] nouveau 0000:01:00.0: DRM: core notifier timeout
[    8.611900] nouveau 0000:01:00.0: DRM: core notifier timeout
[   10.612000] nouveau 0000:01:00.0: DRM: wndw-0: timeout
[   10.621007] nouveau 0000:01:00.0: [drm] fb0: nouveaudrmfb frame buffer device
[   11.433062] snd_hda_intel 0000:01:00.1: bound 0000:01:00.0 (ops nouveau_drm_exit [nouveau])
[   99.239911] nouveau 0000:01:00.0: Xwayland[1902]: failed to idle channel 2 [Xwayland[1902]]
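
The timeouts are easier to spot by looking at the gaps between consecutive dmesg timestamps. Here is a minimal sketch; the function name and the one-second threshold are arbitrary choices of mine, not anything nouveau defines.

```python
import re

TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

def stalls(lines, threshold=1.0):
    """Return (gap_seconds, message) for each line that arrives more than
    `threshold` seconds after the previous timestamped line."""
    found = []
    previous = None
    for line in lines:
        match = TS_RE.match(line)
        if match is None:
            continue  # skip lines without a timestamp
        stamp = float(match.group(1))
        if previous is not None and stamp - previous > threshold:
            found.append((round(stamp - previous, 3), match.group(2)))
        previous = stamp
    return found

# Lines copied from the nouveau log above.
log = [
    "[    4.473861] [drm] Initialized nouveau 1.4.0 20120801 for 0000:01:00.0 on minor 1",
    "[    6.610314] nouveau 0000:01:00.0: DRM: core notifier timeout",
    "[    8.611900] nouveau 0000:01:00.0: DRM: core notifier timeout",
    "[   10.612000] nouveau 0000:01:00.0: DRM: wndw-0: timeout",
    "[   10.621007] nouveau 0000:01:00.0: [drm] fb0: nouveaudrmfb frame buffer device",
]
gaps = stalls(log)
for gap, message in gaps:
    print(f"+{gap:.3f}s  {message}")
```

On these lines, each of the three timeout messages shows up roughly two seconds after the previous entry.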

HeyMeco commented 3 months ago

Back on the proprietary Nvidia Driver

We're now as far as the Raspberry Pi Community:

[    8.600582] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0x65:1468)
[    8.600674] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    8.600817] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[    8.601012] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device

Full output of dmesg | grep -i nv:

 nvidia: loading out-of-tree module taints kernel.
[    3.656114] nvidia: module license 'NVIDIA' taints kernel.
[    3.656125] nvidia: module license taints kernel.
[    3.686823] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[    3.689006] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
[    3.689047] nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[    4.034784] NVRM: loading NVIDIA UNIX aarch64 Kernel Module  535.161.07  Sat Feb 17 23:29:15 UTC 2024
[    4.044831] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.161.07  Sat Feb 17 22:42:09 UTC 2024
[    4.046235] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    4.055331] NVRM: Chipset not recognized (vendor ID 0x1d87, device ID 0x3588)
[    4.055338] The NVIDIA GPU driver for AArch64 has not been qualified on this platform
               environment.
[    6.195981] input: HDA NVidia HDMI/DP,pcm=3 as /devices/platform/a40000000.pcie/pci0000:00/0000:00:00.0/0000:01:00.1/sound/card1/input6
[    6.196066] input: HDA NVidia HDMI/DP,pcm=7 as /devices/platform/a40000000.pcie/pci0000:00/0000:00:00.0/0000:01:00.1/sound/card1/input7
[    6.196117] input: HDA NVidia HDMI/DP,pcm=8 as /devices/platform/a40000000.pcie/pci0000:00/0000:00:00.0/0000:01:00.1/sound/card1/input8
[    6.196172] input: HDA NVidia HDMI/DP,pcm=9 as /devices/platform/a40000000.pcie/pci0000:00/0000:00:00.0/0000:01:00.1/sound/card1/input9
[    8.600582] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0x65:1468)
[    8.600674] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[    8.600817] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[    8.601012] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[   11.114974] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[   11.120344] nvidia-uvm: Loaded the UVM driver, major device number 511.

serhii-nakon commented 2 months ago

@HeyMeco If you manage to get the NVIDIA driver running properly, could you check whether CUDA works, for example with NVIDIA's PyTorch NGC container for ML/AI workloads? I think it would be a really cool combination.

serhii-nakon commented 2 months ago

@HeyMeco I already have a question: does CUDA work in your current setup without DRM?

HeyMeco commented 2 months ago

@serhii-nakon

@HeyMeco I already have a question: does CUDA work in your current setup without DRM?

It doesn't.