Syllo / nvtop

GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm

parse_drm_fdinfo_intel Assertion error with multiple Intel GPUs #196

Closed jackyyf closed 1 year ago

jackyyf commented 1 year ago

When I tried to launch nvtop on my workstation, it immediately exited with this error message:

nvtop: ./src/extract_gpuinfo_intel.c:228: parse_drm_fdinfo_intel: Assertion `!cache_entry_check && "We should not be processing a client id twice per update"' failed.
Aborted

nvtop version: 3.0.1 (from debian bookworm (testing)) linux kernel version: 6.1.0-4-amd64 (6.1.11-1)

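I have a rather strange setup on my workstation with 3 GPUs: an Intel AlderLake-S GT1 (onboard), an NVIDIA GeForce RTX 2060, and an Intel DG2 Arc A380 (see the lspci output further down).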

I can confirm nvtop works perfectly when I unplug the "Intel DG2 A380" from my workstation, and it hits the assertion error whenever I plug it back in.

Please let me know if I could help by debugging this issue, as I understand this is a super strange setup. There are some AV1 encoding tasks and somehow A380 is the cheapest option for a hardware encoder :P

Syllo commented 1 year ago

Hello @jackyyf, This is strange since the client_id should be unique per drm instance. It may be that my assumption is wrong. Do you keep the integrated GPU active when the discrete Intel GPU is plugged in?
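For context, the client id comes from the per-file-descriptor DRM usage stats that the kernel exposes under /proc/<pid>/fdinfo/<fd>, as `key: value` lines such as `drm-client-id` and `drm-pdev`. Below is a minimal sketch of pulling those two fields out of such lines. The struct and function names are hypothetical and this is not nvtop's actual parser.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the two fdinfo fields relevant to this issue. */
struct drm_client_info {
    unsigned client_id;
    char pdev[32]; /* PCI address string, e.g. "0000:08:00.0" */
    bool has_client_id;
    bool has_pdev;
};

/* Parse one "key: value" line of fdinfo text; unrelated keys are ignored. */
static void parse_fdinfo_line(const char *line, struct drm_client_info *info) {
    assert(line != NULL && info != NULL);
    unsigned id;
    char pdev[32];
    /* A space in the sscanf format matches any run of whitespace, tabs included. */
    if (sscanf(line, "drm-client-id: %u", &id) == 1) {
        info->client_id = id;
        info->has_client_id = true;
    } else if (sscanf(line, "drm-pdev: %31s", pdev) == 1) {
        strcpy(info->pdev, pdev); /* safe: both buffers are 32 bytes */
        info->has_pdev = true;
    }
}
```

Feeding it the lines of one fdinfo file yields the (pdev, client id) pair for that open DRM file. Running it over the fdinfo of processes on each Intel GPU would show whether the same client id appears under two different `drm-pdev` values.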

jackyyf commented 1 year ago

Yes, in terms of PCIe devices. No display cable is plugged in to either Intel GPU (onboard or discrete), as I mainly use them for transcoding tasks. The DG2 has slightly different transcoding capabilities compared to the iGPU, so I enabled both and instruct ffmpeg to select the correct device.

I'll add some more information with lspci and /dev/dri information in case they help:

# lspci | grep -E 'VGA|Display'
00:02.0 Display controller: Intel Corporation AlderLake-S GT1 (rev 0c)
01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 Rev. A] (rev a1)
08:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A380] (rev 05)

# lspci -vv -s 00:02.0
00:02.0 Display controller: Intel Corporation AlderLake-S GT1 (rev 0c)
    DeviceName: Onboard - Video
    Subsystem: Micro-Star International Co., Ltd. [MSI] AlderLake-S GT1
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 186
    IOMMU group: 0
    Region 0: Memory at 6223000000 (64-bit, non-prefetchable) [size=16M]
    Region 2: Memory at 4000000000 (64-bit, prefetchable) [size=256M]
    Region 4: I/O ports at 5000 [size=64]
    Capabilities: [40] Vendor Specific Information: Len=0c <?>
    Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00
        DevCap: MaxPayload 128 bytes, PhantFunc 0
            ExtTag- RBE+ FLReset+
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
             10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
    Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit-
        Address: fee00018  Data: 0000
        Masking: 00000000  Pending: 00000000
    Capabilities: [d0] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D3 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100 v1] Process Address Space ID (PASID)
        PASIDCap: Exec- Priv-, Max PASID Width: 14
        PASIDCtl: Enable- Exec- Priv-
    Capabilities: [200 v1] Address Translation Service (ATS)
        ATSCap: Invalidate Queue Depth: 00
        ATSCtl: Enable+, Smallest Translation Unit: 00
    Capabilities: [300 v1] Page Request Interface (PRI)
        PRICtl: Enable- Reset-
        PRISta: RF- UPRGI- Stopped+
        Page Request Capacity: 00008000, Page Request Allocation: 00000000
    Capabilities: [320 v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap: Migration- 10BitTagReq- Interrupt Message Number: 000
        IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy- 10BitTagReq-
        IOVSta: Migration-
        Initial VFs: 7, Total VFs: 7, Number of VFs: 0, Function Dependency Link: 00
        VF offset: 1, stride: 1, Device ID: 4680
        Supported Page Size: 00000553, System Page Size: 00000001
        Region 0: Memory at 0000004010000000 (64-bit, non-prefetchable)
        Region 2: Memory at 0000004020000000 (64-bit, prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Kernel driver in use: i915
    Kernel modules: i915

# lspci -vv -s 01:00.0
01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 Rev. A] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: Gigabyte Technology Co., Ltd TU106 [GeForce RTX 2060 Rev. A]
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 210
    IOMMU group: 15
    Region 0: Memory at 52000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at 6210000000 (64-bit, prefetchable) [size=256M]
    Region 3: Memory at 6220000000 (64-bit, prefetchable) [size=32M]
    Region 5: I/O ports at 4000 [size=128]
    Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
    Capabilities: [60] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee00cf8  Data: 0000
    Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s (downgraded), Width x16
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
             10BitTagComp- 10BitTagReq- OBFF Via message, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
             EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [100 v1] Virtual Channel
        Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb:    Fixed- WRR32- WRR64- WRR128-
        Ctrl:   ArbSelect=Fixed
        Status: InProgress-
        VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
            Status: NegoPending- InProgress-
    Capabilities: [250 v1] Latency Tolerance Reporting
        Max snoop latency: 34326183936ns
        Max no snoop latency: 34326183936ns
    Capabilities: [258 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
              PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
               T_CommonMode=0us LTR1.2_Threshold=281600ns
        L1SubCtl2: T_PwrOn=10us
    Capabilities: [128 v1] Power Budgeting <?>
    Capabilities: [420 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [bb0 v1] Physical Resizable BAR
        BAR 0: current size: 16MB, supported: 16MB
        BAR 1: current size: 256MB, supported: 64MB 128MB 256MB
        BAR 3: current size: 32MB, supported: 32MB
    Kernel driver in use: nvidia
    Kernel modules: nvidia

# lspci -vv -s 08:00.0
08:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A380] (rev 05) (prog-if 00 [VGA controller])
    Subsystem: Shenzhen Gunnir Technology Development Co., Ltd DG2 [Arc A380]
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin ? routed to IRQ 187
    IOMMU group: 22
    Region 0: Memory at 50000000 (64-bit, non-prefetchable) [size=16M]
    Region 2: Memory at 6000000000 (64-bit, prefetchable) [size=8G]
    Expansion ROM at 51000000 [disabled] [size=2M]
    Capabilities: [40] Vendor Specific Information: Len=0c <?>
    Capabilities: [70] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x1
            TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+
             10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
             EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
        Address: 00000000fee00a58  Data: 0000
        Masking: 00000000  Pending: 00000000
    Capabilities: [d0] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 0
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [420 v1] Physical Resizable BAR
        BAR 2: current size: 8GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB
    Capabilities: [400 v1] Latency Tolerance Reporting
        Max snoop latency: 3145728ns
        Max no snoop latency: 3145728ns
    Kernel driver in use: i915
    Kernel modules: i915

# tree -a /dev/dri
/dev/dri
├── by-path
│   ├── pci-0000:00:02.0-card -> ../card0
│   ├── pci-0000:00:02.0-render -> ../renderD128
│   ├── pci-0000:01:00.0-card -> ../card2
│   ├── pci-0000:01:00.0-render -> ../renderD130
│   ├── pci-0000:08:00.0-card -> ../card1
│   └── pci-0000:08:00.0-render -> ../renderD129
├── card0
├── card1
├── card2
├── renderD128
├── renderD129
└── renderD130

2 directories, 12 files

Syllo commented 1 year ago

All right, I think I found the issue. According to the documentation for client-id, "Uniqueness of the value shall be either globally unique, or unique within the scope of each device, in which case drm-pdev shall be present as well." The code assumes that the client-id is globally unique, and from the error you are facing with Intel GPUs, the same client-id seems to exist for different GPUs.
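To illustrate the idea (a sketch only, not the actual patch): if the client id is only unique per device, the "already seen this update" check has to key on the pair (client id, drm-pdev) rather than on the client id alone. All names below are hypothetical, and a fixed-size array stands in for whatever cache structure nvtop really uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_CLIENTS 64 /* arbitrary capacity for this sketch */
#define PDEV_LEN 16    /* room for "0000:08:00.0" plus the terminator */

/* Hypothetical composite key: client id plus the PCI address from drm-pdev. */
struct client_key {
    unsigned client_id;
    char pdev[PDEV_LEN];
};

struct client_cache {
    struct client_key seen[MAX_CLIENTS];
    size_t count;
};

/* Record (client_id, pdev) for the current update.
 * Returns true if the pair was newly recorded, false if it was already
 * present (or if the cache is full). */
static bool cache_mark_seen(struct client_cache *cache, unsigned client_id,
                            const char *pdev) {
    assert(cache != NULL && pdev != NULL);
    for (size_t i = 0; i < cache->count; i++) {
        const struct client_key *k = &cache->seen[i];
        /* Both fields must match: the same id on another device is a different client. */
        if (k->client_id == client_id &&
            strncmp(k->pdev, pdev, PDEV_LEN - 1) == 0)
            return false;
    }
    if (cache->count == MAX_CLIENTS)
        return false;
    struct client_key *slot = &cache->seen[cache->count++];
    slot->client_id = client_id;
    strncpy(slot->pdev, pdev, PDEV_LEN - 1);
    slot->pdev[PDEV_LEN - 1] = '\0';
    return true;
}
```

With this keying, client id 5 on 0000:00:02.0 and client id 5 on 0000:08:00.0 are two distinct entries, so a duplicate check like the one in the assertion would only fire when the same client is processed twice on the same device.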

I should be able to do a patch for that this weekend. Hopefully that's what is affecting you.

jackyyf commented 1 year ago

Thanks! I'll be super happy to help in case you want me to test before commit or release :)

Syllo commented 1 year ago

Could you please try the change I made to the intel_multigpu_id_fix branch?

jackyyf commented 1 year ago

It works!

Syllo commented 1 year ago

Thanks, merged

jackyyf commented 9 months ago

Hi Syllo, sorry, I have to reopen this bug, as the hash part is not handled correctly. I've created PR #248 for this.