jackyyf closed this issue 1 year ago
Hello @jackyyf, This is strange since the client_id should be unique per drm instance. It may be that my assumption is wrong. Do you keep the integrated GPU active when the discrete Intel GPU is plugged in?
Yes, in terms of PCIe devices. No display cable is plugged in to either Intel GPU (onboard or discrete), as I mainly use them for transcoding tasks. DG2 has slightly different transcoding capabilities compared to the iGPU, so I enabled them both and instruct ffmpeg to select the correct device.
I'll add some more information with lspci and /dev/dri information in case they help:
# lspci | grep -E 'VGA|Display'
00:02.0 Display controller: Intel Corporation AlderLake-S GT1 (rev 0c)
01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 Rev. A] (rev a1)
08:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A380] (rev 05)
# lspci -vv -s 00:02.0
00:02.0 Display controller: Intel Corporation AlderLake-S GT1 (rev 0c)
DeviceName: Onboard - Video
Subsystem: Micro-Star International Co., Ltd. [MSI] AlderLake-S GT1
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 186
IOMMU group: 0
Region 0: Memory at 6223000000 (64-bit, non-prefetchable) [size=16M]
Region 2: Memory at 4000000000 (64-bit, prefetchable) [size=256M]
Region 4: I/O ports at 5000 [size=64]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0
ExtTag- RBE+ FLReset+
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn-
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit-
Address: fee00018 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [d0] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D3 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Process Address Space ID (PASID)
PASIDCap: Exec- Priv-, Max PASID Width: 14
PASIDCtl: Enable- Exec- Priv-
Capabilities: [200 v1] Address Translation Service (ATS)
ATSCap: Invalidate Queue Depth: 00
ATSCtl: Enable+, Smallest Translation Unit: 00
Capabilities: [300 v1] Page Request Interface (PRI)
PRICtl: Enable- Reset-
PRISta: RF- UPRGI- Stopped+
Page Request Capacity: 00008000, Page Request Allocation: 00000000
Capabilities: [320 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration- 10BitTagReq- Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy- 10BitTagReq-
IOVSta: Migration-
Initial VFs: 7, Total VFs: 7, Number of VFs: 0, Function Dependency Link: 00
VF offset: 1, stride: 1, Device ID: 4680
Supported Page Size: 00000553, System Page Size: 00000001
Region 0: Memory at 0000004010000000 (64-bit, non-prefetchable)
Region 2: Memory at 0000004020000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Kernel driver in use: i915
Kernel modules: i915
# lspci -vv -s 01:00.0
01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 Rev. A] (rev a1) (prog-if 00 [VGA controller])
Subsystem: Gigabyte Technology Co., Ltd TU106 [GeForce RTX 2060 Rev. A]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 210
IOMMU group: 15
Region 0: Memory at 52000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at 6210000000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at 6220000000 (64-bit, prefetchable) [size=32M]
Region 5: I/O ports at 4000 [size=128]
Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00cf8 Data: 0000
Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s (downgraded), Width x16
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Via message, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Capabilities: [250 v1] Latency Tolerance Reporting
Max snoop latency: 34326183936ns
Max no snoop latency: 34326183936ns
Capabilities: [258 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=281600ns
L1SubCtl2: T_PwrOn=10us
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [bb0 v1] Physical Resizable BAR
BAR 0: current size: 16MB, supported: 16MB
BAR 1: current size: 256MB, supported: 64MB 128MB 256MB
BAR 3: current size: 32MB, supported: 32MB
Kernel driver in use: nvidia
Kernel modules: nvidia
# lspci -vv -s 08:00.0
08:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A380] (rev 05) (prog-if 00 [VGA controller])
Subsystem: Shenzhen Gunnir Technology Development Co., Ltd DG2 [Arc A380]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin ? routed to IRQ 187
IOMMU group: 22
Region 0: Memory at 50000000 (64-bit, non-prefetchable) [size=16M]
Region 2: Memory at 6000000000 (64-bit, prefetchable) [size=8G]
Expansion ROM at 51000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1
TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
Address: 00000000fee00a58 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [d0] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [420 v1] Physical Resizable BAR
BAR 2: current size: 8GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB
Capabilities: [400 v1] Latency Tolerance Reporting
Max snoop latency: 3145728ns
Max no snoop latency: 3145728ns
Kernel driver in use: i915
Kernel modules: i915
# tree -a /dev/dri
/dev/dri
├── by-path
│ ├── pci-0000:00:02.0-card -> ../card0
│ ├── pci-0000:00:02.0-render -> ../renderD128
│ ├── pci-0000:01:00.0-card -> ../card2
│ ├── pci-0000:01:00.0-render -> ../renderD130
│ ├── pci-0000:08:00.0-card -> ../card1
│ └── pci-0000:08:00.0-render -> ../renderD129
├── card0
├── card1
├── card2
├── renderD128
├── renderD129
└── renderD130
2 directories, 12 files
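As a side note, the by-path links above are what make it possible to address a GPU by its PCI address rather than by a card or render number. A minimal sketch of that lookup follows; `render_node_for` is a hypothetical helper written for illustration and is not part of nvtop or ffmpeg.

```python
import os


def render_node_for(pci_address, by_path_dir="/dev/dri/by-path"):
    """Return the resolved render node path for a PCI address, or None.

    Hypothetical helper: it only follows the pci-<address>-render symlink
    that udev creates under /dev/dri/by-path.
    """
    link = os.path.join(by_path_dir, f"pci-{pci_address}-render")
    if not os.path.islink(link):
        return None
    # realpath follows the relative "../renderDNNN" target of the symlink.
    return os.path.realpath(link)


if __name__ == "__main__":
    # PCI address of the Arc A380 taken from the lspci output above.
    print(render_node_for("0000:08:00.0"))
```

With the listing above, the link for 0000:08:00.0 points at ../renderD129, so that is the node one would hand to a tool that takes a DRM render device path.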
All right.
I think that I found the issue.
According to the documentation for client-id, "Uniqueness of the value shall be either globally unique, or unique within the scope of each device, in which case drm-pdev shall be present as well."
The code assumes that the client-id is globally unique, and judging from the error you are seeing with the Intel GPUs, the same client-id seems to exist on different GPUs.
I should be able to do a patch for that this weekend. Hopefully that's what is affecting you.
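To illustrate the idea (this is only a sketch in Python, not the actual nvtop change, which is C code not shown in this thread): if client ids are only unique per device, the lookup key has to include the device as well. The fdinfo samples below are made up for illustration, and `parse_fdinfo` and `client_key` are hypothetical names; only the `drm-driver`, `drm-pdev` and `drm-client-id` keys come from the kernel's DRM usage stats format.

```python
def parse_fdinfo(text):
    """Parse 'key: value' lines of a DRM fdinfo blob into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only: the drm-pdev value contains colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def client_key(fields):
    """Build a key that stays unique when client ids are per-device."""
    return (fields.get("drm-pdev"), int(fields["drm-client-id"]))


# Made-up samples: two GPUs on the same driver reporting the same client id.
IGPU = "drm-driver:\ti915\ndrm-pdev:\t0000:00:02.0\ndrm-client-id:\t7\n"
DGPU = "drm-driver:\ti915\ndrm-pdev:\t0000:08:00.0\ndrm-client-id:\t7\n"

seen = {}
for blob in (IGPU, DGPU):
    fields = parse_fdinfo(blob)
    seen[client_key(fields)] = fields["drm-driver"]

print(len(seen))
```

Keying on the client id alone would collapse these two entries into one, which is the kind of collision suspected here; keying on the (pdev, client id) pair keeps them apart.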
Thanks! I'll be super happy to help in case you want me to test before commit or release :)
Could you please try the change I made to the intel_multigpu_id_fix branch?
It works!
Thanks, merged
Hi Syllo, sorry I'll have to reopen this bug, as the hash part is not correctly handled. I've created PR #248 for this.
When I tried to launch nvtop on my workstation, it immediately exited with an error message:
nvtop version: 3.0.1 (from Debian bookworm (testing))
Linux kernel version: 6.1.0-4-amd64 (6.1.11-1)
I have a rather strange setup on my workstation with 3 GPUs:
I can confirm nvtop works perfectly when I unplug the "Intel DG2 A380" from my workstation, and it hits an assertion error whenever I plug it back in.
Please let me know if I could help by debugging this issue, as I understand this is a super strange setup. There are some AV1 encoding tasks and somehow A380 is the cheapest option for a hardware encoder :P