GPUOpen-LibrariesAndSDKs / MxGPU-Virtualization

Fail to enable sriov, status = fffffffb #18

Closed markednmbr1 closed 5 years ago

markednmbr1 commented 5 years ago

Hi All,

I'm trying to set up an AMD FirePro S7150 in Proxmox 5.3-11 (which is an up-to-date Debian Stretch), and I'm getting the following error when running modprobe gim:

[Fri Mar 1 12:31:49 2019] gim info:(enable_sriov:299) Enable SRIOV
[Fri Mar 1 12:31:49 2019] gim info:(enable_sriov:300) Enable SRIOV vfs count = 16
[Fri Mar 1 12:31:49 2019] pci 0000:61:02.0: [1002:692f] type 7f class 0xffffff
[Fri Mar 1 12:31:49 2019] pci 0000:61:02.0: unknown header type 7f, ignoring device
[Fri Mar 1 12:31:50 2019] gim error:(enable_sriov:311) Fail to enable sriov, status = fffffffb
[Fri Mar 1 12:31:50 2019] gim error:(set_new_adapter:668) Failed to properly enable SRIOV
[Fri Mar 1 12:31:50 2019] gim info:(gim_probe:91) AMD GIM probe: pf_count = 1
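
Editor's note: the status in the log is printed as unsigned 32-bit hex. The sketch below decodes it on the assumption that it is a negative Linux errno returned from the kernel when enabling SR-IOV; the thread does not confirm where the value comes from, and `decode_status` is a hypothetical helper, not part of GIM.

```python
import errno
import os


def decode_status(hex_status):
    """Interpret a 32-bit hex status as a signed integer and look up the errno name.

    Assumption: the value is a negative errno (kernel convention), printed with %x.
    """
    value = int(hex_status, 16)
    if value >= 0x80000000:
        # Undo the unsigned formatting: reinterpret as 32-bit two's complement.
        value -= 0x100000000
    code = -value
    name = errno.errorcode.get(code, "unknown")
    return value, name, os.strerror(code)


signed, name, message = decode_status("fffffffb")
print(signed, name, message)
```

Under that assumption, fffffffb is -5, which corresponds to EIO (a generic I/O error). That is consistent with the "unknown header type 7f" line just before it, where the new device could not be read properly.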

Hardware-wise, I am using an ASRock EPYCD8-2T with an AMD EPYC 7351P processor.

In the BIOS, I have IOMMU, SR-IOV and ACS enabled.

Can anyone please advise why I might be having this issue? Or is there anything I can try?

Thank you! Mark

vigchand2705 commented 5 years ago

Could you check if ARI is enabled in BIOS?

What is the kernel version?

markednmbr1 commented 5 years ago

Hi Vignesh,

Kernel is 4.18

There is no option for ARI in the BIOS that I can find.

Thanks, Mark

markednmbr1 commented 5 years ago

lspci seems to show the compatibility though:

42:00.0 0300: 1002:6929 (prog-if 00 [VGA controller])
    Subsystem: 1849:6929
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 5
    NUMA node: 2
    Region 0: Memory at fce0000000 (64-bit, prefetchable) [size=256M]
    Region 2: Memory at fcf4000000 (64-bit, prefetchable) [size=2M]
    Region 4: I/O ports at 3000 [size=256]
    Region 5: Memory at eb400000 (32-bit, non-prefetchable) [size=256K]
    Expansion ROM at eb440000 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
            EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
    Capabilities: [a0] MSI: Enable- Count=1/4 Maskable+ 64bit+
        Address: 0000000000000000 Data: 0000
        Masking: 00000000 Pending: 00000000
    Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150 v2] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
    Capabilities: [200 v1] #15
    Capabilities: [270 v1] #19
    Capabilities: [2b0 v1] Address Translation Service (ATS)
        ATSCap: Invalidate Queue Depth: 00
        ATSCtl: Enable+, Smallest Translation Unit: 00
    Capabilities: [2c0 v1] Page Request Interface (PRI)
        PRICtl: Enable- Reset-
        PRISta: RF- UPRGI- Stopped+
        Page Request Capacity: 00000020, Page Request Allocation: 00000000
    Capabilities: [2d0 v1] Process Address Space ID (PASID)
        PASIDCap: Exec+ Priv+, Max PASID Width: 10
        PASIDCtl: Enable- Exec- Priv-
    Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 0
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [330 v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap: Migration-, Interrupt Message Number: 000
        IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
        IOVSta: Migration-
        Initial VFs: 16, Total VFs: 16, Number of VFs: 0, Function Dependency Link: 00
        VF offset: 16, stride: 1, Device ID: 692f
        Supported Page Size: 00000553, System Page Size: 00000001
        Region 0: Memory at 000000fbe0000000 (64-bit, prefetchable)
        Region 2: Memory at 000000fcf0000000 (64-bit, prefetchable)
        Region 5: Memory at e7400000 (32-bit, non-prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Capabilities: [400 v1] Vendor Specific Information: ID=0002 Rev=1 Len=070 <?>
    Kernel modules: amdgpu
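
Editor's note: a dump this long is easier to check with a small script. The sketch below pulls the extended capability names and the SR-IOV counters out of a short excerpt copied from the output above. The function names are hypothetical, and the regular expressions assume the line-per-entry layout that lspci normally prints.

```python
import re

# Excerpt copied from the lspci output in this comment (one entry per line).
LSPCI_EXCERPT = """\
Capabilities: [2b0 v1] Address Translation Service (ATS)
Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
Capabilities: [330 v1] Single Root I/O Virtualization (SR-IOV)
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
Initial VFs: 16, Total VFs: 16, Number of VFs: 0, Function Dependency Link: 00
VF offset: 16, stride: 1, Device ID: 692f
"""


def list_capabilities(text):
    """Map capability offset (int) to capability name for each 'Capabilities:' line."""
    pattern = re.compile(r"Capabilities: \[([0-9a-f]+)(?: v\d+)?\] (.+)")
    return {int(m.group(1), 16): m.group(2).strip() for m in pattern.finditer(text)}


def sriov_fields(text):
    """Extract the numeric SR-IOV fields that appear as 'name: <decimal>'."""
    fields = {}
    for key in ("Initial VFs", "Total VFs", "Number of VFs", "VF offset", "stride"):
        match = re.search(rf"{re.escape(key)}: (\d+)", text)
        if match:
            fields[key] = int(match.group(1))
    return fields


caps = list_capabilities(LSPCI_EXCERPT)
fields = sriov_fields(LSPCI_EXCERPT)
print(caps)
print(fields)
```

Note that this only shows what the card itself advertises. The ARI capability listed on the endpoint is separate from ARI forwarding on the port above it, which this output does not cover.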

vigchand2705 commented 5 years ago

Oh, try blacklisting amdgpu.

markednmbr1 commented 5 years ago

It is blacklisted (lsmod shows it is not loaded).

vigchand2705 commented 5 years ago

Hmm, I don't see any obvious problems so far. Any idea where "pci 0000:61:02.0: unknown header type 7f, ignoring device" gets logged from? It doesn't seem to be GIM.

markednmbr1 commented 5 years ago

I think it is logged when gim is bringing up the card. That address is in the slot the S7150 is in. (Sorry, the card has been moved since the first log, when it was on bus 61.) The updated log with the current slot is:

[Mon Mar 4 14:32:19 2019] gim info:(enable_sriov:299) Enable SRIOV
[Mon Mar 4 14:32:19 2019] gim info:(enable_sriov:300) Enable SRIOV vfs count = 16
[Mon Mar 4 14:32:19 2019] pci 0000:42:02.0: [1002:692f] type 7f class 0xffffff
[Mon Mar 4 14:32:19 2019] pci 0000:42:02.0: unknown header type 7f, ignoring device
[Mon Mar 4 14:32:20 2019] gim error:(enable_sriov:311) Fail to enable sriov, status = fffffffb
[Mon Mar 4 14:32:20 2019] gim error:(set_new_adapter:668) Failed to properly enable SRIOV
[Mon Mar 4 14:32:20 2019] gim info:(gim_probe:91) AMD GIM probe: pf_count = 1
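
Editor's note: the address 42:02.0 in this log can be derived from the SR-IOV values in the lspci output earlier in the thread (VF offset: 16, stride: 1). The sketch below applies the SR-IOV routing-ID rule (VF routing ID = PF routing ID + first VF offset + (n - 1) * stride) and prints the result in the conventional bus:device.function form. `vf_address` is a hypothetical helper written for illustration, not something taken from GIM or the kernel.

```python
def vf_address(pf_bdf, vf_offset, vf_stride, vf_index):
    """Return 'bus:device.function' for the VF with 1-based index vf_index.

    pf_bdf is the physical function address without a domain, e.g. '42:00.0'.
    """
    bus_str, rest = pf_bdf.split(":")
    dev_str, fn_str = rest.split(".")
    # A routing ID is 16 bits: 8-bit bus, then 5-bit device and 3-bit function.
    pf_rid = (int(bus_str, 16) << 8) | (int(dev_str, 16) << 3) | int(fn_str, 16)
    vf_rid = pf_rid + vf_offset + (vf_index - 1) * vf_stride
    bus = vf_rid >> 8
    devfn = vf_rid & 0xFF
    return f"{bus:02x}:{devfn >> 3:02x}.{devfn & 7:x}"


first_vf = vf_address("42:00.0", vf_offset=16, vf_stride=1, vf_index=1)
last_vf = vf_address("42:00.0", vf_offset=16, vf_stride=1, vf_index=16)
print(first_vf)  # 42:02.0, the device the kernel reports as unreadable
print(last_vf)
```

So the first virtual function sits at what looks like device 2 on the card's bus. With ARI, the low 8 bits are treated as a single function number of device 0. Without ARI forwarding on the port above the card, a request addressed to a non-zero device number would most likely never reach it, which would fit the all-ones reads (type 7f, class 0xffffff) in the log.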

markednmbr1 commented 5 years ago

The problem was ARI not being enabled. I spoke to ASRock Rack and got an as-yet-unreleased BIOS that has the ARI Forwarding option. I enabled it and everything is now working!

Closing this.