Syllo / nvtop

GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm

Wrong PCIe generation and lane width for some amdgpus #138

Closed bachandi closed 2 years ago

bachandi commented 2 years ago

My AMD Radeon Pro WX 5100 graphics card is reported as PCIe GEN 6@1x, which is not correct, as the card only supports PCIe GEN 3.

I just quickly checked. The card reports linkSpeed=8000 and laneWidth=1, which results in a laneSpeed of 8000, which in turn causes pcieGen to be set to 6.

I also checked the hwmon file system, where the card correctly reports 8.0 GT/s PCIe, which is the correct speed for PCIe Gen 3@1x.
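For reference, the per-lane rates involved here can be sketched as a small lookup. This is only an illustration, not nvtop's actual code: the helper name pcie_gen_from_rate is hypothetical, and it assumes the reported value (8000 in this case) is already a per-lane transfer rate in MT/s, so it must not be divided or multiplied by the lane count before the lookup.

```c
#include <stddef.h>

/* Hypothetical helper (not part of nvtop): map a per-lane transfer rate in
 * MT/s to a PCIe generation. Returns 0 if the rate matches no generation. */
static unsigned pcie_gen_from_rate(unsigned rate_mts) {
  /* Per-lane signalling rates of PCIe generations 1 through 6. */
  static const struct {
    unsigned rate_mts;
    unsigned gen;
  } table[] = {
      {2500, 1}, {5000, 2}, {8000, 3}, {16000, 4}, {32000, 5}, {64000, 6},
  };
  for (size_t i = 0; i < sizeof table / sizeof table[0]; ++i) {
    if (table[i].rate_mts == rate_mts)
      return table[i].gen;
  }
  return 0; /* unknown rate: better to report nothing than to guess */
}
```

With this mapping, a reading of 8000 gives generation 3 regardless of whether the link is x1 or x16.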

On another machine, an amdgpu is reported as PCIe GEN 3@16x but is actually running at PCIe GEN 3@1x, while on yet another machine an amdgpu reported as PCIe GEN 3@16x is actually correct.

Maybe there is something off with the pcieGen and laneWidth detection?

Syllo commented 2 years ago

I am genuinely confused as to what the kernel/driver is reporting through sysfs. On my side it reports 16.0 GT/s, which, if I understand correctly, should be generation 4, although my CPU and motherboard only support version 3. Hence, I thought that I had to divide the 16 by the number of lanes to get the speed of one lane and deduce the generation from that. I was obviously wrong about that.

I'll investigate and try to come up with a fix.
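As a rough sketch of how a sysfs string such as "16.0 GT/s" could be turned into a number without any division by the lane count: the helper below is hypothetical (the name parse_link_speed_mts is made up for illustration) and assumes the text has the form "<number> GT/s", optionally followed by more text such as "PCIe", as in the values quoted in this thread.

```c
#include <stdio.h>

/* Hypothetical helper (not part of nvtop): parse a string such as
 * "8.0 GT/s PCIe" into a transfer rate in MT/s.
 * Returns 0 when the text is not of the form "<number> GT/s". */
static unsigned parse_link_speed_mts(const char *text) {
  double gts = 0.0;
  int consumed = 0;
  /* %n is only stored if the literal "GT/s" after the number matched,
   * so consumed == 0 means the unit was missing. */
  if (sscanf(text, "%lf GT/s%n", &gts, &consumed) != 1 || consumed == 0 ||
      gts <= 0.0)
    return 0;
  /* Round to the nearest MT/s so that 2.5 GT/s becomes exactly 2500. */
  return (unsigned)(gts * 1000.0 + 0.5);
}
```

The result is kept as a whole-link-independent rate; whether the driver really reports a per-lane value on every system is exactly the open question in this thread.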

Syllo commented 2 years ago

Could you please verify that the patch 28fdcd19f66266329160e1a3d55cc0d5f2785bd6, merged in the master branch, fixed this issue?

bachandi commented 2 years ago

Thanks for the quick patch. The PCIe generation is now correct for the three cards I tested, but the link width is still wrong for one card: it is connected as PCIe GEN 3@1x but shows as PCIe GEN 3@16x.

But I suspect this could be a driver issue, as the pp_dpm_pcie file also has the wrong x16 information:

cat device/pp_dpm_pcie
0: 2.5GT/s, x8 
1: 8.0GT/s, x16 *

Whereas lspci -vv gives:

Subsystem: Dell Ellesmere [Radeon Pro WX 5100]
Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
    LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <1us
        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
    LnkSta: Speed 8GT/s (ok), Width x1 (downgraded)
        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

It seems that, for this card, pp_dpm_pcie reports the link capabilities instead of the link width currently in use. For a different card, PCIe GEN 3@1x is reported correctly, and another with PCIe GEN 3@16x is also correct.
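To make the pp_dpm_pcie format above concrete, here is a minimal sketch of picking out the entry marked with '*'. It is not nvtop's implementation: the struct pcie_level type and the function parse_active_pcie_level are hypothetical names, and the line layout "<index>: <rate>GT/s, x<width>" is assumed purely from the output pasted in this comment.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical result type: rate in MT/s and lane count of the active level. */
struct pcie_level {
  unsigned rate_mts;
  unsigned width;
};

/* Hypothetical helper (not part of nvtop): scan the contents of pp_dpm_pcie
 * and fill *out from the line marked with '*'.
 * Returns 1 on success, 0 if no marked, well-formed line is found. */
static int parse_active_pcie_level(const char *text, struct pcie_level *out) {
  const char *line = text;
  while (line && *line) {
    const char *end = strchr(line, '\n');
    size_t len = end ? (size_t)(end - line) : strlen(line);
    unsigned index, width;
    double gts;
    /* Only consider lines that carry the '*' marker of the active level. */
    if (memchr(line, '*', len) &&
        sscanf(line, "%u: %lfGT/s, x%u", &index, &gts, &width) == 3) {
      out->rate_mts = (unsigned)(gts * 1000.0 + 0.5);
      out->width = width;
      return 1;
    }
    line = end ? end + 1 : NULL;
  }
  return 0;
}
```

Fed the two lines shown above, this yields 8000 MT/s and a width of 16, which matches what the file says but not the x1 that lspci reports in LnkSta, so the parsing itself is not where the discrepancy comes from.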

Syllo commented 2 years ago

Nice. Either way the info in pp_dpm_pcie seems more trustworthy than the one reported by current_link_speed. I opened a bug report for the info discrepancy. We'll see if that was a bug or not.

PIPIPIG233666 commented 1 year ago

This could be specific to my setup, but posting here anyway: the kernel I have (xanmod) does not load the amdgpu firmware by itself. After it picks up the correct firmware, /sys/class/drm/card0/device/pp_dpm_pcie shows the correct speed.