clbr / radeontop

GNU General Public License v3.0

No VRAM for R9 270X #48

Closed DanielPower closed 3 years ago

DanielPower commented 7 years ago

radeontop outputs "Failed to open DRM node, no VRAM support." then continues to run, but does not display my VRAM usage.

My card is a Club3D R9 270X. It's a very uncommon brand, so someone in the #radeon irc suggested that it may be that my PCI ID is not in the supported devices list. I've attached part of the output of lspci -vv. If any more information is needed, I'll be happy to provide it.

Thank you.

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Curacao XT / Trinidad XT [Radeon R7 370 / R9 270X/370X] (prog-if 00 [VGA controller])
    Subsystem: Hightech Information System Ltd. Device 2336
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 49
    NUMA node: 0
    Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
    Region 2: Memory at fea00000 (64-bit, non-prefetchable) [size=256K]
    Region 4: I/O ports at e000 [size=256]
    Expansion ROM at 000c0000 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
        AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        AtomicOpsCtl: ReqEn-
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
             EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee00000  Data: 0000
    Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
    Capabilities: [200 v1] #15
    Capabilities: [270 v1] #19
    Capabilities: [2b0 v1] Address Translation Service (ATS)
        ATSCap: Invalidate Queue Depth: 00
        ATSCtl: Enable+, Smallest Translation Unit: 00
    Capabilities: [2c0 v1] Page Request Interface (PRI)
        PRICtl: Enable- Reset-
        PRISta: RF- UPRGI- Stopped+
        Page Request Capacity: 00000020, Page Request Allocation: 00000000
    Capabilities: [2d0 v1] Process Address Space ID (PASID)
        PASIDCap: Exec+ Priv+, Max PASID Width: 10
        PASIDCtl: Enable- Exec- Priv-
    Kernel driver in use: radeon
    Kernel modules: radeon, amdgpu

EDIT: Updated lspci output

clbr commented 7 years ago

Your output doesn't include the pci id. You need to use "lspci -vnn" for the numbers to be there.

However, google says it's 1002:6810. That pci id is included, so something else is preventing VRAM status. Maybe your kernel or libdrm is too old, or maybe your radeontop build was without amdgpu support.

DanielPower commented 7 years ago

1002:6810 is the same ID that shows up when I run lspci -vnn, so that is not the issue.

I am not using amdgpu, I'm using radeon as my kernel driver, so building without amdgpu support should not be an issue. My kernel is 4.11.9, and libdrm is 2.4.81.

I'm running a fully up to date Archlinux system, and tried both the latest stable radeontop, as well as building from git.

clbr commented 7 years ago

Try adding printfs in detect.c, for example right after drmOpen, to see what it returned.

DanielPower commented 7 years ago

haagch on the #radeon irc just got me to do this as well. I added printf("drmOpen(): %d\n", drm_fd); under line 73 (under drmOpen). The output is drmOpen(): -1

clbr commented 7 years ago

Run it with LIBGL_DEBUG=verbose (sudo LIBGL_DEBUG=verbose radeontop), that should print some info on why drmOpen fails.

DanielPower commented 7 years ago

So it looks like the issue may be in libdrm, rather than radeontop. I'm not entirely sure what this means, but it appears it's falling back to an old version because 1.4 fails. Then it tries to find my card, and fails to find it.

Output:

drmOpenDevice: node name is /dev/dri/card0
drmOpenDevice: open result is 4, (OK)
drmOpenByBusid: Searching for BusID pci:0000:01:00.0
drmOpenDevice: node name is /dev/dri/card0
drmOpenDevice: open result is 4, (OK)
drmOpenByBusid: drmOpenMinor returns 4
drmOpenByBusid: Interface 1.4 failed, trying 1.1
drmOpenByBusid: drmGetBusid reports 
drmOpenDevice: node name is /dev/dri/card1
drmOpenByBusid: drmOpenMinor returns -1
drmOpenDevice: node name is /dev/dri/card2
drmOpenByBusid: drmOpenMinor returns -1
drmOpenDevice: node name is /dev/dri/card3
drmOpenByBusid: drmOpenMinor returns -1
drmOpenDevice: node name is /dev/dri/card4
drmOpenByBusid: drmOpenMinor returns -1
drmOpenDevice: node name is /dev/dri/card5
drmOpenByBusid: drmOpenMinor returns -1
drmOpenDevice: node name is /dev/dri/card6
drmOpenByBusid: drmOpenMinor returns -1
drmOpenDevice: node name is /dev/dri/card7
drmOpenByBusid: drmOpenMinor returns -1
drmOpenDevice: node name is /dev/dri/card8
drmOpenByBusid: drmOpenMinor returns -1
drmOpenDevice: node name is /dev/dri/card9
drmOpenByBusid: drmOpenMinor returns -1
drmOpenDevice: node name is /dev/dri/card10
drmOpenByBusid: drmOpenMinor returns -1
drmOpenDevice: node name is /dev/dri/card11
drmOpenByBusid: drmOpenMinor returns -1
drmOpenDevice: node name is /dev/dri/card12
drmOpenByBusid: drmOpenMinor returns -1
drmOpenDevice: node name is /dev/dri/card13
drmOpenByBusid: drmOpenMinor returns -1
drmOpenDevice: node name is /dev/dri/card14
drmOpenByBusid: drmOpenMinor returns -1
drmOpenDevice: node name is /dev/dri/card15
drmOpenByBusid: drmOpenMinor returns -1
Failed to open DRM node, no VRAM support.
Collecting data, please wait....

clbr commented 7 years ago

Yes, it's either a libdrm or kernel bug (or very improbably udev). Please open a bug against libdrm, titled something like "libdrm fails to get bus id" (do try libdrm git first, though).

The interface version doesn't matter, my system also prints that.
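
For readers following the log above: the string drmOpenByBusid searches for ("pci:0000:01:00.0") encodes the PCI domain, bus, device and function, and here corresponds to the 01:00.0 address from the lspci output. The sketch below shows how such a string can be built. The helper name format_busid is hypothetical and the format is inferred from the log line, so treat it as an illustration rather than radeontop's actual code.

```c
#include <stdio.h>

/* Hypothetical helper (not taken from radeontop): format a PCI location
 * as the BusID string seen in the log, e.g. "pci:0000:01:00.0".
 * Returns 0 on success, -1 if the result did not fit into the buffer. */
int format_busid(char *out, size_t len, unsigned domain, unsigned bus,
                 unsigned dev, unsigned func)
{
    int n = snprintf(out, len, "pci:%04x:%02x:%02x.%u",
                     domain, bus, dev, func);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A string built this way is what would then be handed to drmOpen as its bus ID argument. In this thread, the lookup of that string is the step that fails.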

DanielPower commented 7 years ago

Thank you for your help. The issue still occurs with libdrm-git, so I will open an issue with them.

DanielPower commented 7 years ago

I received a response from my bug report on libdrm:

DanielP an orthogonal solution is to simply not use drmOpen. While it works, sometimes, there's a lot of hidden gotchas.

Simply replace the pciaccess + drmOpen with drmDevice2 - see libdrm/tests/drmdevice.c.

Notes:

  • radeon-top does not need to open the card node, hence no need for auth - directly or via xcb
  • using pciaccess, or drmDevice2 with DRM_DEVICE_GET_PCI_REVISION will wake up your discrete GPU, even if you're looking for the stats of your APU - you want to avoid that if possible.
  • do not forget to close the fd - currently it's leaked.

I'm not sure if this means much to you. I am not a C programmer, so I'm not sure how to write the changes they suggest myself.

The bug report is here: https://bugs.freedesktop.org/show_bug.cgi?id=101837

DanielPower commented 7 years ago

I've just tested on my friend's system and he has the exact same issue. His system is running Archlinux with a Radeon HD 6950 running the latest stable Mesa driver.

V10lator commented 6 years ago

Same here with Mesa and libdrm from git on a Radeon 6950.

@clbr "it's either a libdrm or kernel bug (or very improbably udev)" - No, it's not. I just bisected it:

good: fc512703417b443f7d7f631a5cf8b028ed45885e - Allow unprivileged use of radeontop via console
bad: 6fb1c5c03c43791b21c0b465f52794d59c0e8db2 - Add a LIBDIR indirection for some x86_64 systems

I know there are commits in between the good and the bad commit: I couldn't test them as gentoo refuses to install them:

 * QA Notice: The following shared libraries lack a SONAME
 * /usr/lib/libradeontop_xcb.so

Files matching a file type that is not allowed:
   usr/lib/libradeontop_xcb.so
 * ERROR: x11-apps/radeontop-9999::gentoo failed:
 *   multilib-strict check failed!

Please fix this ASAP as I need a way to measure VRAM and GTT live to confirm a possible driver bug.

clbr commented 6 years ago

Please look at commit 219a3e6991 in your range. That very commit changed from a hardcoded path to using libdrm's drmOpen. If drmOpen fails for you, then it is not radeontop's fault, as the arguments to it are correct.

clbr commented 6 years ago

(and seriously, "fix this ASAP" on new year's eve?)

V10lator commented 6 years ago

It's a bug in your app if it worked before: There's something called fallback. Weirdly even with a fallback it doesn't work: https://github.com/V10lator/radeontop/commit/21f107e69ba9dc74a818660e6c3bf3ac0f0470d5

This leads me to believe you don't know the real root of the bug, and I really don't have the time to read into your codes and debug this for you.

(Sorry for the "ASAP", but as I said, I was in urgent need of it (and since you have refused to fix this for months, I had to hack together a version that works for me on New Year's Eve))

clbr commented 6 years ago

I don't care to trace a likely kernel bug I can't reproduce, on hardware I do not have.

your codes

It is not my code that fails. I don't want to repeat myself, so please see my previous reply.

clbr commented 6 years ago

Or a gimp analogy: gimp version Y used gtk2 and ran on your computer, Y+1 changed to gtk3 and your gtk3 setup fails for some reason. You'd still argue it's gimp's bug.

Fischer-Simon commented 6 years ago

I got the same problem with the VRAM reporting. But when I just open the node and set the driver name to amdgpu it works.

diff --git a/detect.c b/detect.c
index 6e6d7c6..078bbf1 100644
--- a/detect.c
+++ b/detect.c
@@ -83,6 +83,10 @@ unsigned int init_pci(unsigned char bus, const unsigned char forcemem) {
        if (drm_fd < 0 && access("/dev/ati/card0", F_OK) == 0) // fglrx path
                drm_fd = open("/dev/ati/card0", O_RDWR);

+       // Workaround
+       drm_fd = open("/dev/dri/card0", O_RDWR);
+       strcpy(drm_name, "amdgpu");
+
        use_ioctl = 0;
        if (drm_fd >= 0) {
                authenticate_drm(drm_fd);

Edit: My GPU is an AMD RX 460

V10lator commented 6 years ago

"Or a gimp analogy: gimp version Y used gtk2 and ran on your computer, Y+1 changed to gtk3 and your gtk3 setup fails for some reason. You'd still argue it's gimp's bug." Wasn't gtk started as a toolkit for Gimp, and isn't it developed by the same people? Even if that weren't the case, I would bet the Gimp devs would do their best to assist instead of ignoring the problem. Also, you didn't change an underlying toolkit. Last but not least, your software seems to be the only code on earth triggering this bug, so you should work hand-in-hand with the kernel devs, as nobody knows the code as well as you do. Again: I don't have the time to do your work!

@Fisher42 Setting the driver name is the key here I guess. Will try to do that later on and if it works with the fallback codes I'll do a pull request.

//EDIT: @clbr Think about it this way: Your codes worked before, you know exactly what change triggered the bug and a fallback to the old codes seems plain stupid (not like changing from gtk3 back to gtk2 which I agree would be a stupid fallback), so why are you refusing to do that?

clbr commented 6 years ago

I'm doing this in my free time, which I do not have an abundance of - why do you assume you are the only busy person? I have posted to and am subscribed to the bug linked above, so I'm already doing what you ask, working together. Secondly, this is not the only software that triggers it: many of the libdrm tests can also do so, and X drivers have done so in the past. You can likely write a couple-line sample that just calls drmOpen with your bus string, and have that reproduce it too.

A fallback like your commit would be bad for two reasons: the bus path uses the very same function that fails inside drmOpen, so it would just repeat the failure; and a hardcoded card0 would fail on systems for which the drmOpen commit was made, those with intel+amd or nvidia+amd. Working around a failing dependency is not the spirit of open source, it is to fix problems at their source.
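
To illustrate the multi-GPU concern: on an intel+amd or nvidia+amd system, card0 may not be the AMD device, so any direct-open approach would need to check which card it is looking at. A minimal sketch of such a check follows. It assumes the usual sysfs layout, /sys/class/drm/card<N>/device/vendor containing text such as "0x1002". The helper names are hypothetical and this is not what radeontop does.

```c
#include <stdio.h>
#include <stdlib.h>

#define PCI_VENDOR_ATI 0x1002L /* AMD/ATI vendor ID, as in 1002:6810 */

/* Parse sysfs-style hex text such as "0x1002\n".
 * Returns the value, or -1 if the text is not a hex number. */
long parse_sysfs_hex(const char *text)
{
    char *end;
    long v = strtol(text, &end, 16);

    if (end == text)
        return -1;
    while (*end == '\n' || *end == ' ')
        end++;
    return (*end == '\0' && v >= 0) ? v : -1;
}

/* Read <sysfs_root>/card<N>/device/vendor; the root is a parameter so the
 * function can be pointed at a test directory. Returns -1 on any failure. */
long read_card_vendor(const char *sysfs_root, int card)
{
    char path[256], buf[32];
    char *ok;
    FILE *f;

    snprintf(path, sizeof path, "%s/card%d/device/vendor", sysfs_root, card);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = fgets(buf, sizeof buf, f);
    fclose(f);
    return ok ? parse_sysfs_hex(buf) : -1;
}

/* Nonzero only if the given card reports the AMD/ATI vendor ID. */
int is_amd_card(const char *sysfs_root, int card)
{
    return read_card_vendor(sysfs_root, card) == PCI_VENDOR_ATI;
}
```

A caller would pass "/sys/class/drm" as the root and skip any card for which is_amd_card returns 0, rather than assuming card0.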

Since I can't reproduce it, and the bug likely lies in the kernel, there is not much I can do. For anyone interested and with afflicted hardware, here's what I'd do: follow the setInterfaceVersion and getUnique drm ioctls in the kernel, adding printks to every path to see what fails and how. I can't give more specific instructions without spending a lot of time on it.

anadon commented 6 years ago

At the risk of provocation, the VRAM part is also not working on my system, which uses an RX 480.

V10lator commented 6 years ago

New libdrm, new kernel and new card (rx 580) but still the same problem. Libdrm devs say "An orthogonal solution is to simply not use drmOpen. While it works, sometimes, there's a lot of hidden gotchas" (Source: https://bugs.freedesktop.org/show_bug.cgi?id=101837#c3 ) - yet @clbr ignores the gotchas and says it has to work his way. Also it's funny to see that other tools have no problem whatsoever to read the VRAM.

@anadon I don't think @clbr will ever bother to fix this, so just use other tools (like radeon-profile: https://github.com/marazmista/radeon-profile ).

anadon commented 6 years ago

@V10lator You're being obnoxious. @clbr They have a point and others have already figured this out and the solution has already been linked above.

clbr commented 6 years ago

As I wrote in that bug's comment 4 on 2017-07-21, the one following the linked comment 3, radeontop does need to open the node and so the solution provided in comment 3 cannot be used.

anadon commented 6 years ago

Please walk us through what information is needed by radeontop that is only available through root access to a node.

V10lator commented 6 years ago

@clbr Did you look at how radeon-profile reads out the VRAM? It seems it doesn't use drmOpen (but I didn't look too deeply, so I might be wrong. Anyway, it works there):

https://github.com/marazmista/radeon-profile/blob/master/radeon-profile/dxorg.cpp#L42
https://github.com/marazmista/radeon-profile/blob/master/radeon-profile/dxorg.cpp#L112
https://github.com/marazmista/radeon-profile/blob/master/radeon-profile/dxorg.cpp#L319
https://github.com/marazmista/radeon-profile/blob/master/radeon-profile/ioctlHandler.cpp#L42
https://github.com/marazmista/radeon-profile/blob/master/radeon-profile/ioctl_amdgpu.cpp#L143

@anadon Sorry, I just woke up before writing my last comment. Still I fail to see what was obnoxious as I just pointed out that there are gotchas @clbr should be handling (if he wants to keep the drmOpen path), that other tools are able to read the VRAM without problems and giving you an alternative tool to help you reading out the VRAM usage.

anadon commented 6 years ago

@V10lator You're blaming clbr and backing him into a corner where he needs to justify himself very defensively in order to maintain any validity of his opinion and position. Putting people into such a social corner stresses them and reduces how cooperative they are, yielding a strictly worse situation. You did this in a very direct manner with a harsh tone. In the flow of communication these actions are very disruptive, thus fitting the descriptor 'obnoxious'. There are situations where being so harsh is necessary, but this is not one of them. For further reading, there was a study of effective arguments on r/changemyview which touches on many of these points [1], and IBM released a tone analyzer [2] for writing, which can make such issues plainer when you might otherwise miss something.

[1] https://arxiv.org/pdf/1602.01103
[2] https://www.ibm.com/watson/services/tone-analyzer/

clbr commented 6 years ago

@anadon It's not root access, but node access with read permissions that is required. It's required to do the VRAM queries on any kernel, and to do any queries on kernels that have strict /dev/mem io checks enabled. The issue here is that drmOpen doesn't find the node, and v10lator wants me to hardcode card0, instead of fixing drmOpen. Hardcoding is unacceptable as explained above, it would fail in many systems with multiple cards.

@V10lator No, I did not look at it, I'm busy. Now that you linked the direct lines, I see it parses card info from /sys. This is a valid approach and drmOpen could be replaced by it. However I don't have the time in the near future to do so, and as mentioned, I can't reproduce this issue with hardware I have. Patches welcome.

Hello71 commented 5 years ago

OK, I traced the code. it's very bad, but basically:

  1. find the cards by scanning /sys/class/drm. filter by ^card\d+$.
  2. open /dev/dri/renderDx with the number from step 1.
  3. if that doesn't work, open /dev/dri/cardx with the number from step 1.

there is probably a menu somewhere for picking the card.
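
The three steps above can be sketched in C roughly as follows. This is an illustration of the described approach, not code from radeon-profile or radeontop: the function names are hypothetical, and how the render node path is derived from the card number is left to the caller.

```c
#include <ctype.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

/* Step 1 filter: return N if name is exactly "card<N>" (i.e. ^card\d+$),
 * or -1 otherwise, e.g. for "card0-HDMI-A-1" or "renderD128". */
int card_index(const char *name)
{
    char *end;
    long n;

    if (strncmp(name, "card", 4) != 0 || !isdigit((unsigned char)name[4]))
        return -1;
    n = strtol(name + 4, &end, 10);
    if (*end != '\0' || n > 255) /* 255 is an arbitrary sanity limit */
        return -1;
    return (int)n;
}

/* Step 1: scan a directory such as /sys/class/drm and collect up to max
 * card indices. Returns the count, or -1 if the directory cannot be opened. */
int list_cards(const char *dir, int *out, int max)
{
    struct dirent *e;
    int count = 0;
    DIR *d = opendir(dir);

    if (d == NULL)
        return -1;
    while (count < max && (e = readdir(d)) != NULL) {
        int idx = card_index(e->d_name);

        if (idx >= 0)
            out[count++] = idx;
    }
    closedir(d);
    return count;
}

/* Steps 2 and 3: try the render node first, then fall back to the card
 * node. Returns an open file descriptor, or -1 if both attempts fail. */
int open_node_with_fallback(const char *render_path, const char *card_path)
{
    int fd = open(render_path, O_RDWR);

    if (fd < 0)
        fd = open(card_path, O_RDWR);
    return fd;
}
```

The caller is responsible for closing the returned descriptor once the queries are done.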

k3a commented 5 years ago

So... what I've found so far:

Long story short, you should read comment at the top of https://elixir.bootlin.com/linux/v5.1/source/drivers/gpu/drm/drm_ioctl.c#L43 and open the device by name or directly as @Hello71 wrote.

Opening by name can be as simple as adding a new command-line option for specifying the driver name or the path to the DRI node. No need to make something fancy. Opening by BUSID is simply unsupported, and it is sad that this bug has been known for 2 years and is still not fixed. :(
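
As a rough idea of how small such an option could be, here is a sketch using getopt_long. The option name -p/--path and the function parse_node_path are assumptions made for illustration and do not describe any existing radeontop flag.

```c
#include <getopt.h>
#include <stddef.h>

/* Hypothetical option parser: returns the argument of -p/--path, or NULL
 * when the option is absent or an unknown option is encountered. */
const char *parse_node_path(int argc, char *argv[])
{
    static const struct option longopts[] = {
        { "path", required_argument, NULL, 'p' },
        { NULL, 0, NULL, 0 }
    };
    const char *path = NULL;
    int c;

    optind = 1; /* restart scanning so the function can be called again */
    opterr = 0; /* keep getopt from printing its own error messages */
    while ((c = getopt_long(argc, argv, "p:", longopts, NULL)) != -1) {
        if (c != 'p')
            return NULL;
        path = optarg;
    }
    return path;
}
```

The returned path (for example /dev/dri/card1) could then be passed straight to open(), bypassing the BusID lookup entirely.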

clbr commented 5 years ago

trek00 contributed some amdgpu improvements. Please test current git.

k3a commented 5 years ago

I've tested the current master (09d8c0b15) on Arch Linux with kernel 5.2.0-arch2-1-ARCH and it works perfectly - VRAM is shown on my POLARIS11. Even under a non-root user. Good work!

anthonybilinski commented 3 years ago

Seconding @k3a, running an R9 390 on Ubuntu 20.10 with kernel 5.8.0-44-generic, I saw the same issue with the repo package at version 1.2-1. Compiling tip at v1.3-3-g1ed2440 gives me memory readings again.

@clbr I think this can be closed, thanks a lot for the tool.

azrabrijer commented 1 year ago

Hello, the problem of 0% load being displayed in the MangoHud overlay still persists.