amshafer / nvidia-driver

Fork of the Nvidia FreeBSD driver to port the nvidia-drm.ko module from Linux

[550.54.14] PRIME render offload stalls X startup #22

Open vishwin opened 4 months ago

vishwin commented 4 months ago

hw.nvidiadrm.modeset=1 is set in /boot/loader.conf, and the kernel modules are loaded from /etc/rc.conf. X startup stalls with the screen off after libinput initialises the last pointing device. The machine is otherwise responsive and X can still be zapped.

550.54.14 Xorg.log

550.54.14 dmesg

535.146.02 dmesg (Xorg.0.log now missing :pensive:)

amshafer commented 4 months ago

One last little detail, what's the display setup for this look like? Just the laptop screen or is there an external monitor plugged in as well? When I try with an external monitor I hit the panic in #21, so I'm assuming you're not doing that.

Another thing to check would be that you have two cardN entries in /dev/dri/, but I'm assuming that's the case since it seems everything initializes correctly.
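
A quick way to run that check, as a minimal sketch: the helper below counts cardN nodes in a directory, defaulting to /dev/dri. The name count_dri_cards is made up for this example; on a two-GPU PRIME laptop the expected answer would be 2.

```sh
# Count DRM card nodes in a directory (default: /dev/dri).
# count_dri_cards is a hypothetical helper name, not part of any package.
count_dri_cards() {
    dir="${1:-/dev/dri}"
    count=0
    for node in "$dir"/card[0-9]*; do
        # With no match the glob stays literal, so test that the node exists.
        [ -e "$node" ] && count=$((count + 1))
    done
    echo "$count"
}
```

Running count_dri_cards with no argument inspects /dev/dri; anything other than 2 would suggest one of the two DRM drivers did not attach.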

amshafer commented 4 months ago

Ah nvm, reproduced

amshafer commented 4 months ago

For whatever strange reason I can only reproduce this when I load nvidia-drm before amdgpu. Can you test and see if you see the same? Maybe by loading them manually just to verify, I don't know what order the rc.conf variable loads things in.

fwiw if I load amdgpu and then nvidia-drm it works fine.
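
To take the rc.conf ordering out of the picture, the modules can be loaded by hand in a known order. This is only a sketch: load_in_order and the KLDLOAD override are hypothetical names for this example, and a real run needs root with neither module loaded yet.

```sh
# Load kernel modules one at a time, in the order given, stopping at the first failure.
# KLDLOAD is a hypothetical override for dry runs; it defaults to the real kldload(8).
load_in_order() {
    for mod in "$@"; do
        ${KLDLOAD:-kldload} "$mod" || return 1
    done
}
```

`load_in_order amdgpu nvidia-drm` followed by `kldstat` shows the resulting order, while `KLDLOAD=echo load_in_order amdgpu nvidia-drm` only prints what would be loaded. On an Intel machine the first module would be i915kms instead of amdgpu.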

vishwin commented 4 months ago

nvidia-drm has always been loaded after i915kms takes over the framebuffer from UEFI, as shown with the LinuxKPI I2C lines.

amshafer commented 4 months ago

One thing you can check while I keep looking at this is the contents of /usr/local/share/X11/xorg.conf.d/20-nvidia-drm-outputclass.conf and (if it exists) /usr/local/share/X11/xorg.conf.d/10-intel.conf:

root@:~ # cat /usr/local/share/X11/xorg.conf.d/20-nvidia-drm-outputclass.conf
Section "OutputClass"
    Identifier "nvidia"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
    Option "PrimaryGPU" "yes"
    ModulePath "/usr/local/lib/nvidia/xorg"
    ModulePath "/usr/local/lib/xorg/modules"
EndSection
root@:~ # cat /usr/local/share/X11/xorg.conf.d/10-intel.conf
Section "OutputClass"
    Identifier "intel"
    MatchDriver "i915"
    Driver "modesetting"
    Option "PrimaryGPU" "yes"
EndSection

This is a working config for me on my Intel PRIME machine. I'm wondering if your setup switched when the .conf files were overwritten during the latest package update, setting the NVIDIA GPU as the primary. In that case you would see the black screen until you ran xrandr --auto. Note that if you do that right now or use an external monitor you'll still hit the panic I'm looking into.

You should be able to force Intel as the primary by ensuring Option "PrimaryGPU" "yes" is in the intel.conf, which you might have to create, as iirc it isn't installed by any package by default. Hopefully that helps.

vishwin commented 4 months ago

I have all of the above in xorg.conf.d/ except for Option "PrimaryGPU" "yes" under intel and specifying the nvidia module paths. Leaving them out worked in 535.146.02. Don't have access to the machine for another couple days so will update when I get back.

vishwin commented 4 months ago

Setting Option "PrimaryGPU" "yes" under intel allows X to continue bringing the displays/screens up, but this effectively becomes an Intel-only setup, as if the nvidia modules were never loaded. All rendering, GL providers, etc are done by intel via Mesa.

In 535.146.02, I never had to run any xrandr command for the nvidia (headless) to handle rendering whilst intel handled display. On this version, when trying to execute the recommended xrandr commands at any point, with nvidia as PrimaryGPU:

% xrandr --setprovideroutputsource modesetting NVIDIA-0
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  140 (RANDR)
  Minor opcode of failed request:  35 (RRSetProviderOutputSource)
  Value in failed request:  0x217
  Serial number of failed request:  16
  Current serial number in output stream:  17
% xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x217 cap: 0x0 crtcs: 0 outputs: 0 associated providers: 0 name:NVIDIA-0
Provider 1: id: 0x241 cap: 0xf, Source Output, Sink Output, Source Offload, Sink Offload crtcs: 3 outputs: 8 associated providers: 0 name:modesetting

Note that with intel as PrimaryGPU:

% xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x49 cap: 0xf, Source Output, Sink Output, Source Offload, Sink Offload crtcs: 3 outputs: 8 associated providers: 0 name:modesetting
Provider 1: id: 0x2c7 cap: 0x0 crtcs: 0 outputs: 0 associated providers: 0 name:NVIDIA-G0
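
In both listings above the NVIDIA provider reports cap: 0x0 while modesetting reports 0xf, and only the order and names change. To compare runs at a glance, a small filter can reduce each provider line to its index, name and capability mask. This is a sketch that assumes the line layout shown above; summarize_providers is a made-up name.

```sh
# Read `xrandr --listproviders` output on stdin and print "index name cap" per provider.
# Assumes the field layout shown in this thread; summarize_providers is a hypothetical helper.
summarize_providers() {
    awk '/^Provider [0-9]+:/ {
        idx = $2; sub(/:$/, "", idx)            # "0:" -> "0"
        cap = ""; name = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "cap:") { cap = $(i + 1); sub(/,$/, "", cap) }
            if ($i ~ /^name:/) { name = $i; sub(/^name:/, "", name) }
        }
        print idx, name, cap
    }'
}
```

Usage would be `xrandr --listproviders | summarize_providers`, once with each PrimaryGPU setting.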

amshafer commented 4 months ago

Does it work with NVIDIA as the primary GPU if you run with xrandr --auto though? That's the missing bit for me, until I do that the laptop screen stays black. I don't know why that would suddenly be required again in 550, the logic for deciding this stuff in the X server can be wacky sometimes.

vishwin commented 4 months ago

xrandr --auto didn't do anything, so no.

amshafer commented 4 months ago

Okay so that's different to what I've seen then. Out of curiosity in PrimaryGPU intel mode does running things on the NVIDIA GPU through the prime env variables work? i.e. something like:

$ __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep vendor
server glx vendor string: NVIDIA Corporation
client glx vendor string: NVIDIA Corporation
OpenGL vendor string: NVIDIA Corporation

Sorry for all the requests, since I don't reproduce exactly what you're seeing I'm just trying to figure out what's working.

vishwin commented 4 months ago

glxinfo with those environment variables worked. But of course I don't want to keep passing them.
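
One way to avoid typing the variables each time, without exporting them for the whole session, is a small wrapper function. This is a sketch only: prime_run is an illustrative name rather than something shipped by the driver, and it does not restore the automatic behaviour from 535.146.02.

```sh
# Run a single command with the PRIME render offload variables set,
# leaving the rest of the session untouched.
# prime_run is a hypothetical name chosen for this example.
prime_run() {
    __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia "$@"
}
```

For example, `prime_run glxinfo | grep vendor` should print the same three NVIDIA vendor strings as the command above.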

amshafer commented 3 months ago

There are issues with the prebuilt nvidia-drm pkg, is that what you are using? Or are you building from ports? If you're not building from ports can you give that a try?

related: https://reviews.freebsd.org/D44308

vishwin commented 3 months ago

all only ever built from ports

vishwin commented 3 months ago

fwiw adding __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia to .xprofile as a test to forcibly mimic the old behaviour results in generally unusable rendered results. Even alacritty (GPU-accelerated terminal) results in a black (unrendered) window.

vishwin commented 3 months ago

D44308 allows X startup to continue and eventually return to the old behaviour from 535.146.02. However, rendering is a bit glitchy, occasionally showing the immediately preceding frames, especially around the refresh rate, such as when watching high frame rate video or typing fast.

amshafer commented 3 months ago

Some progress is good. What desktop env/etc is this with? Also what drm-kmod version are you using?

vishwin commented 3 months ago

Latest -CURRENT so latest drm-61-kmod due to the API change. Desktop is Cinnamon, which I've been needing to update for some time, especially recently as muffin has been sus.

amshafer commented 3 months ago

I still haven't been able to reproduce any of the misrendering issues which is odd. I'll have to give Cinnamon a try.

Can you include the conftest results from 535 and 550 if possible? Just to check that nothing obvious went wrong with the compatibility detection. Something like cat work/NVIDIA.../(nvidia for 535)/src/nvidia-drm/conftest/* should grab the function.h, type.h, etc that get generated during the build
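
To make the two sets of results easy to compare, each conftest directory can be dumped to a single text file with a header per file. This is a minimal sketch: dump_conftest is a hypothetical helper, and the directory is passed in by the caller because the exact path under work/ depends on the port build.

```sh
# Print every generated file in a conftest directory, each preceded by its name,
# so the 535 and 550 results can be saved and compared.
# dump_conftest is a hypothetical helper; the caller supplies the directory.
dump_conftest() {
    dir="$1"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip subdirectories
        printf '==> %s <==\n' "$(basename "$f")"
        cat "$f"
    done
}
```

Redirecting the output for each version into its own file (say conftest-535.txt and conftest-550.txt, names chosen here for illustration) allows a plain `diff -u` between the two.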

vishwin commented 1 month ago

Finally back on the target machine; latest upstream Cinnamon (not in ports yet) still rendering glitchy with occasional falls off the bus. More pronounced with a multiple-screen setup. Let me see if I can get the conftest

amshafer commented 1 month ago

Wait so with 535 everything works fine (including no glitching) but with 550 it falls off the bus? That's very odd, usually falling off the bus is indicative of some kind of power issue? I'd double check that 535 doesn't also fall off the bus in order to confirm if there's a regression in 550.

Not to prematurely blame Cinnamon, but it would be interesting to see if your glitchy rendering happens on xfce4 as well. If xfce also shows the glitching and it doesn't happen with 535 I'd take that as confirmation that something is wrong with nvidia-drm.

vishwin commented 1 month ago

535 does not suffer from glitchy rendering but also falls off the bus occasionally. However, the glitchiness isn't really noticeable on a single screen setup, like just the laptop display, but is certainly pronounced with multiple screens like my laptop display + external monitor.

The falling off the bus seems to trigger randomly, mostly on pure GTK programs, particularly simpler dialog box or settings-type stuff, as if it is struggling to render something that shouldn't need much effort to draw. Specifically, I've had it happen when scrolling through a settings dialog, when clicking a button that I can't release because the GPU falls off the bus right there, and also a couple of times when just rendering a PDF/image preview in the file manager. Could it have to do with compositing? I'm dubious about power issues as the GPU itself is headless and not exactly replaceable, and these have all happened whilst plugged in.

vishwin commented 1 month ago

Won't be able to properly test xfce until after returning from BSDCan and SELF mid-next month because the external monitor will not be available for those.

amshafer commented 1 month ago

> 535 does not suffer from glitchy rendering

Seems like I need to test with Cinnamon then. I don't think I've ever tried that before, although last time I looked into this issue it was with XFCE and I didn't see glitching there.

> but is certainly pronounced with multiple screens like my laptop display + external monitor.

What is the glitching like? Color corruption or tearing or something else? Normally I'd say something like this is an issue with the compositor but since it doesn't happen on 535 it sounds like something triggered by nvidia-drm.

The falling off the bus still seems unrelated and, like I said, is normally something to do with power. Even if it's plugged in, I think power normally still goes through the battery, which can go bad, but you might be able to disable the battery completely and then test, if your laptop BIOS allows it.

vishwin commented 1 month ago

The glitchiness is not so much tearing (which I always expect) as a laggy refresh rate and momentary displays of previous frames. It is most pronounced when viewing a 60 fps video on a 60 Hz display.

I no longer have an internal battery, so I disconnected the external battery. We'll see what happens.

vishwin commented 1 month ago

Just experienced a falling off the bus without the battery.

amshafer commented 1 month ago

Any ACPI or other power messages in dmesg before it falls off the bus?
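
One way to look for such messages is to print the kernel log lines leading up to each bus drop. This is a sketch under one loud assumption: that the driver logs the event with the words "fallen off the bus"; if the wording in dmesg differs, the pattern needs adjusting. bus_drop_context is a made-up name.

```sh
# Read kernel log text on stdin and print each "fallen off the bus" line
# together with the lines that preceded it (default: 20 lines of context).
# The match string is an assumption about how the driver words the event.
bus_drop_context() {
    grep -i -B "${1:-20}" 'fallen off the bus'
}
```

Usage would be `dmesg | bus_drop_context 30`, then scanning the context for ACPI or power-related lines.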

The laggy frames do sound interesting; that could conceivably be explained by nvidia-drm. Last time I tried reproducing with simple programs, so I'll try with a fullscreen video.

vishwin commented 1 month ago

> Any ACPI or other power messages in dmesg before it falls off the bus?

never