elFarto / nvidia-vaapi-driver

A VA-API implementation using NVIDIA's NVDEC

Failure after suspend/resume? #253

Open bmartin427 opened 9 months ago

bmartin427 commented 9 months ago

I have acceleration working fine on my media PC, as long as I try it soon after boot. However, I suspend this PC between uses, and acceleration never works following such a cycle until I reboot. Every other GPU function I've tested continues working after the failure: OpenGL, VDPAU, etc. are all fine. Hardware is a GeForce GT 1030, OS is Ubuntu 22.04, NVIDIA driver version is 535.113.01, and nvidia-vaapi-driver version is git 0a924c.

The first time I try running vainfo after a resume, I get:

$ NVD_LOG=1 NVD_BACKEND=egl vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
      4007.609815912 [1538-1538] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4007.609902233 [1538-1538] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4007.609961457 [1538-1538] ../src/vabackend.c:2203       __vaDriverInit_1_0 Selecting EGL backend
      4007.624392478 [1538-1538] ../src/export-buf.c: 132       findGPUIndexFromFd Defaulting to CUDA GPU ID 0. Use NVD_GPU to select a specific CUDA GPU
      4007.624415595 [1538-1538] ../src/export-buf.c: 149       findGPUIndexFromFd Looking for GPU index: 0
      4007.627540148 [1538-1538] ../src/export-buf.c: 161       findGPUIndexFromFd Found 3 EGL devices
      4007.628336459 [1538-1538] ../src/export-buf.c: 170       findGPUIndexFromFd Got EGL_CUDA_DEVICE_NV value '0' for EGLDevice 0
      4007.628348471 [1538-1538] ../src/export-buf.c: 191       findGPUIndexFromFd Selecting EGLDevice 0
      4007.630274926 [1538-1538] ../src/export-buf.c: 260         egl_initExporter Driver supports 16-bit surfaces
      4007.631365261 [1538-1538] ../src/vabackend.c:2236       __vaDriverInit_1_0 CUDA ERROR 'unknown error' (999)

      4007.631377762 [1538-1538] ../src/export-buf.c:  61      egl_releaseExporter Releasing exporter, 0 outstanding frames
      4007.631391172 [1538-1538] ../src/export-buf.c:  78      egl_releaseExporter Done releasing frames
libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed

Also, the following lines appear in dmesg during that first vainfo query:

[ 4007.631181] NVRM: GPU at PCI:0000:01:00: GPU-cd29aa0b-44a2-8266-14a3-1f03d08167a1
[ 4007.631188] NVRM: Xid (PCI:0000:01:00): 31, pid=538, name=modprobe, Ch 00000002, intr 10000000. MMU Fault: ENGINE HOST6 HUBCLIENT_HOST faulted @ 0x1_01011000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
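
To check quickly whether a failed vainfo run coincided with a driver fault like the one above, the kernel log can be filtered for Xid reports. The following is a minimal sketch: the function name xid_codes is made up for illustration, and the pattern assumes the "NVRM: Xid (PCI:...): <code>," layout seen in the dmesg lines quoted here, which may differ between driver versions.

```shell
#!/bin/sh
# xid_codes (hypothetical helper): read kernel log text on stdin and print
# the numeric Xid code of every "NVRM: Xid (...): <code>, ..." line.
# Lines that do not match this layout are silently skipped.
xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}
```

Typical use would be `dmesg | xid_codes` (reading the kernel log may require root on some systems). For the log above this prints the code 31, which the message itself describes as an MMU fault.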

Subsequent calls to vainfo produce no more dmesg output, and the console output changes somewhat:

$ NVD_LOG=1 NVD_BACKEND=egl vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
      4812.162229425 [2037-2037] ../src/vabackend.c: 138                     init CUDA ERROR 'unknown error' (999)

libva info: Found init function __vaDriverInit_1_0
      4812.162304641 [2037-2037] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4812.162318470 [2037-2037] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4812.162330552 [2037-2037] ../src/vabackend.c:2203       __vaDriverInit_1_0 Selecting EGL backend
      4812.175101148 [2037-2037] ../src/export-buf.c: 132       findGPUIndexFromFd Defaulting to CUDA GPU ID 0. Use NVD_GPU to select a specific CUDA GPU
      4812.175124754 [2037-2037] ../src/export-buf.c: 149       findGPUIndexFromFd Looking for GPU index: 0
      4812.178137619 [2037-2037] ../src/export-buf.c: 161       findGPUIndexFromFd Found 3 EGL devices
      4812.180277494 [2037-2037] ../src/export-buf.c: 196       findGPUIndexFromFd No EGL_CUDA_DEVICE_NV support for EGLDevice 0
      4812.180296001 [2037-2037] ../src/export-buf.c: 196       findGPUIndexFromFd No EGL_CUDA_DEVICE_NV support for EGLDevice 1
      4812.180308433 [2037-2037] ../src/export-buf.c: 199       findGPUIndexFromFd No DRM device file for EGLDevice 2
      4812.180317372 [2037-2037] ../src/export-buf.c: 202       findGPUIndexFromFd No match found, falling back to default device
      4812.180326521 [2037-2037] ../src/vabackend.c:2231       __vaDriverInit_1_0 Exporter failed
libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit

I have tried direct backend instead of egl, and get no different results, aside from some slightly different error text.

I'm not 100% certain the suspend and resume is the cause. I attempted a quick suspend/resume cycle in order to troubleshoot this problem and was unable to reproduce it, but it always happens if I leave the PC suspended for a normal amount of time (hours). So possibly something else about the elapsed time is involved.

I have also tried leaving Firefox running during a suspend/resume, thinking that acceleration might continue to function if I didn't have to repeat the initialization process. However, Firefox seems to explode immediately upon resume, so this is not an option.

bmartin427 commented 9 months ago

For reference here's a session using the direct backend. The first query was before a suspend/resume, the latter two were after.

brad@fx2:~$ NVD_LOG=1 NVD_BACKEND=direct vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
      4089.149695354 [3287-3287] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4089.149724484 [3287-3287] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4089.149746525 [3287-3287] ../src/vabackend.c:2206       __vaDriverInit_1_0 Selecting Direct backend
      4089.163510502 [3287-3287] ../src/direct/direct-export-buf.c:  85      direct_initExporter Found NVIDIA GPU 0 at /dev/dri/renderD128
      4089.163532980 [3287-3287] ../src/direct/nv-driver.c: 223            init_nvdriver Initing nvdriver...
      4089.163541389 [3287-3287] ../src/direct/nv-driver.c: 228            init_nvdriver Got dev info: 100 1 0 fe
      4089.163612291 [3287-3287] ../src/direct/nv-driver.c: 246            init_nvdriver NVIDIA kernel driver version: 535.113.01, major version: 535
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.14 (libva 2.12.0)
vainfo: Driver version: VA-API NVDEC driver [direct backend]
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileVP9Profile0            : VAEntrypointVLD
      VAProfileHEVCMain10             : VAEntrypointVLD
      VAProfileHEVCMain12             : VAEntrypointVLD
      VAProfileVP9Profile2            : VAEntrypointVLD
      4089.308220963 [3287-3287] ../src/vabackend.c:2081              nvTerminate Terminating 0x55933e7e4d40
      4089.308325527 [3287-3287] ../src/vabackend.c:2095              nvTerminate Now have 0 (0 max) instances
brad@fx2:~$ NVD_LOG=1 NVD_BACKEND=direct vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
      4221.457787068 [3540-3540] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4221.457808648 [3540-3540] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4221.457820940 [3540-3540] ../src/vabackend.c:2206       __vaDriverInit_1_0 Selecting Direct backend
      4221.472699819 [3540-3540] ../src/direct/direct-export-buf.c:  85      direct_initExporter Found NVIDIA GPU 0 at /dev/dri/renderD128
      4221.472724892 [3540-3540] ../src/direct/nv-driver.c: 223            init_nvdriver Initing nvdriver...
      4221.472737114 [3540-3540] ../src/direct/nv-driver.c: 228            init_nvdriver Got dev info: 100 1 0 fe
      4221.472851581 [3540-3540] ../src/direct/nv-driver.c: 246            init_nvdriver NVIDIA kernel driver version: 535.113.01, major version: 535
      4221.474599881 [3540-3540] ../src/vabackend.c:2236       __vaDriverInit_1_0 CUDA ERROR 'unknown error' (999)

libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit
brad@fx2:~$ NVD_LOG=1 NVD_BACKEND=direct vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
      4226.566012274 [3543-3543] ../src/vabackend.c: 138                     init CUDA ERROR 'unknown error' (999)

libva info: Found init function __vaDriverInit_1_0
      4226.566085396 [3543-3543] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4226.566098805 [3543-3543] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4226.566110469 [3543-3543] ../src/vabackend.c:2206       __vaDriverInit_1_0 Selecting Direct backend
      4226.578729192 [3543-3543] ../src/direct/direct-export-buf.c:  85      direct_initExporter Found NVIDIA GPU 0 at /dev/dri/renderD128
      4226.578750354 [3543-3543] ../src/direct/nv-driver.c: 223            init_nvdriver Initing nvdriver...
      4226.578759782 [3543-3543] ../src/direct/nv-driver.c: 228            init_nvdriver Got dev info: 100 1 0 fe
      4226.578826339 [3543-3543] ../src/direct/nv-driver.c: 246            init_nvdriver NVIDIA kernel driver version: 535.113.01, major version: 535
      4226.578960222 [3543-3543] ../src/direct/direct-export-buf.c:  23       findGPUIndexFromFd CUDA ERROR 'initialization error' (3)

      4226.578971746 [3543-3543] ../src/vabackend.c:2236       __vaDriverInit_1_0 CUDA ERROR 'initialization error' (3)

libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit

I also have the same two dmesg lines as before.

rcoacci commented 9 months ago

I'm seeing something related to this, but in my case Firefox crashes upon resuming. I've just disabled nvidia-vaapi-driver completely and will see if the crashes continue. I've also tried setting up NVIDIA's PreserveVideoMemoryAllocations, but it made gnome-shell impossible to use after resume (which is even worse...).

elFarto commented 9 months ago

Unfortunately, this is an issue with the NVIDIA driver, and there's not much I can do about it. The driver really doesn't like having any sort of NVDEC context left active over a suspend/resume; that breaks the driver until a reboot is done.
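
One rough way to see what might still be holding the GPU before suspending is to list the processes that have an NVIDIA device node open. The sketch below scans /proc on Linux; the function name gpu_users and the default /dev/nvidia prefix are assumptions for illustration. An open device node is only a proxy: it does not prove that an NVDEC context exists, and processes owned by other users are only visible when this is run as root.

```shell
#!/bin/sh
# gpu_users (hypothetical helper): print the PIDs of processes that have a
# file open whose path starts with the given prefix (default: /dev/nvidia).
# Relies on the Linux /proc/<pid>/fd symlinks; unreadable entries are skipped.
gpu_users() {
    prefix="${1:-/dev/nvidia}"
    for fd in /proc/[0-9]*/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            "$prefix"*)
                pid=${fd#/proc/}      # strip the leading "/proc/"
                echo "${pid%%/*}"     # keep only the PID component
                ;;
        esac
    done | sort -un
}
```

Running `gpu_users` with no argument before a suspend shows which PIDs would need to exit to release the device nodes. Whether closing them is enough to avoid the failure is not established in this thread.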

bmartin427 commented 9 months ago

Hmm. If firefox is closed before I suspend, then is there anything else I can do to prevent NVDEC context from being left active? Is there something else I need to explicitly kill, or is it really just that I've ever used it at all?

hhfeuer commented 8 months ago

Known issue of the NVIDIA driver. After suspend/resume, the nvidia-uvm module is defunct even if not used. The workaround is to unload and reload it.

mikejaques commented 8 months ago

Can confirm this. I wrote up a specific "how to" for Pop!_OS users just yesterday, but after resume from suspend HW acceleration in Firefox is broken. Only a reboot fixes it. I haven't tried unloading/reloading but that's not really a solution for the average user.

A question: it's a "known issue" with the NVIDIA driver, but is there any actual confirmation or bug tracking within NVIDIA as a company? Does this bug affect Wayland, or only X11? I ask because I'm only moderately knowledgeable about Linux, with nearly ZERO experience with Wayland, so I don't know whether Wayland even requires a VA-API layer for hardware-accelerated video decoding.

elFarto commented 7 months ago

I'm not sure if there's an actual NVIDIA bug for it. I've bumped the issue[1] in the NVIDIA forums and we'll see if we get a response.

[1] https://forums.developer.nvidia.com/t/xid-31-after-wakeup-from-sleep/139870/6

MageSlayer commented 6 months ago

I'm having the same issue on a laptop with a secondary NVIDIA card in a PRIME configuration. Hardware acceleration fails after resume from suspend.

$ NVD_LOG=1 NVD_BACKEND=direct vainfo
libva info: VA-API version 1.20.0
libva error: vaGetDriverNames() failed with unknown libva error
libva info: User environment variable requested driver 'nvidia'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
    135775.283643377 [30120-30120] ../src/vabackend.c: 130                     init CUDA ERROR 'unknown error' (999)

libva info: Found init function __vaDriverInit_1_0
    135775.283662988 [30120-30120] ../src/vabackend.c:2145       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
    135775.283665133 [30120-30120] ../src/vabackend.c:2154       __vaDriverInit_1_0 Now have 0 (0 max) instances
    135775.283667649 [30120-30120] ../src/vabackend.c:2180       __vaDriverInit_1_0 Selecting Direct backend
    135775.286633777 [30120-30120] ../src/backend-common.c:  31            isNvidiaDrmFd Invalid driver for DRM device: i915
    135775.286665005 [30120-30120] ../src/direct/direct-export-buf.c:  85      direct_initExporter Found NVIDIA GPU 0 at /dev/dri/renderD129
    135775.286668121 [30120-30120] ../src/direct/nv-driver.c: 246            init_nvdriver Initing nvdriver...
    135775.286683125 [30120-30120] ../src/direct/nv-driver.c: 264            init_nvdriver NVIDIA kernel driver version: , major version: 0, minor version: 0
    135775.286685882 [30120-30120] ../src/direct/nv-driver.c: 271            init_nvdriver Got dev info: 100 1 2 6
    135775.286771896 [30120-30120] ../src/direct/direct-export-buf.c:  23       findGPUIndexFromFd CUDA ERROR 'initialization error' (3)

    135775.286774654 [30120-30120] ../src/vabackend.c:2210       __vaDriverInit_1_0 CUDA ERROR 'initialization error' (3)

libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit

Reloading nvidia-uvm solves the issue:

# rmmod nvidia-uvm
# modprobe nvidia-uvm
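
The two commands above could be run automatically on resume. The following is a minimal sketch of a systemd sleep hook, assuming a systemd-based system where executables in /usr/lib/systemd/system-sleep/ are called with "pre" or "post" as the first argument. The file name, the function names, and the RMMOD/MODPROBE override variables are made up for illustration, and rmmod will still fail if some process is using nvidia-uvm at that moment.

```shell
#!/bin/sh
# Hypothetical hook, e.g. /usr/lib/systemd/system-sleep/reload-nvidia-uvm
# systemd runs sleep hooks as root with $1 = pre|post and $2 = the sleep action.

# Command names can be overridden (an assumption of this sketch, handy for dry runs).
RMMOD="${RMMOD:-rmmod}"
MODPROBE="${MODPROBE:-modprobe}"

reload_uvm() {
    # rmmod fails if the module is busy or not loaded; report that,
    # but still try modprobe so the module ends up loaded if possible.
    "$RMMOD" nvidia-uvm || echo "rmmod nvidia-uvm failed (module busy or not loaded?)" >&2
    "$MODPROBE" nvidia-uvm
}

main() {
    case "${1:-}" in
        post) reload_uvm ;;   # only act after resume
    esac
}

main "$@"
```

The hook would need to be executable and owned by root. This only automates the workaround reported here; it does not address the underlying driver problem.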

mirh commented 6 months ago

Aren't standby problems related to the stuff discussed in #182? And isn't it all fixed in 545+?

MageSlayer commented 6 months ago

Last time I tried some 535 driver, it refused to decrease cooler speed after some video playback. My laptop sounded like a jet-plane & never stopped unless rebooted.

I'll try 545 this time. Thanks for suggestion.

MageSlayer commented 6 months ago

I checked 545.23.08 version and looks like they've fixed both cooler speed & hw acceleration after suspend/resume issues.

I think the issue might be closed now.

MageSlayer commented 5 months ago

> I checked 545.23.08 version and looks like they've fixed both cooler speed & hw acceleration after suspend/resume issues.
>
> I think the issue might be closed now.

Looks like I was too quick. The suspend/resume hw acceleration bug is still there in driver 545.23.08. vainfo emits an error and Firefox acceleration is missing after the 3rd or 4th resume from suspend.

strahe commented 2 months ago

This bug is still there in driver 550.78.

strahe commented 1 month ago

I am using Arch Linux; the instructions here solved my problem. I hope they will be useful to you.