Corruption on NVIDIA

sfjohnson commented 10 months ago

Hello,

Thank you for this library. I've been having some issues on NVIDIA. I'm using the latest proprietary driver version on Arch, 545.29.06.

First, I believe there is currently a bug in the driver where it won't accept any flags for gbm_surface_create, so changing the flags argument to 0 fixes an ENOSYS error. I believe it's the same issue as here.

Once that is sorted, the issue is quite strange. The shader starts off rendering at 60 FPS and then after a few seconds the image starts to corrupt, the CPU usage rises and the framerate drops. The corruption looks like the pixels are drawn in the wrong order, or a bit like heavy video compression.

I've set the srm-basic example to render for a few seconds and then cleanup gracefully, but the issue persists across launches of srm-basic. It seems to be related to EGL/GBM as I ran this code where the CPU writes directly to the framebuffer and it works fine.

It appears that GBM_BO_USE_SCANOUT is always set on gbm_surface_create, so it's possible my issue is caused by GBM_BO_USE_RENDERING no longer being set.

I'm not having much luck finding documentation on the GBM API so if you have any information that would be appreciated.

ehopperdietzel commented 10 months ago

Hello, thanks for letting me know. Could you please send me the output generated by drm_info, srm-display-info, and srm-all-connectors?

$ drm_info > drm_info_log.txt 2>&1
$ srm-display-info > srm_display_info_log.txt 2>&1
$ SRM_DEBUG=4 SRM_EGL_DEBUG=4 srm-all-connectors > srm_all_connectors_log.txt 2>&1

I believe the FPS drop and glitches could be attributed to the driver's exclusive support for the atomic DRM API (I had a similar issue with an Nvidia card and proprietary drivers some time ago), and I suspect there might be a call to the legacy API that could be causing this. I will check this out.

sfjohnson commented 10 months ago

For more context it's an Apple Studio Display connected to a PC by a DP to USB-C reverse cable. I'm also happy to test anything else on my hardware if you need.

ehopperdietzel commented 10 months ago

Thank you, I've already downloaded them. I'm going to see what might be happening.

ehopperdietzel commented 10 months ago

I ran several tests on my Nvidia card yesterday, and I'm experiencing the same issue. Interestingly, it didn't occur before, possibly due to a kernel or nvidia-drm version change. When using only dumb buffers, everything is fine, but with OpenGL/EGL, it starts at 60 fps and drops to 15 fps after a few seconds, and I don't quite understand why. I also tested kmscube, and the same issue occurs, so I suspect it's a driver issue. Could you try kmscube and see if it works well for you? I have another suspicion that when using OpenGL, it might be using a software-based rendering backend, swrast.so. Perhaps there's another backend that supports acceleration that can be installed. I'll continue investigating.
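One way to check the swrast suspicion without a profiler is to look at which shared libraries the running process has mapped. The following is only a minimal sketch: it assumes a Linux /proc filesystem, and the function names and the list of name fragments in NEEDLES are hypothetical choices for illustration, not part of SRM or of any driver.

```python
import os

# Name fragments worth flagging. "swrast" is the software rasterizer
# suspected above; the other entries are assumptions for illustration.
NEEDLES = ("swrast", "libnvidia", "libgbm", "libEGL")


def mapped_libraries(maps_text):
    """Return the sorted, de-duplicated basenames of files mapped by a process.

    maps_text is the content of /proc/<pid>/maps. Each line has the form
    "address perms offset dev inode pathname"; the pathname is optional and
    only entries starting with "/" refer to real files.
    """
    libs = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[5].startswith("/"):
            libs.add(os.path.basename(fields[5].strip()))
    return sorted(libs)


def graphics_libraries(maps_text, needles=NEEDLES):
    """Filter the mapped files down to those whose name contains a needle."""
    return [lib for lib in mapped_libraries(maps_text)
            if any(needle in lib for needle in needles)]


def read_maps(pid):
    """Read /proc/<pid>/maps for a running process (Linux only)."""
    with open(f"/proc/{pid}/maps") as f:
        return f.read()
```

With srm-basic or kmscube running, something like graphics_libraries(read_maps(pid)) would list the EGL/GBM-related libraries actually loaded. If no entry contains "swrast", the software rasterizer library is at least not mapped into the process, although that alone does not prove the rendering path is fully hardware accelerated.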

sfjohnson commented 10 months ago

I tried kmscube and it doesn't run; I'm getting "Invalid argument" on drmModeAddFB2(). I also tried removing flags from the gbm_bo_create() and gbm_surface_create() calls. Definitely some bad stuff going on with the latest NVIDIA driver.

I added a shader to srm-basic and it's putting a significant load on the CPU, so next I might profile it to see if it's jumping into the software renderer as you say.

sfjohnson commented 10 months ago

I ran perf and I don't see swrast.so, but it looks like something is calling sched_yield() too much:

     8.86%  srm-basic     [kernel.vmlinux]                [k] pick_next_task_fair
     6.17%  srm-basic     [vdso]                          [.] __vdso_clock_gettime
     4.41%  srm-basic     [kernel.vmlinux]                [k] __schedule
     4.35%  srm-basic     [kernel.vmlinux]                [k] __update_curr
     3.85%  srm-basic     [kernel.vmlinux]                [k] do_sched_yield
     3.60%  srm-basic     [kernel.vmlinux]                [k] psi_account_irqtime
     3.50%  srm-basic     [kernel.vmlinux]                [k] srso_alias_return_thunk
     3.20%  srm-basic     [kernel.vmlinux]                [k] entry_SYSCALL_64
     2.77%  srm-basic     [kernel.vmlinux]                [k] srso_alias_safe_ret
     2.65%  srm-basic     [kernel.vmlinux]                [k] __pick_eevdf
     2.54%  srm-basic     [kernel.vmlinux]                [k] pick_next_entity.isra.0
     2.33%  srm-basic     [kernel.vmlinux]                [k] __cgroup_account_cputime
     2.32%  srm-basic     [kernel.vmlinux]                [k] raw_spin_rq_lock_nested
     2.31%  srm-basic     [kernel.vmlinux]                [k] preempt_count_add
     2.14%  srm-basic     [kernel.vmlinux]                [k] syscall_exit_to_user_mode
     2.06%  srm-basic     [kernel.vmlinux]                [k] do_syscall_64
     1.87%  srm-basic     [kernel.vmlinux]                [k] rcu_note_context_switch
     1.81%  srm-basic     [kernel.vmlinux]                [k] schedule
     1.47%  srm-basic     [kernel.vmlinux]                [k] update_rq_clock
     1.44%  srm-basic     [kernel.vmlinux]                [k] _raw_spin_lock
     1.42%  srm-basic     [kernel.vmlinux]                [k] sched_clock_cpu
     1.38%  srm-basic     [kernel.vmlinux]                [k] native_sched_clock
     1.32%  srm-basic     libc.so.6                       [.] __sched_yield
     1.16%  srm-basic     [kernel.vmlinux]                [k] preempt_count_sub
     1.15%  srm-basic     [kernel.vmlinux]                [k] sched_clock
     1.10%  srm-basic     [kernel.vmlinux]                [k] _raw_spin_unlock
     1.07%  srm-basic     libnvidia-eglcore.so.545.29.06  [.] 0x0000000000afe6d7
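To put numbers on "too much", the listing can be totalled per symbol or per kernel/user space. The sketch below is an assumption-laden helper, not a perf feature: it assumes the column layout shown above (overhead, command, shared object, [k] or [.] marker, symbol), and the function names are hypothetical. SAMPLE contains five lines copied from the listing.

```python
import re

# One line of the listing: overhead %, command, shared object,
# a [k] (kernel) or [.] (user) marker, then the symbol or raw address.
LINE_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+(\S+)\s+\[(.)\]\s+(.+?)\s*$"
)

SAMPLE = """\
     8.86%  srm-basic     [kernel.vmlinux]                [k] pick_next_task_fair
     6.17%  srm-basic     [vdso]                          [.] __vdso_clock_gettime
     3.85%  srm-basic     [kernel.vmlinux]                [k] do_sched_yield
     1.32%  srm-basic     libc.so.6                       [.] __sched_yield
     1.07%  srm-basic     libnvidia-eglcore.so.545.29.06  [.] 0x0000000000afe6d7
"""


def parse_perf_lines(text):
    """Return (percent, command, dso, kind, symbol) tuples for matching lines."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            pct, comm, dso, kind, symbol = match.groups()
            rows.append((float(pct), comm, dso, kind, symbol))
    return rows


def share_by_space(rows):
    """Sum the overhead spent in kernel ([k]) versus user (any other marker)."""
    totals = {"kernel": 0.0, "user": 0.0}
    for pct, _comm, _dso, kind, _symbol in rows:
        totals["kernel" if kind == "k" else "user"] += pct
    return totals


def share_matching(rows, needle):
    """Sum the overhead of every symbol whose name contains needle."""
    return sum(pct for pct, _c, _d, _k, symbol in rows if needle in symbol)


if __name__ == "__main__":
    rows = parse_perf_lines(SAMPLE)
    print(share_by_space(rows))
    print(share_matching(rows, "sched_yield"))
```

Running the same functions over the full listing would show how much of the sampled time sits in scheduler-related kernel symbols compared with the NVIDIA EGL library itself.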

ehopperdietzel commented 9 months ago

Interesting, and when you run srm-all-connectors, do you see the pixelated texture in the background and the white square cursor plane moving? If allocation through GBM fails, it should fall back to OpenGL.

I'm giving up on this for now. I've tried everything to understand what's happening, but with no luck. I also noticed that even with dumb buffers, the FPS drops after a few seconds if I write too many times to the mapped buffer. If I write only a few times, the FPS never drops. The curious thing is that the writing time of the dumb buffers increases, but so does the time when vblank events are emitted. Hence, I suspect that any interaction with the driver is likely slowed down by some internal bug.

For now, my only recommendation is to use nouveau, which apparently works quite well. In any case, if I manage to solve this issue, I'll keep you informed here.
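For anyone re-testing with a newer driver, the moment the slowdown starts can be pinned down from a list of vblank or page-flip timestamps. This is a minimal sketch with hypothetical function names. It assumes the timestamps have already been collected in seconds, for example from the tv_sec/tv_usec values passed to a libdrm page-flip handler, and that the nominal refresh rate is 60 Hz.

```python
def intervals_ms(timestamps_s):
    """Time between consecutive events, in milliseconds."""
    return [(b - a) * 1000.0
            for a, b in zip(timestamps_s, timestamps_s[1:])]


def average_fps(timestamps_s):
    """Average event rate over the whole list, or 0.0 if it cannot be computed."""
    if len(timestamps_s) < 2:
        return 0.0
    span = timestamps_s[-1] - timestamps_s[0]
    return (len(timestamps_s) - 1) / span if span > 0 else 0.0


def first_slow_frame(timestamps_s, refresh_hz=60.0, tolerance=1.5):
    """Index of the first event that arrived late, or None.

    An event counts as late when the interval since the previous one exceeds
    tolerance times the nominal frame period (1000 / refresh_hz ms).
    The tolerance value is an arbitrary assumption.
    """
    limit_ms = tolerance * 1000.0 / refresh_hz
    for index, dt in enumerate(intervals_ms(timestamps_s), start=1):
        if dt > limit_ms:
            return index
    return None
```

Comparing first_slow_frame() across runs with different amounts of buffer writes would show whether the drop always starts after a similar number of frames, which might help narrow down what triggers it in the driver.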

sfjohnson commented 9 months ago

srm-all-connectors did show the background and cursor as you describe.

Thanks for your investigation. My software uses both OpenGL and CUDA so I believe I will still need the proprietary driver. Hopefully we will get a new driver version from NVIDIA soon and we can re-test.

CuarzoSoftware / SRM

Corruption on NVIDIA #10