NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Chromium GPU Process Cannot Start #644

Open pravinxor opened 6 months ago

pravinxor commented 6 months ago

NVIDIA Open GPU Kernel Modules Version

555.42.02

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Arch Linux

Kernel Release

Linux 6.9.1-hardened1-1-hardened #1 SMP PREEMPT_DYNAMIC Mon, 20 May 2024 12:54:08 +0000 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 4060 Laptop GPU (UUID: GPU-57e1b957-4845-a325-50fb-12cb069295cd)

Describe the bug

When starting Chromium (or any Chromium-based program) with the --ozone-platform=wayland flag, the Chromium GPU process cannot start, so hardware acceleration is completely unavailable, even if the browser is not tasked with performing the hardware acceleration on the Nvidia GPU.

Relevant parts of the Chromium event log:

[1182:1182:0521/234950.186727:ERROR:gl_display.cc(520)] : EGL Driver message (Critical) : eglCreateImage failed with 0x00003003
[1182:1182:0521/234950.186814:ERROR:scoped_egl_image.cc(23)] : Failed to create EGLImage: EGL_BAD_ALLOC
[1182:1182:0521/234950.187022:ERROR:native_pixmap_egl_binding.cc(113)] : Unable to initialize binding from pixmap
[1182:1182:0521/234950.187082:ERROR:ozone_image_backing.cc(365)] : OzoneImageBacking::ProduceSkiaGanesh failed to create GL representation
[1182:1182:0521/234950.187126:ERROR:shared_image_manager.cc(232)] : SharedImageManager::ProduceSkia: Trying to produce a Skia representation from an incompatible backing: OzoneImageBacking
GpuProcessHost: The GPU process exited with code 8704.
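For readers decoding the numbers in this log, here is a minimal Python sketch. The error table follows the standard EGL header values. Treating 8704 as a raw POSIX wait status (exit status in the high byte) is an assumption, not something the log confirms, and the helper names are made up for illustration.

```python
# EGL error codes as defined in the Khronos EGL header (EGL/egl.h).
EGL_ERRORS = {
    0x3000: "EGL_SUCCESS",
    0x3001: "EGL_NOT_INITIALIZED",
    0x3002: "EGL_BAD_ACCESS",
    0x3003: "EGL_BAD_ALLOC",
    0x3004: "EGL_BAD_ATTRIBUTE",
    0x3005: "EGL_BAD_CONFIG",
    0x3006: "EGL_BAD_CONTEXT",
    0x3007: "EGL_BAD_CURRENT_SURFACE",
    0x3008: "EGL_BAD_DISPLAY",
    0x3009: "EGL_BAD_MATCH",
    0x300A: "EGL_BAD_NATIVE_PIXMAP",
    0x300B: "EGL_BAD_NATIVE_WINDOW",
    0x300C: "EGL_BAD_PARAMETER",
    0x300D: "EGL_BAD_SURFACE",
    0x300E: "EGL_CONTEXT_LOST",
}


def egl_error_name(code: int) -> str:
    """Map a numeric EGL error code to its symbolic name."""
    return EGL_ERRORS.get(code, f"unknown EGL error 0x{code:04X}")


def exit_status_from_wait_status(status: int) -> int:
    """Extract the exit status from a raw POSIX wait status.

    Assumption: the process exited normally and `status` is the raw
    value returned by wait(), with the exit status in bits 8..15.
    """
    return (status >> 8) & 0xFF


if __name__ == "__main__":
    print(egl_error_name(0x00003003))           # code from the eglCreateImage line
    print(exit_status_from_wait_status(8704))   # code from the GpuProcessHost line
```

Under that assumption, 0x00003003 is EGL_BAD_ALLOC (matching the second log line) and 8704 corresponds to an exit status of 34.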

To Reproduce

Note: hardware acceleration is active and performs correctly when Chromium is running via XWayland

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

mtijanic commented 6 months ago

Hi there. Are you certain about this bit:

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

[x] I confirm that this does not happen with the proprietary driver package.

I know it's easier to just tick that box than to report the bug to linux-bugs@nvidia.com or on the forums, but what you are effectively saying is that the bug is in the kernel modules (plausible) and that it is in the delta between Open and Proprietary. That delta in 555.xx is very very tiny, so I find it extremely unlikely. Please double-check, otherwise kernel engineers who monitor this tracker (which is for kernel module issues only) could waste time looking in the wrong place.

PS, it seems like in your testing you installed the old kernel module, but still kept the new userspace. This can cause all sorts of issues, so best get that fixed:

May 21 23:39:19 zephyrus kernel: NVRM: API mismatch: the client has the version 555.42.02, but
                                 NVRM: this kernel module has the version 550.78.  Please
                                 NVRM: make sure that this kernel module and all NVIDIA driver
                                 NVRM: components have the same version.
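A mismatch like the one above can be picked out of a kernel log automatically. The following is a rough Python sketch that only parses the message text quoted here; the function name and regular expression are illustrative, not part of any NVIDIA tool.

```python
import re

# Matches the two version numbers in the NVRM "API mismatch" message.
# The message spans several log lines, so DOTALL lets ".*?" cross newlines.
MISMATCH_RE = re.compile(
    r"API mismatch: the client has the version (?P<client>\d+(?:\.\d+)+), but"
    r".*?this kernel module has the version (?P<module>\d+(?:\.\d+)+)",
    re.DOTALL,
)


def find_api_mismatch(log_text):
    """Return (client_version, module_version), or None if no mismatch is logged."""
    match = MISMATCH_RE.search(log_text)
    if match is None:
        return None
    return match.group("client"), match.group("module")


if __name__ == "__main__":
    sample = (
        "May 21 23:39:19 zephyrus kernel: NVRM: API mismatch: the client has the version 555.42.02, but\n"
        "                                 NVRM: this kernel module has the version 550.78.  Please\n"
        "                                 NVRM: make sure that this kernel module and all NVIDIA driver\n"
        "                                 NVRM: components have the same version.\n"
    )
    print(find_api_mismatch(sample))
```

On a running system, the version of the loaded kernel module can also be read from /proc/driver/nvidia/version and compared with the installed userspace packages.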

pravinxor commented 6 months ago

Thanks for getting back to me, and sorry about the mismatch between the userspace and kernel drivers. I've sorted that out so that they're both on the same version, but the error still occurs. As for whether this is specific to the open kernel modules, I can confirm that the proprietary driver does work correctly. I've attached two sets of log files (open and proprietary kernel modules). Each set includes an nvidia bug report log as well as a report from Chromium. I'm happy to provide other information or perform debugging as well, if you believe it could help.

about-gpu-open.txt about-gpu-proprietary.txt nvidia-bug-report-open.log.gz nvidia-bug-report-proprietary.log.gz

mtijanic commented 6 months ago

Thanks for double-checking. That is very surprising to me. I don't see anything in the logs suggesting any meaningful difference, except maybe some external monitor unplugging. Were both tests run with the same monitors attached?

We'll try to repro this internally. It's very concerning that there's a functional difference here. Thanks!

pravinxor commented 6 months ago

Between the two tests I most recently posted, the display configuration was exactly the same. However, between those two tests and the first test I posted, one of the attached displays was different. I don't believe this is a significant factor, though, since the issue occurs regardless of the display configuration.

pravinxor commented 4 months ago

I just wanted to update this thread with a small change that has happened between then and now. The log messages from EGL appear a little different.

about-gpu-2024-06-26T18-24-54-455Z.txt nvidia-bug-report.log.gz

Hanssen0 commented 3 weeks ago

Adding some information that may be helpful: I reproduced this with the proprietary driver.

nvidia-dkms 560.35.03-18

Arch Linux

Linux Hanssen-Linux 6.11.5-arch1-1-g14 #1 SMP PREEMPT_DYNAMIC Sun, 27 Oct 2024 17:01:27 +0000 x86_64 GNU/Linux

GPU 0: NVIDIA GeForce RTX 4060 Laptop GPU (UUID: GPU-5737ef92-c92d-bbb1-c337-b01b9b5e7640)

[12784:12784:1028/033113.573410:ERROR:angle_platform_impl.cc(44)] ImageEGL.cpp:112 (operator()): eglCreateImage failed with 0x00003003
ERR: ImageEGL.cpp:112 (operator()): eglCreateImage failed with 0x00003003
[12784:12784:1028/033113.573596:ERROR:scoped_egl_image.cc(23)] Failed to create EGLImage: EGL_SUCCESS
[12784:12784:1028/033113.573804:ERROR:native_pixmap_egl_binding.cc(118)] Unable to initialize binding from pixmap
[12784:12784:1028/033113.573941:ERROR:ozone_image_backing.cc(309)] OzoneImageBacking::ProduceSkiaGanesh failed to create GL representation
[12784:12784:1028/033113.574008:ERROR:shared_image_manager.cc(255)] SharedImageManager::ProduceSkia: Trying to produce a Skia representation from an incompatible backing: OzoneImageBacking
[12784:12784:1028/033113.574139:ERROR:gpu_service_impl.cc(1161)] Exiting GPU process because some drivers can't recover from errors. GPU process will restart shortly.
[12735:12782:1028/033113.584699:ERROR:shared_image_interface_proxy.cc(134)] Buffer handle is null. Not creating a mailbox from it.
[12735:12782:1028/033113.584732:ERROR:one_copy_raster_buffer_provider.cc(348)] Creation of StagingBuffer's SharedImage failed.
[12735:12782:1028/033113.585115:ERROR:shared_image_interface_proxy.cc(134)] Buffer handle is null. Not creating a mailbox from it.
[12735:12782:1028/033113.585124:ERROR:one_copy_raster_buffer_provider.cc(348)] Creation of StagingBuffer's SharedImage failed.
[12735:12782:1028/033113.586152:ERROR:shared_image_interface_proxy.cc(134)] Buffer handle is null. Not creating a mailbox from it.
[12735:12782:1028/033113.586160:ERROR:one_copy_raster_buffer_provider.cc(348)] Creation of StagingBuffer's SharedImage failed.
[12735:12782:1028/033113.586377:ERROR:shared_image_interface_proxy.cc(134)] Buffer handle is null. Not creating a mailbox from it.
[12735:12782:1028/033113.586381:ERROR:one_copy_raster_buffer_provider.cc(348)] Creation of StagingBuffer's SharedImage failed.
[12735:12735:1028/033113.590044:ERROR:gpu_process_host.cc(982)] GPU process exited unexpectedly: exit_code=8704
[12735:12782:1028/033113.594426:ERROR:shared_image_interface_proxy.cc(134)] Buffer handle is null. Not creating a mailbox from it.
[12735:12782:1028/033113.594443:ERROR:one_copy_raster_buffer_provider.cc(348)] Creation of StagingBuffer's SharedImage failed.
[12735:12782:1028/033113.594460:ERROR:shared_image_interface_proxy.cc(134)] Buffer handle is null. Not creating a mailbox from it.
[12735:12782:1028/033113.594462:ERROR:one_copy_raster_buffer_provider.cc(348)] Creation of StagingBuffer's SharedImage failed.
[12735:12782:1028/033113.594809:ERROR:shared_image_interface_proxy.cc(134)] Buffer handle is null. Not creating a mailbox from it.
[12735:12782:1028/033113.594818:ERROR:one_copy_raster_buffer_provider.cc(348)] Creation of StagingBuffer's SharedImage failed.
[12735:12782:1028/033113.594833:ERROR:shared_image_interface_proxy.cc(134)] Buffer handle is null. Not creating a mailbox from it.
[12735:12782:1028/033113.594835:ERROR:one_copy_raster_buffer_provider.cc(348)] Creation of StagingBuffer's SharedImage failed.
[12889:10:1028/033113.636464:ERROR:command_buffer_proxy_impl.cc(131)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
[12895:10:1028/033113.719569:ERROR:command_buffer_proxy_impl.cc(131)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
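Most of the lines above repeat the same downstream error, so it can help to find the first failure and count the rest by source file. Below is a small Python sketch for that. The line format is inferred from the logs in this thread only, and the function and field names are hypothetical.

```python
import re
from collections import Counter

# Line shape inferred from this thread:
#   [pid:tid:MMDD/HHMMSS.micros:LEVEL:file.cc(line)] message
# Older Chromium builds print " : " after the bracket, so the colon is optional.
LINE_RE = re.compile(
    r"^\[(?P<pid>\d+):(?P<tid>\d+):(?P<date>\d{4})/(?P<time>\d{6}\.\d+):"
    r"(?P<level>[A-Z]+):(?P<file>[^()\]]+)\((?P<line>\d+)\)\]\s*:?\s*(?P<msg>.*)$"
)


def summarize(log_text):
    """Return (first_error, per_file_counts) for the ERROR lines in a Chromium log."""
    first_error = None
    counts = Counter()
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None or match.group("level") != "ERROR":
            continue  # skip continuation lines and non-error lines
        if first_error is None:
            first_error = match.groupdict()
        counts[match.group("file")] += 1
    return first_error, counts


if __name__ == "__main__":
    sample = "\n".join([
        "[12784:12784:1028/033113.573410:ERROR:angle_platform_impl.cc(44)] ImageEGL.cpp:112 (operator()): eglCreateImage failed with 0x00003003",
        "[12735:12782:1028/033113.584699:ERROR:shared_image_interface_proxy.cc(134)] Buffer handle is null. Not creating a mailbox from it.",
        "[12735:12782:1028/033113.585115:ERROR:shared_image_interface_proxy.cc(134)] Buffer handle is null. Not creating a mailbox from it.",
    ])
    first, per_file = summarize(sample)
    print(first["file"], first["msg"])
    for name, count in per_file.most_common():
        print(count, name)
```

In this log the first error is the eglCreateImage failure in the GPU process (pid 12784). The repeated "Buffer handle is null" lines come from a different process (pid 12735) and appear to be consequences of it.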

nvidia-bug-report.log.gz

Hanssen0 commented 3 weeks ago

I reproduced with the proprietary driver

Same on the open driver.

nvidia-open-dkms 560.35.03-18

nvidia-bug-report-open.log.gz

Hanssen0 commented 3 weeks ago

For anyone who runs into this problem and ends up here: as a temporary workaround, I bypassed it by adding the --disable-gpu-compositing flag to Chrome.

See "Hardware acceleration in electron apps on nvidia doesn't work" for more information.
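For anyone scripting the workaround, here is a minimal Python sketch that assembles the launch command from the two flags mentioned in this thread. The binary name "chromium" and the helper name are assumptions; adjust them for your browser or Electron app.

```python
import shlex


def chromium_command(binary="chromium", wayland=True,
                     disable_gpu_compositing=True, extra_args=()):
    """Build an argument list for launching a Chromium-based browser.

    `binary` is an assumed executable name; both flags are taken from
    this issue thread.
    """
    cmd = [binary]
    if wayland:
        cmd.append("--ozone-platform=wayland")
    if disable_gpu_compositing:
        cmd.append("--disable-gpu-compositing")
    cmd.extend(extra_args)
    return cmd


if __name__ == "__main__":
    # Print the command line rather than launching anything.
    print(shlex.join(chromium_command()))
```

The returned list can be passed directly to subprocess.run to start the browser.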