Crash with Native GPU Memory Buffers

lorenz commented 5 years ago

When running Chromium on a compositor that supports some pixel formats for GPU Memory Buffers (for example, Weston), it fails to start with a segmentation fault in GbmPixmapWayland::InitializeBuffer(). GDB is not that useful since this is a release build. It looks like the crash is caused by either size or gbm_device() being null.

msisov commented 5 years ago

What's your system? Does it work when native gpu memory buffers are not used?

lorenz commented 5 years ago

Ubuntu 18.10 with Mesa 18.2 on an RX570. GBM is definitely supported. And yes, it does work without.

msisov commented 5 years ago

It looks more like an inconsistency between buffer usage and buffer type. I have an RX460 running on Debian stretch with backports and Mesa 18.2, which uses native gpu memory buffers without any problems.

Would it be possible for you to add symbol_level=2 and recompile Chromium to see the backtrace?

lorenz commented 5 years ago

Running a build with symbol_level=2 now.

lorenz commented 5 years ago

So, I still don't have line-by-line symbols, but by examining the registers and the assembly I'm 99% sure that gbm_device is null. I have no idea why though.

msisov commented 5 years ago

any news?

lorenz commented 5 years ago

Not really, still happens and I don't know why. I aborted the symbol_level=2 build after it consumed ~32GiB RAM and 20TiB of disk IO.

msisov commented 5 years ago

Would it be possible to add checks along the path to see if the device is really null?

nickdiego commented 5 years ago

> Not really, still happens and I don't know why. I aborted the symbol_level=2 build after it consumed ~32GiB RAM and 20TiB of disk IO.

Hi @lorenz, could you please try adding the following to your args.gn:

enable_nacl = false
ozone_auto_platforms = false
use_ozone = true
use_xkbcommon = true
ozone_platform_wayland = true
is_debug = false
remove_webcore_debug_symbols = true
symbol_level = 1
dcheck_always_on = true

This should be enough to get a stack trace in case of a crash.

Additionally, are you using use_system_minigbm=true? If yes, could you try with use_radeon_minigbm=true instead, and check whether the result is the same?

lorenz commented 5 years ago

@nickdiego I'm currently running a build with the args you suggested. Thanks for providing these, it's a bit hard to navigate all the build flags as a non-Chromium-dev. When the build finishes I'll report back.

lorenz commented 5 years ago

Build did complete, but the bug still persists. I get a null pointer segfault at gbm_pixmap_wayland.cc:69. This is using your build args and use_radeon_minigbm=true. The crash only happens when I'm force-enabling native GPU memory buffers.

lorenz commented 5 years ago

@nickdiego I instrumented the critical part and got this: [8632:8660:0202/173611.837021:ERROR:gbm_pixmap_wayland.cc(71)] connection_->gbm_device() is null

EDIT: I added debug output to the section where gbm_device is initialized and found the cause: --in-process-gpu skips the !args.single_process branch in InitializeGPU(), so the GBM device is never initialized. When I don't pass that argument I get "Failed to initialize gbm device". I'm still trying to figure out why that fails.
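
For readers following along, the failure mode described in this comment can be sketched roughly as below. This is only an illustration: FakeGbmDevice, InitParams, FakeConnection and this InitializeBuffer are hypothetical stand-ins, not the real Chromium classes, and the only details taken from the thread are the !single_process branch and the logged message.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-ins for illustration only; not the real Chromium types.
struct FakeGbmDevice {};

struct InitParams {
  bool single_process;  // assumed to be true when --in-process-gpu is passed
};

class FakeConnection {
 public:
  void InitializeGPU(const InitParams& args) {
    // Mirrors the branch described above: the device is only created when a
    // separate GPU process is used.
    if (!args.single_process)
      gbm_device_ = std::make_unique<FakeGbmDevice>();
  }

  FakeGbmDevice* gbm_device() const { return gbm_device_.get(); }

 private:
  std::unique_ptr<FakeGbmDevice> gbm_device_;
};

// Reports the problem instead of dereferencing a null device.
std::string InitializeBuffer(const FakeConnection& connection) {
  if (!connection.gbm_device())
    return "connection_->gbm_device() is null";
  return "ok";
}
```

With single_process set, the device pointer stays null and the guard returns the error string; without it, the device exists and the buffer path can proceed.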

msisov commented 5 years ago

It cannot be null. Otherwise, you wouldn’t be able to start the browser at all. Chromium tried to create a gbm bo with a buffer type not supported on your device. Can you copy/paste the about://gpu page here?

PS: you can’t use native gpu memory buffers with the --in-process-gpu flag. The feature is not used for in-process-gpu at the moment, and I guess it won’t be in the future either.

That path uses egl surfaces instead. That means gbm is not needed.

Though, we might make in-process-gpu work with gbm as well, and allow native gpu memory buffers then. But if gbm is not available, it would switch back to egl surfaces instead.

PPS: how do you start the browser? Which flags exactly do you pass?

lorenz commented 5 years ago

It is definitely null and the browser starts if not using native gpu buffers. I added a log there which literally checks connection_->gbm_device() == nullptr. I figured out that --in-process-gpu doesn't work with native gpu memory buffers. But when not passing that flag I get Failed to initialize gbm device.

msisov commented 5 years ago

I’ve already told you why it happens. Don’t pass the --in-process-gpu flag if you want to use native gpu memory buffers.

msisov commented 5 years ago

Is there a specific reason why you don’t want to have a separate gpu process running?

To sum up, gbm is used only without the in-process-gpu flag, i.e. when a separate gpu process is spawned. The native gpu memory buffers feature relies heavily on that.

Likely, we could always use gbm, which would allow native gpu memory buffers to work in in-process-gpu mode. If gbm is not available, we would forbid that feature and fall back to egl surfaces (again, they are used instead of gbm with in-process-gpu). I don’t think egl surfaces can be used with the native gpu memory buffers feature, since it was initially made for drm and requires a native pixmap based on drm planes, etc.
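
The fallback idea in this comment could look roughly like the sketch below. It is not existing Chromium code: SurfaceBackend, Decision and ChooseBackend are hypothetical names, and the policy is only the one proposed above (prefer gbm when it is available, otherwise use egl surfaces and disable native gpu memory buffers).

```cpp
// Hypothetical names for illustration; none of these are real Chromium types.
enum class SurfaceBackend { kGbm, kEglSurfaces };

struct Decision {
  SurfaceBackend backend;
  bool native_gpu_memory_buffers;  // whether the feature stays enabled
};

// Proposed policy: prefer gbm whenever it is available, regardless of whether
// the GPU runs in-process; otherwise fall back to egl surfaces and forbid
// native gpu memory buffers.
Decision ChooseBackend(bool gbm_available, bool native_buffers_requested) {
  if (gbm_available)
    return {SurfaceBackend::kGbm, native_buffers_requested};
  return {SurfaceBackend::kEglSurfaces, false};
}
```

Under this sketch a request for native gpu memory buffers is honored only when gbm is present; without gbm the request is dropped rather than crashing later on a missing device.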

lorenz commented 5 years ago

Some more info: I switched to use_system_gbm=true, and now I no longer get a GBM initialization error. But I get this (the "GBM device available" line is injected by me and is printed after set_gbm_device() in InitializeGPU):

[22018:22018:0202/193557.532174:ERROR:ozone_platform_wayland.cc(185)] GBM device available
[22018:22018:0202/193557.549188:ERROR:sandbox_linux.cc(364)] InitializeSandbox() called with multiple threads in process gpu-process.
[21990:22001:0202/193557.704705:ERROR:browser_child_process_host_impl.cc(430)] Terminating child process for bad IPC message: Number of strides(1)/offsets(1)/modifiers(0) does not correspond to the number of planes(1)
[1:1:0202/193557.743314:ERROR:command_buffer_proxy_impl.cc(106)] ContextResult::kTransientFailure: Shared memory region is not valid
[1:1:0202/193557.743387:ERROR:context_provider_command_buffer.cc(143)] GpuChannelHost failed to create command buffer.
[22195:22195:0202/193557.758242:ERROR:ozone_platform_wayland.cc(185)] GBM device available
[22195:22195:0202/193557.773503:ERROR:sandbox_linux.cc(364)] InitializeSandbox() called with multiple threads in process gpu-process.
[21990:22001:0202/193557.791516:ERROR:browser_child_process_host_impl.cc(430)] Terminating child process for bad IPC message: Number of strides(1)/offsets(1)/modifiers(0) does not correspond to the number of planes(1)
[21990:21990:0202/193557.710611:ERROR:wayland_connection_connector.cc(49)] Not implemented reached in virtual void ui::WaylandConnectionConnector::OnChannelDestroyed(int)
[21990:21990:0202/193557.798019:ERROR:wayland_connection_connector.cc(49)] Not implemented reached in virtual void ui::WaylandConnectionConnector::OnChannelDestroyed(int)
[22247:22247:0202/193557.838304:ERROR:ozone_platform_wayland.cc(185)] GBM device available
[22247:22247:0202/193557.851483:ERROR:sandbox_linux.cc(364)] InitializeSandbox() called with multiple threads in process gpu-process.
[22247:22247:0202/193557.853999:FATAL:wayland_connection_proxy.cc(68)] Check failed: wc_ptr_. 
#0 0x55fc1f058519 base::debug::CollectStackTrace()
#1 0x55fc1ef847b3 base::debug::StackTrace::StackTrace()
#2 0x55fc1ef9d8fa logging::LogMessage::~LogMessage()
#3 0x55fc1c3f5ac3 ui::WaylandConnectionProxy::CreateZwpLinuxDmabufInternal()
#4 0x55fc1c3f62dc base::internal::Invoker<>::RunOnce()
#5 0x55fc1efa7219 base::debug::TaskAnnotator::RunTask()
#6 0x55fc1effffff base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoWorkImpl()
#7 0x55fc1f0004f4 base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoWork()
#8 0x55fc1efa8f4a base::MessagePumpDefault::Run()
#9 0x55fc1f000899 base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::Run()
#10 0x55fc1efd0020 base::RunLoop::Run()
#11 0x55fc23713114 content::GpuMain()
#12 0x55fc1eaec29e content::ContentMainRunnerImpl::Run()
#13 0x55fc1eb1f6a6 service_manager::Main()
#14 0x55fc1eaea491 content::ContentMain()
#15 0x55fc1b95b1b3 ChromeMain
#16 0x7fd502b4109b __libc_start_main
#17 0x55fc1b95b02a _start

I dropped a bunch of entries related to unimplemented functions that don't seem to matter.

msisov commented 5 years ago

Ok, your gbm implementation doesn’t provide the modifiers field. That’s why the browser process terminates the gpu process: we have a validation method on the gpu process side, which checks that the passed information is not compromised. That literally means that the containers with strides, modifiers and offsets must have the same size.

We haven’t seen that kind of problem before, but it seems your system is an exception. I’ll let you know once it’s fixed. Most likely, that field is going to be removed from the check.
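
As a rough illustration of the check described here, the sketch below compares the container sizes against the plane count. BufferPlanes, ValidateStrict and ValidateWithoutModifiers are hypothetical names, not the actual Chromium validation code; the relaxed variant only reflects the stated intention to drop modifiers from the check.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of a buffer passed over IPC; not the real type.
struct BufferPlanes {
  std::size_t planes_count;
  std::vector<std::uint32_t> strides;
  std::vector<std::uint32_t> offsets;
  std::vector<std::uint64_t> modifiers;
};

// Strict check: every container must have one entry per plane.
bool ValidateStrict(const BufferPlanes& b) {
  return b.strides.size() == b.planes_count &&
         b.offsets.size() == b.planes_count &&
         b.modifiers.size() == b.planes_count;
}

// Relaxed check (assumed shape of the fix): modifiers are not required.
bool ValidateWithoutModifiers(const BufferPlanes& b) {
  return b.strides.size() == b.planes_count &&
         b.offsets.size() == b.planes_count;
}
```

A buffer shaped like the one in the log (1 plane, 1 stride, 1 offset, 0 modifiers) fails the strict check and passes the relaxed one.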

lorenz commented 5 years ago

Makes sense. I'm fairly sure that issue is common when using system GBM, since Ubuntu ships a largely unmodified upstream Mesa 18.2, which is what everybody else uses too. The thing is that all the other minigbm implementations don't work on my system.

msisov commented 5 years ago

It’s about the gpu driver and dri/drm rather than gbm. Gbm is just a generic buffer manager, which abstracts everything underneath.

In any case, the root cause is clear now and will be fixed ASAP.

msisov commented 5 years ago

The issue has been moved upstream: https://bugs.chromium.org/p/chromium/issues/detail?id=928261

msisov commented 5 years ago

fixed by https://github.com/Igalia/chromium/pull/528

Igalia / chromium

Crash with Native GPU Memory Buffers #508