FreeBSDDesktop / kms-drm

the DRM part of the linuxkpi-based KMS

[amdgpu] ioctl locking issue (test case: Wayland-EGL-Firefox + Reddit new design) #133

Closed (valpackett closed this issue 3 years ago)

valpackett commented 5 years ago

There's a weird locking bug in amdgpu somewhere.

I'm using Firefox Nightly (currently built 2019-02-09) on Wayland (natively) with GL acceleration enabled (just GL layer compositing as WebRender+Wayland is incomplete right now). (Also: Mesa 19.0.0-dev as of 2019-02-02, Wayfire as of today, kernel as of yesterday.)

When browsing the Reddit redesign (i.e. not old.reddit.com) in this browser, something goes wrong and dmesg shows lines like:

    amdgpu_vm_validate_pt_bos() failed.
    [drm:amdgpu_ih_process] [drm:amdgpu_cs_ioctl] Failed to process the buffer list -22!
    [TTM] Buffer eviction failed
    drmn0: failed to get a new IB (-22)

22 is EINVAL. Invalid calls from userspace are fine in themselves, but something goes wrong with locking when one of them is rejected.

On a NODEBUG kernel, this results in:

I finally decided to run a debug kernel and found some more information. The debug panic is:

    panic: userret: Returning with 1 locks held

For some reason ddb thinks Firefox is unmounting something??

    --- syscall (22, FreeBSD ELF64, sys_unmount), rip = 0x8013bc0da, rsp = 0x7fffdd8f9d68, rbp = 0x7fffdd8f9d90 ---

But that's clearly not true. Digging around with kgdb shows that it's ioctl(AMDGPU_CS) (44). Another similar crash I got today seems to be ioctl(DRM_RES_CTX) (38 / 0x26). (UPD: another 38, now on YouTube)

Maybe this is the same as #98, maybe there are two different places, but something in amdgpu (probably in error handling code) is not unlocking some important lock.
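To make the suspected bug class concrete, here is a toy sketch of how an error path can return with a lock still held. This is not amdgpu code: `toy_lock`, `submit_leaky`, `submit_fixed` and `locks_left_by` are made-up names, and the "lock" is just a counter so the leak can be observed without deadlocking the example.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical toy lock: only a counter, standing in for the kernel's
 * per-thread "locks held" bookkeeping. Nothing here is amdgpu code. */
static int locks_held = 0;

static void toy_lock(void)   { locks_held++; }
static void toy_unlock(void) { assert(locks_held > 0); locks_held--; }

/* Leaky variant: the early return on a rejected request skips the unlock. */
int submit_leaky(int request_ok)
{
    toy_lock();
    if (!request_ok)
        return -EINVAL;      /* bug: returns with the lock still held */
    toy_unlock();
    return 0;
}

/* Fixed variant: every path, including the error path, reaches the unlock. */
int submit_fixed(int request_ok)
{
    int ret = 0;

    toy_lock();
    if (!request_ok) {
        ret = -EINVAL;
        goto out;
    }
    /* ... real work would go here ... */
out:
    toy_unlock();
    return ret;
}

/* Runs one handler from a clean state and reports how many locks it left held. */
int locks_left_by(int (*handler)(int), int request_ok)
{
    locks_held = 0;
    (void)handler(request_ok);
    return locks_held;
}
```

In this sketch, `locks_left_by(submit_leaky, 0)` comes back non-zero, which is the same condition the debug kernel reports as "Returning with 1 locks held". The leaky variant behaves fine as long as the request is valid, which would fit a bug that only shows up once userspace sends something the driver rejects.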

I see one BSDFIXME in amdgpu_cs.c, and interestingly it mentions a mutex struct:

    /* BSDFIXME: On FreeBSD we don't store the ww_acquire_ctx in the ww_mutex struct */
    /* Double check that the BO is reserved by this CS */