amshafer / nvidia-driver

Fork of the Nvidia FreeBSD driver to port the nvidia-drm.ko module from Linux

kernel panic with Xorg #7

Open therontarigo opened 1 year ago

therontarigo commented 1 year ago

Kernel: 13.1-RELEASE-p2
Hardware: GTX 960M, Intel HD 530 (SKL GT2)
drm-510-kmod: built from ports 25bd187bcf5e - the port uses GH_TAGNAME drm_v5.10.113_9, which is identical to branch 5.10-lts. (with MAKE_ENV+=DEBUG_FLAGS=-g)
amshafer/nvidia-driver/nvidia built with make DRMKMODDIR=/usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/

kldload src/nvidia/nvidia.ko
kldload src/nvidia-modeset/nvidia-modeset.ko

start Xorg, test
env __GLX_VENDOR_LIBRARY_NAME=nvidia glxgears -info
-> works, uses Nvidia

quit Xorg, kldload src/nvidia-drm/nvidia-drm.ko
start Xorg, test
env __GLX_VENDOR_LIBRARY_NAME=nvidia glxgears -info
-> kernel panic
Same panic results when testing Vulkan such as vkcube-xlib

/var/crash/core.txt relevant excerpt

Unread portion of the kernel message buffer:
[drm ERROR :__nv_drm_gem_nvkms_memory_prime_get_sg_table] [nvidia-drm] [GPU ID 0x00000100] Cannot create sg_table for NvKmsKapiMemory 0x0xfffff8038b2c1a08

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x0
fault code      = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff82ff6c64
stack pointer           = 0x28:0xfffffe0143b176e0
frame pointer           = 0x28:0xfffffe0143b17700
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 3
current process     = 274 (MainThread)
trap number     = 12
panic: page fault
cpuid = 1
time = 1672376070

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55      __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80ba952c in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:487
#3  0xffffffff80ba999e in vpanic (fmt=0xffffffff81141708 "%s", 
    ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920
#4  0xffffffff80ba97a3 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:844
#5  0xffffffff8103ddf5 in trap_fatal (frame=0xfffffe0143b17620, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:944
#6  0xffffffff8103de4f in trap_pfault (frame=0xfffffe0143b17620, 
    usermode=false, signo=<optimized out>, ucode=<optimized out>)
    at /usr/src/sys/amd64/amd64/trap.c:763
#7  <signal handler called>
#8  dma_map_sgtable (dev=0xfffff8000627f000, sgt=0x0, dir=DMA_BIDIRECTIONAL, 
    attrs=<optimized out>)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/linuxkpi/bsd/include/linux/dma-mapping.h:25
#9  drm_gem_map_dma_buf (attach=0xfffff8038b387a00, dir=DMA_BIDIRECTIONAL)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/drivers/gpu/drm/drm_prime.c:647
#10 0xffffffff8302eb2e in dma_buf_map_attachment (dba=0xfffff8038b387a00, 
    dir=dir@entry=DMA_BIDIRECTIONAL)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/drivers/dma-buf/dma-buf.c:510
#11 0xffffffff82f5819a in i915_gem_object_get_pages_dmabuf (
    obj=0xfffffe01429d2000)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c:195
#12 0xffffffff82f60686 in __i915_gem_object_get_pages (obj=0xfffffe01429d2000)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/drivers/gpu/drm/i915/gem/i915_gem_pages.c:126
#13 0xffffffff82e4fae6 in i915_gem_object_pin_pages (obj=0xfffff8000627f000)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/drivers/gpu/drm/i915/gem/i915_gem_object.h:330
#14 vma_get_pages (vma=0xfffffe01429de600)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/drivers/gpu/drm/i915/i915_vma.c:804
#15 i915_vma_pin_ww (vma=vma@entry=0xfffffe01429de600, ww=<optimized out>, 
    size=<optimized out>, size@entry=0, alignment=<optimized out>, 
    alignment@entry=0, flags=2177)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/drivers/gpu/drm/i915/i915_vma.c:882
#16 0xffffffff82f5af33 in eb_pin_vma (eb=0xfffffe0143b178c0, 
    entry=0xfffff8038ba65c70, ev=0xfffff8038ba65df0)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:444
#17 eb_validate_vmas (eb=<optimized out>, eb@entry=0xfffffe0143b178c0)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:906
#18 0xffffffff82f5a04a in eb_relocate_parse (eb=eb@entry=0xfffffe0143b178c0)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:2104
#19 0xffffffff82f58cd8 in i915_gem_do_execbuffer (
    dev=dev@entry=0xfffffe0140a61000, linux_file=<optimized out>, 
    linux_file@entry=0xfffff8038b4eb400, args=args@entry=0xfffffe0143b17bc0, 
    exec=exec@entry=0xfffff8038ba65c00)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:3147
#20 0xffffffff82f59182 in i915_gem_execbuffer2_ioctl (
    dev=dev@entry=0xfffffe0140a61000, data=data@entry=0xfffffe0143b17bc0, 
    linux_file=linux_file@entry=0xfffff8038b4eb400)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:3414
#21 0xffffffff82fea7aa in drm_ioctl_kernel (
    linux_file=linux_file@entry=0xfffff800012bb000, 
    func=func@entry=0xffffffff82f58fe0 <i915_gem_execbuffer2_ioctl>, 
    kdata=kdata@entry=0xfffffe0143b17bc0, flags=32)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/drivers/gpu/drm/drm_ioctl.c:806
#22 0xffffffff82feab59 in drm_ioctl (filp=0xfffff800012bb000, 
    cmd=<optimized out>, arg=<optimized out>)
    at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.113_9/drivers/gpu/drm/drm_ioctl.c:909
#23 0xffffffff80de937f in linux_file_ioctl_sub (fp=0xfffff80004e70140, 
    filp=0xfffff800012bb000, fop=<optimized out>, cmd=<optimized out>, 
    data=<optimized out>, td=<optimized out>)
    at /usr/src/sys/compat/linuxkpi/common/src/linux_compat.c:1116
#24 linux_file_ioctl (fp=0xfffff80004e70140, cmd=<optimized out>, 
    data=<optimized out>, cred=<optimized out>, td=0xfffffe0143884c80)
    at /usr/src/sys/compat/linuxkpi/common/src/linux_compat.c:1733
#25 0xffffffff80c175db in fo_ioctl (fp=0xfffff80004e70140, 
    com=<optimized out>, data=0x28a22, active_cred=0x1, td=0xfffffe0143884c80)
    at /usr/src/sys/sys/file.h:361
#26 kern_ioctl (td=0xfefefefefefefeff, fd=13, com=com@entry=2151703657, 
    data=0x28a22 <error: Cannot access memory at address 0x28a22>, 
    data@entry=0xfffffe0143b17d50 "") at /usr/src/sys/kern/sys_generic.c:803
#27 0xffffffff80c172e1 in sys_ioctl (td=<optimized out>, 
    uap=0xfffffe0143885068) at /usr/src/sys/kern/sys_generic.c:711
#28 0xffffffff8103e6ec in syscallenter (td=0xfffffe0143884c80)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#29 amd64_syscall (td=0xfffffe0143884c80, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:1185
#30 <signal handler called>
#31 0x000000080079f8da in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe358

amshafer commented 1 year ago

Ah yes, this is because we need https://reviews.freebsd.org/D37611 which is atm only present in 14.0. I've reached out to see if we can get that merged.

therontarigo commented 1 year ago

Following through to https://github.com/freebsd/drm-kmod/pull/218/files, I find the given fix is already present in the drm-510-kmod I tested against. (Interestingly, it is guarded by a check of the form if __FreeBSD_version < 1301507, suggesting the in-tree fix has already been MFC'd to 13-STABLE.)
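The effect of that version check can be sketched as below. This is only an illustration: the function name is hypothetical, the runtime comparison stands in for a preprocessor conditional, and the value 1301000 for 13.1-RELEASE is an assumption rather than something read off the tested system.

```c
#include <stdbool.h>

/* Assumption: 13.1-RELEASE reports __FreeBSD_version 1301000. */
#define FREEBSD_13_1_RELEASE 1301000
/* Threshold quoted in the drm-kmod check. */
#define FIX_IN_BASE_SINCE    1301507

/* Models a check of the form `__FreeBSD_version < 1301507`: true when
 * the port has to carry its own copy of the fix because the base
 * system is older than the version that includes it. */
static bool port_carries_fix(int freebsd_version)
{
    return freebsd_version < FIX_IN_BASE_SINCE;
}
```

Under these assumptions, port_carries_fix(FREEBSD_13_1_RELEASE) is true, which would be consistent with the fix being active in a drm-510-kmod built on 13.1-RELEASE.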

amshafer commented 1 year ago

Ah just realized you're on release. Hm it does look like an issue with that function in drm-kmod then, I'll take a look

therontarigo commented 1 year ago

dma_map_sgtable is dereferencing a null sgt - the problem occurs before this function.
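A minimal userspace sketch of how a null sgt could get that far, assuming the exporter's sg_table callback reports failure as a plain NULL while the importing path tests at most for error-encoded pointers (the Linux IS_ERR convention). Every name here is a hypothetical stand-in, not code from drm-kmod or nvidia-drm:

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace model of the Linux error-pointer convention: the highest
 * MAX_ERRNO_MODEL addresses stand for negative errno values. */
#define MAX_ERRNO_MODEL 4095

struct sg_table_model { int nents; };

/* Nonzero if p encodes an error value; note that NULL does not. */
static int is_err_model(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_MODEL;
}

/* Hypothetical exporter that reports failure as a plain NULL. */
static struct sg_table_model *export_fail_null(void)
{
    return NULL;
}

/* Hypothetical exporter that reports failure as an encoded -ENOMEM (12). */
static struct sg_table_model *export_fail_errptr(void)
{
    return (struct sg_table_model *)(intptr_t)-12;
}

/* Importer that only filters error pointers. Returns 1 if it would go on
 * to dereference sgt (the mapping step), 0 if it bails out first. */
static int reaches_map_unchecked(const struct sg_table_model *sgt)
{
    return !is_err_model(sgt);
}

/* Defensive importer that also rejects NULL before the mapping step. */
static int reaches_map_checked(const struct sg_table_model *sgt)
{
    return sgt != NULL && !is_err_model(sgt);
}
```

In this model, reaches_map_unchecked(export_fail_null()) is nonzero: a NULL failure slips past the error-pointer test and would be dereferenced at the mapping step, which matches sgt=0x0 in frame #8. The checked variant rejects both failure styles.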

amshafer commented 1 year ago

Working on reproducing this. What version of Xorg are you running, and does it contain https://gitlab.freedesktop.org/xorg/xserver/-/merge_requests/1009? Also, what's your xorg.conf?

therontarigo commented 1 year ago

That file renaming shouldn't change anything about Xorg functionality. Anyway, it is not present in the FreeBSD port. Xorg's PRIME offload is already working with nvidia-modeset without DRM.

As for xorg.conf, I have nothing that affects PRIME, Nvidia, or DRM: there is no xorg.conf, and in xorg.conf.d I have a handwritten intel.conf to specify the Intel device (SNA/TearFree/BusID/DRI=3) and an input.conf for touchpad settings. Since the intel driver may be having some configuration effect on kernel DRM, here it is anyway:

Section "Device"
        Option      "AccelMethod" "sna"
        Option      "TearFree" "true"
        Identifier  "Card0"
        Driver      "intel"
        Option      "DRI" "3"
        BusID       "PCI:0:2:0"
EndSection

therontarigo commented 1 year ago

I suppose i915kms is a suspect here.

amshafer commented 1 year ago

It's not just a file rename, it also enables the file (bits that interact with DRM for PRIME in the X server) for FreeBSD. You'll need it for prime X configs (such as those generated by nvidia-xconfig -prime).

Good to know about the X config though, I'll give that a try.

therontarigo commented 1 year ago

Ah, I see now the line was moved from linux section to BSD section of the build file.

Now I must check whether /usr/local/lib/xorg/modules/drivers/nvidia_drv.so and /usr/local/lib/xorg/modules/extensions/libglxserver_nvidia.so.1 are used at all in the Xorg GLX Nvidia offload I've been using, or whether /usr/local/lib/libGLX_nvidia.so.0 entirely bypasses any nvidia<->xorg interaction. I think the latter, with Xorg being unaware of any nvidia modules in this configuration.

Curious that nvidia-drm presence breaks this in any way, when PRIME is not configured at all.

therontarigo commented 1 year ago

Xorg with intel video is apparently necessary to reproduce the panic:
nvidia, nvidia-modeset, nvidia-drm are loaded
Xorg :1 -config xorg-dummy.conf -configdir /dev/null - no GPU interaction
Xorg :0 - using intel video driver
env DISPLAY=:1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxgears -> works, renders on Nvidia
env DISPLAY=:0 __GLX_VENDOR_LIBRARY_NAME=nvidia glxgears -> system hang, presumed to be the originally reported panic (the drm system in some circumstances hangs the system instead of producing a crash dump; that is an unrelated bug)

xorg-dummy.conf

Section "ServerFlags"
        Option "AutoAddDevices" "false"
EndSection

Section "Device"
        Identifier  "Card0"
        Driver      "dummy"
EndSection

amshafer commented 1 year ago

Still curious what your Xorg version is. I'm assuming whatever is the latest package? If it isn't too much of a pain I would recommend building with that MR that I linked earlier.

therontarigo commented 1 year ago

Oh, I forgot. It is 21.1.4. If you insist, I can try the patch, but it shouldn't be necessary to reproduce the panic. If it only happens on my hardware - I'll try to dig into this myself. (To be clear, I'm more interested in solving the panic than to have Xorg+DRM+PRIME working any time soon - let's focus on kernel module stability before worrying about Xorg.)

amshafer commented 1 year ago

I can reproduce a hang with vkcube but am still working on getting an actual stack trace. Annoyingly, my setup for PRIME doesn't seem to work out of the box now that I've moved back to 13.1-RELEASE from CURRENT. I asked about trying the patch mostly because I know top-of-tree Xorg works with it, since that's what I normally run.

If you're feeling brave enough to poke around in kgdb to see if you can tell what's going wrong that would be helpful. Just from looking at your output it looks like for some reason __nv_drm_nvkms_gem_obj_init gets called with a memory section that has no pages backing it (based on the dmesg warning). You could add a call to os_dump_stack (which lives in src/nvidia/nvidia_os.c iirc) to check where __nv_drm_nvkms_gem_obj_init gets called from.

amshafer commented 1 year ago

Ah okay, finally got a good repro of this.

amshafer commented 1 year ago

Looking into this a little more, I think this might be a non-BSD-specific nvidia-drm bug. I found the following report, which shows something similar that I'll look into: https://forums.developer.nvidia.com/t/wayland-nvidia-drm-desktop-freezes-when-playing-video-via-mpv-using-nvdec/215143

There also seem to be some issues with X properly auto-configuring secondary GPUs. Can you include your /var/log/Xorg.0.log by chance?

amshafer commented 1 year ago

This looks like it comes from nv_get_phys_pages not being implemented. That'll take a little while, but I am working on it.

amshafer commented 1 year ago

Can you please try with the new 525.78.01 branch? It should contain the needed fix. From testing on my end the issue goes away, so I feel reasonably confident you'll see the same.