CTSRD-CHERI / cheribsd

FreeBSD adapted for CHERI-RISC-V and Arm Morello.
http://cheribsd.org

panic: Assertion vmap != NULL failed at /usr/src/sys/dev/drm/freebsd/drm_os_freebsd.c:370 #2097

Open rwatson opened 2 months ago

rwatson commented 2 months ago

Running with a kernel/userlevel from #2080, I saw this kernel panic when starting an aarch64 Chromium web browser within an otherwise entirely purecap (kernel, userlevel, desktop) environment:

panic: Assertion vmap != NULL failed at /usr/src/sys/dev/drm/freebsd/drm_os_freebsd.c:370

The kernel build was:

FreeBSD cheri-blossom.sec.cl.cam.ac.uk 15.0-CURRENT FreeBSD 15.0-CURRENT #19 c18n_procstat-n268168-8e6f163a2c50: Tue Apr 9 02:29:44 UTC 2024 robert@cheri-blossom.sec.cl.cam.ac.uk:/usr/obj/usr/src/arm64.aarch64c/sys/GENERIC-MORELLO-PURECAP arm64

Async revocation and default-enabled c18n are both turned on:

# sysctl security.cheri
security.cheri.ptrace_caps: 0
security.cheri.cloadtags_stride: 4
security.cheri.sealcap: 0x4 [,0x4-0x2000]
security.cheri.runtime_revocation_async: 1
security.cheri.runtime_revocation_every_free_default: 0
security.cheri.runtime_revocation_default: 1
security.cheri.lib_based_c18n_default: 1
security.cheri.bound_legacy_capabilities: 0
security.cheri.abort_on_memcpy_tag_loss: 0
security.cheri.debugger_on_sandbox_syscall: 0
security.cheri.stats.untagged_ptrace_caps: 0
security.cheri.stats.forged_ptrace_caps: 0
security.cheri.stats.syscall_violations: 0
security.cheri.capability_size: 16
security.cheri.cidcap: 0x0 [,0x0-0x8000000000000000]

Console output:

panic: Assertion vmap != NULL failed at /usr/src/sys/dev/drm/freebsd/drm_os_freebsd.c:370
cpuid = 2
time = 1714877899
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x38
vpanic() at vpanic+0x17c
panic() at panic+0x48
drm_cdev_pager_populate() at drm_cdev_pager_populate+0x218
vm_fault_allocate() at vm_fault_allocate+0x434
vm_fault() at vm_fault+0x45c
vm_fault_trap() at vm_fault_trap+0x78
data_abort() at data_abort+0x1b8
do_el0_sync() at do_el0_sync+0xac
handle_el0_sync() at handle_el0_sync+0x30
--- exception, esr 0x92000047

KGDB on the crashdump reports:

(kgdb) bt
#0  get_curthread () at /usr/src/sys/arm64/include/pcpu.h:92
#1  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:411
#2  0xffff00000057ca30 in kern_reboot (howto=18007856)
    at /usr/src/sys/kern/kern_shutdown.c:529
#3  0xffff00000057d028 in vpanic (
    fmt=0xffff000000a852fb [rR,0xffff000000a852fb-0xffff000000a85318] (invalid) "Assertion %s failed at %s:%d", ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:989
#4  0xffff00000057cd10 in panic (
    fmt=0xffff000000a852fb [rR,0xffff000000a852fb-0xffff000000a85318] (invalid) "Assertion %s failed at %s:%d") at /usr/src/sys/kern/kern_shutdown.c:913
#5  0xffff0000001dbe6c in drm_cdev_pager_populate (
    vm_obj=0xffffa080d61241f0 [rwRW,0xffffa080d61241f0-0xffffa080d61243e0] (invalid), pidx=0, fault_type=0, max_prot=<optimized out>, 
    first=0xffff0001943a1db0 [rwRW,0xffff0001943a1db0-0xffff0001943a1db8] (invalid), 
    last=0xffff0001943a1da8 [rwRW,0xffff0001943a1da8-0xffff0001943a1db0] (invalid))
    at /usr/src/sys/dev/drm/freebsd/drm_os_freebsd.c:371
#6  0xffff0000008bc1f8 in vm_pager_populate (object=0x1, pidx=18446462598750848816, 
    fault_type=16901008, 
    first=0xffff00000112c124 <boottrace_enabled> [rwRW,0xffff00000112c124-0xffff00000112c125] (invalid), 
    last=0xffff0000010ab174 <cold> [rwRW,0xffff0000010ab174-0xffff0000010ab178] (invalid), max_prot=<optimized out>) at /usr/src/sys/vm/vm_pager.h:185
#7  vm_fault_populate (
    fs=0xffff0001943a2010 [rwRW,0xffff0001943a2010-0xffff0001943a2110] (invalid))
    at /usr/src/sys/vm/vm_fault.c:668
#8  vm_fault_allocate (
    fs=0xffff0001943a2010 [rwRW,0xffff0001943a2010-0xffff0001943a2110] (invalid))
    at /usr/src/sys/vm/vm_fault.c:1500
#9  0xffff0000008ba7a4 in vm_fault_object (
    fs=0xffff0001943a2010 [rwRW,0xffff0001943a2010-0xffff0001943a2110] (invalid), 
    behindp=<optimized out>, aheadp=<optimized out>)
    at /usr/src/sys/vm/vm_fault.c:1767
#10 vm_fault (map=<optimized out>, vaddr=2261975040, fault_type=0 '\000', 
    fault_flags=<optimized out>, m_hold=<optimized out>)
    at /usr/src/sys/vm/vm_fault.c:1901
#11 0xffff0000008ba198 in vm_fault_trap (
    map=0xffff00019890c140 [rwRW,0xffff00019890c140-0xffff00019890c420] (invalid), 
    vaddr=18446462598750318964, fault_type=0 '\000', fault_flags=16901008, 
    signo=0xffff0001943a22ec [rwRW,0xffff0001943a22ec-0xffff0001943a22f0] (invalid), 
    ucode=0xffff0001943a22e8 [rwRW,0xffff0001943a22e8-0xffff0001943a22ec] (invalid))
    at /usr/src/sys/vm/vm_fault.c:925
#12 0xffff00000093e4a8 in data_abort (
    td=0xffff00019a659c80 [rwRW,0xffff00019a659c80-0xffff00019a65a5f0] (invalid), 
    frame=0xffff0001943a25d0 [rwRW,0xffff00019439d000-0xffff0001943a3000] (invalid), 
    esr=18446462598750848816, far=18446638520598272768, lower=18006308)
    at /usr/src/sys/arm64/arm64/trap.c:485
#13 0xffff00000093d678 in do_el0_sync (
    td=0xffff00019a659c80 [rwRW,0xffff00019a659c80-0xffff00019a65a5f0] (invalid), 
    frame=0xffff0001943a25d0 [rwRW,0xffff00019439d000-0xffff0001943a3000] (invalid))
    at /usr/src/sys/arm64/arm64/trap.c:781
#14 <signal handler called>
#15 0x0000000044046934 in ?? ()
#16 0x000000005e7a3a44 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

The process in question was plasmashell.

markjdb commented 2 months ago

So DRM maintains a global linked list of VMA structures that reference mapped GEM objects. The VM object corresponding to a GEM object keeps a pointer to the GEM object in its handle field; during a fault, we look up the VMA using the handle as the key. We crashed because we took a fault but couldn't find the mapping's VMA in the global list.

The VM object and GEM object look fine, i.e., they're valid and haven't been destroyed. The refcount on the VM object is 2. Interestingly, at the time of the panic, a child process of plasmashell was in the middle of exec'ing, so it was busy freeing its mappings and dropping VM object references.

In drm_fstub_do_mmap(), we allocate a VMA and insert it into the global list. The code also handles the case where the key already exists, i.e., the GEM object has already been mapped via more than one VM object; in that case, we don't insert the new VMA:

                rw_wlock(&drm_vma_lock);
                TAILQ_FOREACH(ptr, &drm_vma_head, vm_entry) {
                        if (ptr->vm_private_data == vm_private_data)
                                break;
                }
                /* check if there is an existing VM area struct */
                if (ptr != NULL) {
                        /* check if the VM area structure is invalid */
                        if (ptr->vm_ops == NULL ||
                            ptr->vm_ops->open == NULL ||
                            ptr->vm_ops->close == NULL) {
                                rv = ESTALE;
                                vm_no_fault = 1;
                        } else {
                                rv = EEXIST;
                                vm_no_fault = (ptr->vm_ops->fault == NULL);
                        }
                } else {
                        /* insert VM area structure into list */
                        TAILQ_INSERT_TAIL(&drm_vma_head, vmap, vm_entry);
                        rv = 0;
                        vm_no_fault = (vmap->vm_ops->fault == NULL);
                }
                rw_wunlock(&drm_vma_lock);

                if (rv != 0) {
                        /* free allocated VM area struct */
                        drm_vmap_free(vmap);
                        /* check for stale VM area struct */
                        if (rv != EEXIST)
                                return (rv);
                }

                /* check if there is no fault handler */
                if (vm_no_fault) {
                        *obj = cdev_pager_allocate(vm_private_data,
                            OBJT_DEVICE, &drm_dev_pg_ops, size, prot,
                            *foff, td->td_ucred);
                } else {
                        *obj = cdev_pager_allocate(vm_private_data,
                            OBJT_MGTDEVICE, &drm_mgtdev_pg_ops, size, prot,
                            *foff, td->td_ucred);
                }

So in the rv = EEXIST case, it looks like we can end up with more than one VM object referencing the same GEM object. When one of those VM objects is destroyed (perhaps triggered by the child process exec()ing), the corresponding VMA is removed from the global list, and a subsequent fault on a different referencing VM object then fails to look up a VMA, triggering the panic that Robert hit. I can't really see any other way it could happen.

Thus my questions are:

  1. Under what circumstances does drm_fstub_do_mmap() find an existing GEM object in the global list?
  2. If there is a legitimate explanation for 1, how do we handle it? A refcount on the VMA list entries would help, but some of those fields assume that the mapping is unique. That is, if you have two mappings referencing a GEM object via the VMA list, what meaning do the vm_start and vm_end fields of the VMA have?

@bukinr, do you have any idea what causes (1)?
bukinr commented 2 months ago

I am trying to figure it out. I inserted printfs and opened Chromium (10 tabs); so far execution does not reach either of these printf paths:

                /* check if there is an existing VM area struct */
                if (ptr != NULL) {
                        /* check if the VM area structure is invalid */
                        if (ptr->vm_ops == NULL ||
                            ptr->vm_ops->open == NULL ||
                            ptr->vm_ops->close == NULL) {
printf("%s stale\n", __func__);
                                rv = ESTALE;
                                vm_no_fault = 1;
                        } else {
printf("%s eexist\n", __func__);
                                rv = EEXIST;
                                vm_no_fault = (ptr->vm_ops->fault == NULL);
                        }

Note there are around 300 entries in drm_vma_head when Chromium is open.

rwatson commented 2 months ago

In my experience, sudden termination of the display server can cause unhappiness in the kernel when other applications bump into the mess left behind. I wonder whether something else triggered a bug in the display server, and plasmashell then tripped over it?

bukinr commented 2 months ago

It could be an out-of-memory situation that is not handled correctly. We don't have the IOMMU enabled (the iommu code exists and works). I got another panic during vm_page_reclaim_contig(), which panfrost calls when it can't allocate large chunks of contiguous memory:

WARNING !list_empty(&lock->head) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_modeset_lock.c:268
WARNING !drm_modeset_is_locked(&crtc->mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:617
WARNING !drm_modeset_is_locked(&dev->mode_config.connection_mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:667
WARNING !drm_modeset_is_locked(&plane->mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:892
WARNING !drm_modeset_is_locked(&plane->mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:892
<3>[drm: 0xffff0000001768c8] *ERROR* [CRTC:33:crtc-0] hw_done timed out
<3>[drm: 0xffff0000001768f4] *ERROR* [CRTC:33:crtc-0] flip_done timed out
<3>[drm: 0xffff00000017697c] *ERROR* [CONNECTOR:35:HDMI-A-1] hw_done timed out
<3>[drm: 0xffff0000001769a8] *ERROR* [CONNECTOR:35:HDMI-A-1] flip_done timed out
<3>[drm: 0xffff000000176a38] *ERROR* [PLANE:31:plane-0] hw_done timed out
<3>[drm: 0xffff000000176a64] *ERROR* [PLANE:31:plane-0] flip_done timed out
<3>[drm: 0xffff000000176a38] *ERROR* [PLANE:32:plane-1] hw_done timed out
<3>[drm: 0xffff000000176a64] *ERROR* [PLANE:32:plane-1] flip_done timed out
panic: page 0xffffa08368d80080 is PG_FICTITIOUS or PG_MARKER
cpuid = 2
time = 1715087326
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x34
vpanic() at vpanic+0x13c
panic() at panic+0x64
vm_page_reclaim_contig_domain_ext() at vm_page_reclaim_contig_domain_ext+0xfc4
vm_page_reclaim_contig() at vm_page_reclaim_contig+0x70
panfrost_gem_get_pages() at panfrost_gem_get_pages+0xd8
panfrost_gem_open() at panfrost_gem_open+0x190
drm_gem_handle_create_tail() at drm_gem_handle_create_tail+0x180
panfrost_gem_create_object_with_handle() at panfrost_gem_create_object_with_handle+0x114
panfrost_ioctl_create_bo() at panfrost_ioctl_create_bo+0x38
drm_ioctl_kernel() at drm_ioctl_kernel+0xcc
drm_ioctl() at drm_ioctl+0x194
drm_fstub_ioctl() at drm_fstub_ioctl+0x84
kern_ioctl() at kern_ioctl+0x2e0
user_ioctl() at user_ioctl+0x178
do_el0_sync() at do_el0_sync+0x630
handle_el0_sync() at handle_el0_sync+0x3c
--- exception, esr 0x56000000
KDB: enter: panic
WARNING !list_empty(&lock->head) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_modeset_lock.c:268
WARNING !list_empty(&lock->head) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_modeset_lock.c:268
WARNING !list_empty(&lock->head) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_modeset_lock.c:268
WARNING !list_empty(&lock->head) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_modeset_lock.c:268
WARNING !drm_modeset_is_locked(&crtc->mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:617
WARNING !drm_modeset_is_locked(&dev->mode_config.connection_mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:667
WARNING !drm_modeset_is_locked(&plane->mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:892
WARNING !drm_modeset_is_locked(&plane->mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:892
<3>[drm: 0xffff0000001768c8] *ERROR* [CRTC:33:crtc-0] hw_done timed out
<3>[drm: 0xffff0000001768f4] *ERROR* [CRTC:33:crtc-0] flip_done timed out
<3>[drm: 0xffff00000017697c] *ERROR* [CONNECTOR:35:HDMI-A-1] hw_done timed out
<3>[drm: 0xffff0000001769a8] *ERROR* [CONNECTOR:35:HDMI-A-1] flip_done timed out
<3>[drm: 0xffff000000176a38] *ERROR* [PLANE:31:plane-0] hw_done timed out
<3>[drm: 0xffff000000176a64] *ERROR* [PLANE:31:plane-0] flip_done timed out
<3>[drm: 0xffff000000176a38] *ERROR* [PLANE:32:plane-1] hw_done timed out
<3>[drm: 0xffff000000176a64] *ERROR* [PLANE:32:plane-1] flip_done timed out
[ thread pid 1145 tid 101378 ]
Stopped at      kdb_enter+0x48: str     xzr, [x19]
db>