Open rwatson opened 2 months ago
So DRM maintains a global linked list of VMA structures that reference mapped GEM objects. The VM object corresponding to a GEM object keeps a pointer to the GEM object in handle; during a fault, we look up the VMA using the handle as a key. We crashed because we took a fault but couldn't find the mapping VMA in the global list.
The VM object and GEM object look fine, i.e., they're valid and haven't been destroyed. The refcount on the VM object is 2. Interestingly, at the time of the panic, a child process of plasmashell was in the middle of exec'ing, so it was busy freeing its mappings and dropping VM object references.
In drm_fstub_do_mmap(), we allocate a VMA and insert it into the global list. But, apparently, we also handle the case where the key already exists, i.e., the GEM object has already been mapped via more than one VM object. In that case, we don't insert the VMA:
	rw_wlock(&drm_vma_lock);
	TAILQ_FOREACH(ptr, &drm_vma_head, vm_entry) {
		if (ptr->vm_private_data == vm_private_data)
			break;
	}
	/* check if there is an existing VM area struct */
	if (ptr != NULL) {
		/* check if the VM area structure is invalid */
		if (ptr->vm_ops == NULL ||
		    ptr->vm_ops->open == NULL ||
		    ptr->vm_ops->close == NULL) {
			rv = ESTALE;
			vm_no_fault = 1;
		} else {
			rv = EEXIST;
			vm_no_fault = (ptr->vm_ops->fault == NULL);
		}
	} else {
		/* insert VM area structure into list */
		TAILQ_INSERT_TAIL(&drm_vma_head, vmap, vm_entry);
		rv = 0;
		vm_no_fault = (vmap->vm_ops->fault == NULL);
	}
	rw_wunlock(&drm_vma_lock);

	if (rv != 0) {
		/* free allocated VM area struct */
		drm_vmap_free(vmap);
		/* check for stale VM area struct */
		if (rv != EEXIST)
			return (rv);
	}

	/* check if there is no fault handler */
	if (vm_no_fault) {
		*obj = cdev_pager_allocate(vm_private_data,
		    OBJT_DEVICE, &drm_dev_pg_ops, size, prot,
		    *foff, td->td_ucred);
	} else {
		*obj = cdev_pager_allocate(vm_private_data,
		    OBJT_MGTDEVICE, &drm_mgtdev_pg_ops, size, prot,
		    *foff, td->td_ucred);
	}
So in the rv = EEXIST case, it looks like we can end up with more than one VM object referencing the same GEM object. When one of those VM objects is destroyed (perhaps triggered by the child process exec()ing), the corresponding VMA is removed from the global list, but then a fault on a different referencing VM object fails to look up a VMA, and we trigger the panic that Robert hit. I can't really see any other way it could happen.
Thus my questions are:

1) Under what circumstances does drm_fstub_do_mmap() find an existing GEM object in the global list?
2) In that case, what values do the vm_start and vm_end fields of the VMA have?
@bukinr do you have any idea what causes 1)?

I am trying to figure it out. I inserted printfs and opened Chromium (10 tabs); so far it does not reach either of these printf paths:
	/* check if there is an existing VM area struct */
	if (ptr != NULL) {
		/* check if the VM area structure is invalid */
		if (ptr->vm_ops == NULL ||
		    ptr->vm_ops->open == NULL ||
		    ptr->vm_ops->close == NULL) {
			printf("%s stale\n", __func__);
			rv = ESTALE;
			vm_no_fault = 1;
		} else {
			printf("%s eexist\n", __func__);
			rv = EEXIST;
			vm_no_fault = (ptr->vm_ops->fault == NULL);
		}
	}
Note that there are around 300 entries in drm_vma_head when Chromium is open.
In my experience, sudden termination of the display server can cause unhappiness in the kernel when other applications bump into the mess left behind. I wonder whether something else triggered a bug in the display server, and then plasmashell tripped over it?
It could be an out-of-memory situation that is not handled correctly. We don't have the IOMMU enabled (the iommu code exists and works). I got another panic during vm_page_reclaim_contig(), which panfrost calls when it can't allocate large chunks of contiguous memory:
WARNING !list_empty(&lock->head) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_modeset_lock.c:268
WARNING !drm_modeset_is_locked(&crtc->mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:617
WARNING !drm_modeset_is_locked(&dev->mode_config.connection_mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:667
WARNING !drm_modeset_is_locked(&plane->mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:892
WARNING !drm_modeset_is_locked(&plane->mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:892
<3>[drm: 0xffff0000001768c8] *ERROR* [CRTC:33:crtc-0] hw_done timed out
<3>[drm: 0xffff0000001768f4] *ERROR* [CRTC:33:crtc-0] flip_done timed out
<3>[drm: 0xffff00000017697c] *ERROR* [CONNECTOR:35:HDMI-A-1] hw_done timed out
<3>[drm: 0xffff0000001769a8] *ERROR* [CONNECTOR:35:HDMI-A-1] flip_done timed out
<3>[drm: 0xffff000000176a38] *ERROR* [PLANE:31:plane-0] hw_done timed out
<3>[drm: 0xffff000000176a64] *ERROR* [PLANE:31:plane-0] flip_done timed out
<3>[drm: 0xffff000000176a38] *ERROR* [PLANE:32:plane-1] hw_done timed out
<3>[drm: 0xffff000000176a64] *ERROR* [PLANE:32:plane-1] flip_done timed out
panic: page 0xffffa08368d80080 is PG_FICTITIOUS or PG_MARKER
cpuid = 2
time = 1715087326
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x34
vpanic() at vpanic+0x13c
panic() at panic+0x64
vm_page_reclaim_contig_domain_ext() at vm_page_reclaim_contig_domain_ext+0xfc4
vm_page_reclaim_contig() at vm_page_reclaim_contig+0x70
panfrost_gem_get_pages() at panfrost_gem_get_pages+0xd8
panfrost_gem_open() at panfrost_gem_open+0x190
drm_gem_handle_create_tail() at drm_gem_handle_create_tail+0x180
panfrost_gem_create_object_with_handle() at panfrost_gem_create_object_with_handle+0x114
panfrost_ioctl_create_bo() at panfrost_ioctl_create_bo+0x38
drm_ioctl_kernel() at drm_ioctl_kernel+0xcc
drm_ioctl() at drm_ioctl+0x194
drm_fstub_ioctl() at drm_fstub_ioctl+0x84
kern_ioctl() at kern_ioctl+0x2e0
user_ioctl() at user_ioctl+0x178
do_el0_sync() at do_el0_sync+0x630
handle_el0_sync() at handle_el0_sync+0x3c
--- exception, esr 0x56000000
KDB: enter: panic
WARNING !list_empty(&lock->head) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_modeset_lock.c:268
WARNING !list_empty(&lock->head) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_modeset_lock.c:268
WARNING !list_empty(&lock->head) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_modeset_lock.c:268
WARNING !list_empty(&lock->head) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_modeset_lock.c:268
WARNING !drm_modeset_is_locked(&crtc->mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:617
WARNING !drm_modeset_is_locked(&dev->mode_config.connection_mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:667
WARNING !drm_modeset_is_locked(&plane->mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:892
WARNING !drm_modeset_is_locked(&plane->mutex) failed at /home/br/cheri/cheribsd/sys/dev/drm/core/drm_atomic_helper.c:892
<3>[drm: 0xffff0000001768c8] *ERROR* [CRTC:33:crtc-0] hw_done timed out
<3>[drm: 0xffff0000001768f4] *ERROR* [CRTC:33:crtc-0] flip_done timed out
<3>[drm: 0xffff00000017697c] *ERROR* [CONNECTOR:35:HDMI-A-1] hw_done timed out
<3>[drm: 0xffff0000001769a8] *ERROR* [CONNECTOR:35:HDMI-A-1] flip_done timed out
<3>[drm: 0xffff000000176a38] *ERROR* [PLANE:31:plane-0] hw_done timed out
<3>[drm: 0xffff000000176a64] *ERROR* [PLANE:31:plane-0] flip_done timed out
<3>[drm: 0xffff000000176a38] *ERROR* [PLANE:32:plane-1] hw_done timed out
<3>[drm: 0xffff000000176a64] *ERROR* [PLANE:32:plane-1] flip_done timed out
[ thread pid 1145 tid 101378 ]
Stopped at kdb_enter+0x48: str xzr, [x19]
db>
Running with a kernel/userlevel from #2080, I saw this kernel panic when starting an aarch64 Chromium web browser within an otherwise entirely purecap (kernel, userlevel, desktop) environment:
panic: Assertion vmap != NULL failed at /usr/src/sys/dev/drm/freebsd/drm_os_freebsd.c:370
The kernel build was:
FreeBSD cheri-blossom.sec.cl.cam.ac.uk 15.0-CURRENT FreeBSD 15.0-CURRENT #19 c18n_procstat-n268168-8e6f163a2c50: Tue Apr 9 02:29:44 UTC 2024 robert@cheri-blossom.sec.cl.cam.ac.uk:/usr/obj/usr/src/arm64.aarch64c/sys/GENERIC-MORELLO-PURECAP arm64
Async revocation and default enabled c18n are both turned on:
Console output:
KGDB on the crashdump reports:
The process in question was plasmashell.