Closed dumbbell closed 7 months ago
As mentioned in #236, there seem to be some issues with locking. The linuxkpi code wants to take a sleep lock and perform allocations in several places, which does not work well with the vtdev lock and window spinlocks.
The first issue that comes up when you try running the new system with witness enabled is that vd->vd_curwindow->vb_lock
is held during vt_flush
, and drm_fb_helper_damage
tries to add to the linuxkpi work queue in order to send the damage signal.
After bluntly patching that, the next issue is that vt_drmfb_postswitch
locks vtdev. This triggers a huge number of warnings as it performs allocations and eventually a reversal when the vt_upgrade sysinit is called and tries to lock vtdev while holding the giant lock.
After patching that, the final issue is once again in vt_flush, this time with calling bitblt.
Backtraces for the above:
panic: acquiring blockable sleep lock with spinlock or critical section held (sleep mutex) linux_wq_exec @ /root/freebsd-src/sys/compat/linuxkpi/common/src/linux_work.c:105
cpuid = 0
time = 1679351346
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe015c42ca60
vpanic() at vpanic+0x152/frame 0xfffffe015c42cab0
panic() at panic+0x43/frame 0xfffffe015c42cb10
witness_checkorder() at witness_checkorder+0xc5c/frame 0xfffffe015c42ccd0
__mtx_lock_flags() at __mtx_lock_flags+0x94/frame 0xfffffe015c42cd20
linux_queue_work_on() at linux_queue_work_on+0x9e/frame 0xfffffe015c42cd60
vt_drmfb_drawrect() at vt_drmfb_drawrect+0x5b/frame 0xfffffe015c42cd90
vt_flush() at vt_flush+0x2c4/frame 0xfffffe015c42cdf0
vt_timer() at vt_timer+0xe/frame 0xfffffe015c42ce10
softclock_call_cc() at softclock_call_cc+0x14b/frame 0xfffffe015c42cec0
softclock_thread() at softclock_thread+0xc6/frame 0xfffffe015c42cef0
fork_exit() at fork_exit+0x80/frame 0xfffffe015c42cf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe015c42cf30
Mar 21 02:47:49 daisy kernel: uma_zalloc_debug: zone "malloc-2048" with the following non-sleepable locks held:
Mar 21 02:47:49 daisy kernel: exclusive sleep mutex vtdev (vtdev) r = 0 (0xffffffff81aec5f8) locked @ /usr/home/audrey/proj/kernel/kernel/sys/dev/vt/vt_core.c:3057
Mar 21 02:47:49 daisy kernel: stack backtrace:
Mar 21 02:47:49 daisy kernel: #0 0xffffffff80c5ff65 at witness_debugger+0x65
Mar 21 02:47:49 daisy kernel: #1 0xffffffff80c610b9 at witness_warn+0x3f9
Mar 21 02:47:49 daisy kernel: #2 0xffffffff80f70f74 at uma_zalloc_debug+0x34
Mar 21 02:47:49 daisy kernel: #3 0xffffffff80f70a88 at uma_zalloc_arg+0x28
Mar 21 02:47:49 daisy kernel: #4 0xffffffff80bc1bff at malloc+0x7f
Mar 21 02:47:49 daisy kernel: #5 0xffffffff8532b8c0 at intel_atomic_state_alloc+0x20
Mar 21 02:47:49 daisy kernel: #6 0xffffffff854d93b4 at drm_client_modeset_commit_atomic+0x34
Mar 21 02:47:49 daisy kernel: #7 0xffffffff854d95c4 at drm_client_modeset_commit_locked+0x74
Mar 21 02:47:49 daisy kernel: #8 0xffffffff854d9751 at drm_client_modeset_commit+0x21
Mar 21 02:47:49 daisy kernel: #9 0xffffffff8551dd22 at __drm_fb_helper_restore_fbdev_mode_unlocked+0x82
Mar 21 02:47:49 daisy kernel: #10 0xffffffff8551f554 at drm_fb_helper_set_par+0x54 Mar 21 02:47:49 daisy kernel: #11 0xffffffff8546d9f7 at intel_fbdev_set_par+0x17
Mar 21 02:47:49 daisy kernel: #12 0xffffffff8551d825 at vt_drmfb_init+0xe5 Mar 21 02:47:49 daisy kernel: #13 0xffffffff80a358a8 at vt_replace_backend+0x118
Mar 21 02:47:49 daisy kernel: #14 0xffffffff80a3576b at vt_allocate+0x7b
Mar 21 02:47:49 daisy kernel: #15 0xffffffff8551c900 at linux_register_framebuffer+0x150
Mar 21 02:47:49 daisy kernel: #16 0xffffffff8551ffd3 at __drm_fb_helper_initial_config_and_unlock+0x4d3
Mar 21 02:47:49 daisy kernel: #17 0xffffffff8546cff0 at intel_fbdev_initial_config_async+0x20
Mar 21 02:47:49 daisy kernel: lock order reversal: (Giant after non-sleepable)
Mar 21 02:47:49 daisy kernel: 1st 0xffffffff81aec5f8 vtdev (vtdev, sleep mutex) @ /usr/home/audrey/proj/kernel/kernel/sys/dev/vt/vt_core.c:3057
Mar 21 02:47:49 daisy kernel: 2nd 0xffffffff81a02bc0 Giant (Giant, sleep mutex) @ /usr/home/audrey/proj/kernel/kernel/sys/compat/linuxkpi/common/src/linux_compat.c:2312
Mar 21 02:47:49 daisy kernel: lock order Giant -> vtdev established at:
Mar 21 02:47:49 daisy kernel: #0 0xffffffff80c5f26b at witness_checkorder+0x2fb
Mar 21 02:47:49 daisy kernel: #1 0xffffffff80bc8784 at __mtx_lock_flags+0x94
Mar 21 02:47:49 daisy kernel: #2 0xffffffff80a34e16 at vt_upgrade+0x386
Mar 21 02:47:49 daisy kernel: #3 0xffffffff80b73a08 at mi_startup+0x1e8
Mar 21 02:47:49 daisy kernel: #4 0xffffffff8038f023 at btext+0x23
Mar 21 02:47:49 daisy kernel: lock order vtdev -> Giant attempted at:
Mar 21 02:47:49 daisy kernel: #0 0xffffffff80c5fb23 at witness_checkorder+0xbb3
Mar 21 02:47:49 daisy kernel: #1 0xffffffff80bc8784 at __mtx_lock_flags+0x94
Mar 21 02:47:49 daisy kernel: #2 0xffffffff80e73693 at linux_wait_for_timeout_common+0x143
Mar 21 02:47:49 daisy kernel: #3 0xffffffff854cd110 at drm_atomic_helper_wait_for_flip_done+0x70
Mar 21 02:47:49 daisy kernel: #4 0xffffffff8535729a at intel_atomic_commit_tail+0xdba
Mar 21 02:47:49 daisy kernel: #5 0xffffffff85351945 at intel_atomic_commit+0x345 Mar 21 02:47:49 daisy kernel: #6 0xffffffff854d94d8 at drm_client_modeset_commit_atomic+0x158
Mar 21 02:47:49 daisy kernel: #7 0xffffffff854d95c4 at drm_client_modeset_commit_locked+0x74
Mar 21 02:47:49 daisy kernel: #8 0xffffffff854d9751 at drm_client_modeset_commit+0x21
Mar 21 02:47:49 daisy kernel: #9 0xffffffff8551dd22 at __drm_fb_helper_restore_fbdev_mode_unlocked+0x82
Mar 21 02:47:49 daisy kernel: #10 0xffffffff8551f554 at drm_fb_helper_set_par+0x54
Mar 21 02:47:49 daisy kernel: #11 0xffffffff8546d9f7 at intel_fbdev_set_par+0x17
Mar 21 02:47:49 daisy kernel: #12 0xffffffff8551d825 at vt_drmfb_init+0xe5
Mar 21 02:47:49 daisy kernel: #13 0xffffffff80a358a8 at vt_replace_backend+0x118
Mar 21 02:47:49 daisy kernel: #14 0xffffffff80a3576b at vt_allocate+0x7b
Mar 21 02:47:49 daisy kernel: #15 0xffffffff8551c900 at linux_register_framebuffer+0x150
Mar 21 02:47:49 daisy kernel: #16 0xffffffff8551ffd3 at __drm_fb_helper_initial_config_and_unlock+0x4d3
Mar 21 02:47:49 daisy kernel: #17 0xffffffff8546cff0 at intel_fbdev_initial_config_async+0x20
Mar 21 02:54:03 daisy kernel: panic: acquiring blockable sleep lock with spinlock or critical section held (sleep mutex) linux_wq_exec @ /usr/home/audrey/proj/kernel/kernel/sys/compat/linu xkpi/common/src/linux_work.c:105
Mar 21 02:54:03 daisy kernel: cpuid = 0
Mar 21 02:54:03 daisy kernel: time = 1679392069
Mar 21 02:54:03 daisy kernel: KDB: stack backtrace:
Mar 21 02:54:03 daisy kernel: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe010812b9b0
Mar 21 02:54:03 daisy kernel: vpanic() at vpanic+0x152/frame 0xfffffe010812ba00
Mar 21 02:54:03 daisy kernel: panic() at panic+0x43/frame 0xfffffe010812ba60
Mar 21 02:54:03 daisy kernel: witness_checkorder() at witness_checkorder+0xc5c/frame 0xfffffe010812bc20
Mar 21 02:54:03 daisy kernel: __mtx_lock_flags() at __mtx_lock_flags+0x94/frame 0xfffffe010812bc70
Mar 21 02:54:03 daisy kernel: linux_queue_work_on() at linux_queue_work_on+0x9e/frame 0xfffffe010812bcb0
Mar 21 02:54:03 daisy kernel: vt_drmfb_bitblt_text() at vt_drmfb_bitblt_text+0x20a/frame 0xfffffe010812bd90
Mar 21 02:54:03 daisy kernel: vt_flush() at vt_flush+0x433/frame 0xfffffe010812bdf0
Mar 21 02:54:03 daisy kernel: vt_timer() at vt_timer+0xe/frame 0xfffffe010812be10
Mar 21 02:54:03 daisy kernel: softclock_call_cc() at softclock_call_cc+0x14b/frame 0xfffffe010812bec0
Mar 21 02:54:03 daisy kernel: softclock_thread() at softclock_thread+0xc6/frame 0xfffffe010812bef0
Mar 21 02:54:03 daisy kernel: fork_exit() at fork_exit+0x80/frame 0xfffffe010812bf30
Mar 21 02:54:03 daisy kernel: fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe010812bf30
Mar 21 02:54:03 daisy kernel: --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Here's the patches I came up with. I have no idea what invariants exactly these locks are supposed to guard, so I cannot comment on their correctness :)
kernel:
diff --git a/sys/dev/vt/vt_core.c b/sys/dev/vt/vt_core.c
index 267dd7bf2489..80002b9c21f9 100644
--- a/sys/dev/vt/vt_core.c
+++ b/sys/dev/vt/vt_core.c
@@ -1406,17 +1406,23 @@ vt_flush(struct vt_device *vd)
vd->vd_flags &= ~VDF_INVALID;
a = teken_get_curattr(&vw->vw_terminal->tm_emulator);
+ /*
+ * The below calls may need to run a sleep lock in order to
+ * dispatch linuxkpi work
+ */
+ vtbuf_unlock(&vw->vw_buf);
vt_set_border(vd, &vw->vw_draw_area, a->ta_bgcolor);
vt_termrect(vd, vf, &tarea);
if (vd->vd_driver->vd_invalidate_text)
vd->vd_driver->vd_invalidate_text(vd, &tarea);
if (vt_draw_logo_cpus)
vtterm_draw_cpu_logos(vd);
+ vtbuf_lock(&vw->vw_buf);
}
if (tarea.tr_begin.tp_col < tarea.tr_end.tp_col) {
- vd->vd_driver->vd_bitblt_text(vd, vw, &tarea);
vtbuf_unlock(&vw->vw_buf);
+ vd->vd_driver->vd_bitblt_text(vd, vw, &tarea);
return (1);
}
drm-kmod:
diff --git a/drivers/gpu/drm/vt_drmfb.c b/drivers/gpu/drm/vt_drmfb.c
index 116121e14f..62ef106d0c 100644
--- a/drivers/gpu/drm/vt_drmfb.c
+++ b/drivers/gpu/drm/vt_drmfb.c
@@ -257,6 +257,7 @@ vt_drmfb_postswitch(struct vt_device *vd)
{
struct fb_info *fbio;
struct linux_fb_info *info;
+ int locked = mtx_owned(&vd->vd_lock);
fbio = vd->vd_softc;
info = to_linux_fb_info(fbio);
@@ -266,8 +267,14 @@ vt_drmfb_postswitch(struct vt_device *vd)
}
if (!kdb_active && !KERNEL_PANICKED()) {
+ if (locked) {
+ VT_UNLOCK(vd);
+ }
linux_set_current(curthread);
info->fbops->fb_set_par(info);
+ if (locked) {
+ VT_LOCK(vd);
+ }
} else {
#ifdef DDB
db_trace_self_depth(10);
I'm not sure that releasing the window buffer can work in vt_flush()
. I see two possible issues:
What I'm experimenting with on my side is to change the lock type of the lock protecting the workqueue
implementation in linuxkpi (a "default" mutex so far) to the same type as the one used in drm_fb_helper
(a "spin" mutex). I'm not sure it is the perfect solution either though...
Could you post the patch which does that so I can test it? I found some fairly intense graphical corruption switching to hikari while driving my patches so I suspect you may be right.
@rhelmot: The computer crashes hard with my wip patch unfortunately.
I think the corruption you see in Hikari is unrelated. The vt(4) integration layer only drives the display of the console, the text terminals before starting an X.Org or Wayland session. Once under X.Org/Wayland, only the DRM drivers are used exclusively and there should be no corruption.
Note that with the 5.17 drivers on my Intel 12th gen GPU, Wayland (using Sway) is unusable for me (only garbage on the screen). It works with 5.15-lts however, so this corruption must be fixed at some point in 5.18+. I need to take a picture of the screen and share it to compare with what you see.
In case this helps: I have similar console problems in Void Linux with kernel 6.1 and the Nvidia proprietary driver. Switching to a new virtual console with ctrl-alt-fnum works, and scrolls normally until you get to the bottom of the screen. It then appears to freeze, but switching back and forth between consoles causes it to refresh.
Hello!
I've been away in July. I'm back home and slowly resume work on this vt(4) integration and the Linux 5.17 DRM drivers port.
After polishing things a bit today, including commit history and commit messages, I'm starting to use the 5.17 DRM drivers on a daily basis. This includes the new vt(4) integration. We'll see how stable it is or not.
Note that there is one thing left to work on: the support for masks in the bitblt operation. To display the mouse pointer in the console, FreeBSD uses a mask it passes to bitblt. Linux doesn't have such a thing. I think it supports an alpha channel but I'm not sure. Anyway, currently the mask is commented out in the new vt(4) integration which means the mouse pointer is never displayed for now.
I have added mouse pointer mask support this morning. The new vt(4) integration is complete and is now ready for wider testing. I admit I only tested it as part of the Linu 5.17 port so far.
Note that this new integration depends on changes committed to the linuxkpi-updates-for-drm
branch in my freebsd-src fork.
Once I'm more confident with the quality, I will submit reviews to reviews.freebsd.org for the patchs to freebsd-src and update the pull request description with links to the reviews.
@JustAnotherHumanBeing reported the following problem with the vt(4) integration in #236:
So far, I noticed only one problem after installing the linuxkpi-updates-for-drm kernel and the i915kms driver from the 5.17 branch. If I have the setting kern.vt.splash_cpu=1 in my loader.conf and the splash logos are still on the screen when it's time for i915kms to take over (it could be the case if kern.vt.splash_cpu_duration is set to the value that is not too small), then the logos will become invisible and the area of the screen that would otherwise have been occupied by the logos will become inaccessible by any tty-based app until the value of kern.vt.splash_cpu_duration has expired.
I just pushed some fixes to both freebsd-src and the update-to-v5.17 branch (which are meant for this pull request, just not had time to rebase yet). However, I didn't look at the issue reported above. I will look at it next.
I found the problem with missing CPU logos. The nested loops in fb_fillrect()
was incorrect and never drawn anything.
I'm still not satisfied with the overall patch though. Still doing some improvements before I push it again to GitHub.
Ok, new commits pushed. Compared to last week, I changed the overall approach.
Initially, I changed vt(4) to support vd_pre_bitblt_text()
and post_bitblt_text()
callbacks to allow the driver to run code outside of the vt_buf lock. In the context of the integration with DRM, it allowed the new integration vt_drmfb
to proceed with the display of the buffer after the console content was drawn to that buffer. That's the code that acquires default mutex and can't run inside a held spin mutex section.
This worked well. Unfortunately, it meant that the screen was actually refreshed once after the entire buffer was updated. With the old integration, the screen was refreshed after each displayed character. Because of that, showing a lot of content like cat text-file-with-thousands-of-lines.txt
looked really sluggish, even though the overall time to display the content was the same. The reason was that the screen was refreshed less frequently and a lot of content changed at once.
The new approach is based on a flag set by the driver. If the driver sets this flag, the vt_flush()
function in vt(4) uses its internal vd_drawn*
arrays and a new vd_pos_to_flus
array to dump everything that needs to be refreshed. It then unlocks vt_buf and calls the driver. Basically steps are split and executed in a slightly different order compared to vt_flush()
with the flag set to false (the default).
This time, the driver can refresh the screen after each character and still run outside of the vt_buf-locked section.
When I compare the speed at which cat(1) can write that big text file to the console, the new integration is a couple percent faster. We are talking 1.35 seconds with the old integration vs. 1.30 seconds with the new one for a file with 500,000 lines on my Intel 12th gen laptop.
The problem reported by @JustAnotherHumanBeing is also fixed.
I'm going to mark this pull request as ready to review again.
I believe the instabilities I saw with the amdgpu
driver (both with master
and update-to-v5.17
are now solved, thanks to the commit titled "vt(4) integration: Move fictitious range (un)register out of (un)register_framebuffer()".
I'm not marking this pull request as ready for review yet. With the change of GFP_KERNEL
from M_WAITOK
to M_NOWAIT
, it broke the iwlwifi driver's suspend & resume. I'm working on fixing this.
The iwlwifi driver's suspend & resume support is not yet implemented, so my change to GFP_KERNEL
didn't make it any worse.
I pushed another change that make it possible to kldunload(8) the i915kms
driver. I didn't try with the amdgpu
yet. The screen remains black afterwards though.
I'm using the 5.17 update on a daily basis, which includes this new vt(4) integration. I believe it's ready for wider testing and review!
I didn't submit reviews for the patches to freebsd-src yet. Before I do that, I would love some feedback on the overall approach :-)
Could this rewrite fix some issues I have with the nvidia driver and waking up from suspend to RAM (screen comes back up blank or all garbled, but if I start a new X server, that X server works (while previously started X servers do not)).
I don't think it will fix anything related to the NVIDIA proprietary driver. I'm not sure it relies on DRM in the first place.
I really would like to test on my Framework Gen 13 Laptop.
Is there any guide how to do this?
I really would like to test on my Framework Gen 13 Laptop.
Is there any guide how to do this?
You could look at #236, but I doubt Intel 13th gen was already supported in Linux 5.17.
I'm marking this pull request as a draft again. There are regressions around the interaction with kdb/ddb.
I just discovered that drm_fb_helper
exposes two dedicated functions as the fb_debug_enter
and fb_debug_leave
fbdev callbacks. I'm experimenting with them. This will require small changes to vt_core.c
in base as well.
In the process, I stumble upon a problem I saw in the past (but didn't have a solution for): drm_fb_helper
wants to use a callbacks structure which is NULL for the i915 driver. I wonder how this works on Linux because I can't find a place where the driver set it.
The fb_debug_{enter,leave}
lead is not the correct one. I was surprised the code seems to be missing from i915, but in fact, it's on purpose:
commit 9c79e0b1d096c06e4b012a32dddb04be0ce9716e
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Mon Apr 3 10:32:59 2017 +0200
drm/fb-helper: Give up on kgdb for atomic drivers
It just doesn't work. It probably stopped working way, way before that
(e.g. i915 grabbed random mutexes all over in modeset code at least
since gen6), but with atomic and all the ww_mutex stuff it's indeed
hopeless.
Remove ->mode_set_base_atomic from the 2 atomic drivers (i915 and
nouveau) that still had one (both had dummy implementations already
anyway), and shunt atomic drivers in the helpers debug_enter/leave
functions.
(...)
I think the new vt(4) integration is good enough for the time being w.r.t. panic handling. At least it's not worse than the current one in my testing.
We can try to improve it iteratively, but it shouldn't block 5.17 anymore IMHO. I'm marking this pull request as ready to review again.
This patch series replaces our current vt(4) integration layer with a new one relying entirely on
drm_fb_helper.c
.Why
In Linux 5.17, the amdgpu driver relies entirely on
drm_fb_helper
to handle its framebuffer. Indeed, the framebuffer-specific code is removed.Also in Linux 5.17, something changed in the i915 driver. I don't know exactly what, but the memory region passed to vt(4) and
vt_fb
isn't the one displayed on the screen. Therefore, all writes made byvt_fb
leads to no changes on the screen and the computer looks frozen (that's not the case). However, when I vt-switch to another tty, the screen is refreshed to show the content of the tty active before the switch. Switching tty is the only timevt_fb
calls into DRM. There is enough evidence that the i915 driver now assumes that the framebuffer is accessed likedrm_fb_helper
does it, i.e. like a Linux fbdev device.The symptoms are the same when we enable PSR in the i915 driver (
enable_psr=1
). It's currently turned off by default on FreeBSD, unlike Linux.Unfortunately, our vt(4) integration is way more basic than than Linux' fbdev and doesn't reproduce the behavior of an fbdev device.
How
The branch contains two major parts:
drm_fb_helper.c
and related code in DRM drivers. This makes sure that the code compiles, even though it is not used at that point.vt_drmfb
, compiled indrm.ko
. This new backend is responsible for bridging the vt(4) callbacks (upper layer) and the fbdev callbacks (lower layer).This work relies on patches to freebsd-src. They are available in the
linuxkpi-updates-for-drm
branch in my freebsd-src fork. Patches were submitted for reviews to Phabricator:https://reviews.freebsd.org/D42054https://reviews.freebsd.org/D42056https://reviews.freebsd.org/D42057https://reviews.freebsd.org/D42750https://reviews.freebsd.org/D42751https://reviews.freebsd.org/D42752TODO
vtbuf_lock()
invt_flush()
andWQ_EXEC_LOCK()
inschedule_work()
in linuxkpi (sleepable lock acquired inside a non-sleepable lock section).bitblt_image
callback mask argument; there is no equivalent on the fbdev side.This work is required by the Linux 5.17 backport (#236).
Fixes #17.