QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Full system lockup when resizing a VM window on dom0 kernel 5.17.4-2 or 5.17.5-1 #7481

Open icequbes1 opened 2 years ago

icequbes1 commented 2 years ago

Qubes OS release

R4.1

Brief summary

On a ThinkPad T460s, attempting to resize a VM window (such as xterm) locks up the machine. There is no response to cursor/mouse movement or keyboard input; a hard reset is the only option. The issue was not present under dom0 kernel 5.16.18-2, and has been observed under dom0 kernels 5.17.4-2 and 5.17.5-1.

Steps to reproduce

  1. Boot physical machine, login
  2. Start any qube. For example, an xterm window under a new fedora-dvm or debian-dvm
  3. Attempt to resize the window using the mouse by dragging the lower-right corner

Expected behavior

Window gets resized.

Actual behavior

The window initially responds to the resize while the mouse button is still held, but then stops responding.

The cursor remains the "resizing" cursor after releasing the mouse. The cursor responds to mouse movement for 1-3 more seconds (e.g., a "shake") but then freezes on screen.

No response to keyboard input such as ctrl+alt+f2. No disk activity seen.

A hard reset of the physical machine is the only recovery available.

No logs appear on screen (even with a sudo journalctl -f window open in dom0), and nothing relevant shows up when inspecting the journal after reboot.
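If the journal is empty after a hard reset, one possible cause is that dom0's journald is using volatile (in-memory) storage, so crash-time logs never reach disk. Persisting the journal is a standard systemd option, not Qubes-specific; a sketch of the setting:

```ini
# /etc/systemd/journald.conf (dom0): keep logs across reboots
[Journal]
Storage=persistent
```

After `sudo systemctl restart systemd-journald`, `journalctl -b -1` will show the previous boot's log, which may capture messages emitted just before the lockup.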

The only clue as to the cause is that a top -d 0.5 window sometimes shows the Xorg process at the top of the list, in the D (uninterruptible sleep) state.
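For what it's worth, D-state tasks can also be listed without top by scanning /proc directly. A generic Linux sketch (not Qubes-specific):

```python
import os

def blocked_tasks():
    """Return (pid, comm) for all tasks in D (uninterruptible sleep) state."""
    hung = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # task exited while we were scanning
        # comm may contain spaces, so parse around the parentheses
        comm = stat[stat.index("(") + 1:stat.rindex(")")]
        state = stat[stat.rindex(")") + 2]
        if state == "D":
            hung.append((int(pid), comm))
    return hung

if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(f"{pid}\t{comm}")
```

Run in a loop, this catches transient D states (like the Xorg one above) that top can easily miss between refreshes.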

Other tests performed:

icequbes1 commented 2 years ago

This issue may be closely related to #7479; however, #7479 does not describe a full system lockup, and that case appears recoverable, with the user being brought back to the login screen.

icequbes1 commented 2 years ago

Issue persists with kernel-latest-5.17.7-1.

DemiMarie commented 2 years ago

Does SysRq still work?

icequbes1 commented 2 years ago

Ah, yes. Now we're getting somewhere. The system does respond to SysRq.

The blocked-task dump (sysrq+w) shows Xorg and an i915 workqueue blocked when the mouse is released (while the cursor still responds to movement, but remains "locked" as the window-resizing cursor):

May 13 12:15:32 dom0 kernel: sysrq: Show Blocked State
May 13 12:15:32 dom0 kernel: task:Xorg            state:D stack:    0 pid: 3147 ppid:  3111 flags:0x00000004
May 13 12:15:32 dom0 kernel: Call Trace:
May 13 12:15:32 dom0 kernel:  <TASK>
May 13 12:15:32 dom0 kernel:  __schedule+0x222/0x620
May 13 12:15:32 dom0 kernel:  schedule+0x4e/0xc0
May 13 12:15:32 dom0 kernel:  schedule_timeout+0x119/0x150
May 13 12:15:32 dom0 kernel:  wait_for_completion+0xa3/0x100
May 13 12:15:32 dom0 kernel:  gnttab_unmap_refs_sync+0xc3/0xf0
May 13 12:15:32 dom0 kernel:  __unmap_grant_pages+0xb1/0x1f0 [xen_gntdev]
May 13 12:15:32 dom0 kernel:  ? gnttab_unmap_refs_async+0x60/0x60
May 13 12:15:32 dom0 kernel:  ? apply_wqattrs_cleanup.part.0+0xb0/0xb0
May 13 12:15:32 dom0 kernel:  ? gnttab_dma_free_pages+0xe0/0xe0
May 13 12:15:32 dom0 kernel:  gntdev_invalidate.part.0.isra.0+0x53/0xa0 [xen_gntdev]
May 13 12:15:32 dom0 kernel:  mn_itree_invalidate+0x73/0xc0
May 13 12:15:32 dom0 kernel:  __mmu_notifier_invalidate_range_start+0x44/0x50
May 13 12:15:32 dom0 kernel:  unmap_vmas+0xea/0x100
May 13 12:15:32 dom0 kernel:  unmap_region+0xbd/0x120
May 13 12:15:32 dom0 kernel:  __do_munmap+0x1f5/0x4e0
May 13 12:15:32 dom0 kernel:  __vm_munmap+0x75/0x120
May 13 12:15:32 dom0 kernel:  __x64_sys_munmap+0x17/0x20
May 13 12:15:32 dom0 kernel:  do_syscall_64+0x3b/0x90
May 13 12:15:32 dom0 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
May 13 12:15:32 dom0 kernel: RIP: 0033:0x7bec959ee37b
May 13 12:15:32 dom0 kernel: RSP: 002b:00007ffc192ffaf8 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
May 13 12:15:32 dom0 kernel: RAX: ffffffffffffffda RBX: 00000000000001be RCX: 00007bec959ee37b
May 13 12:15:32 dom0 kernel: RDX: 00007ffc192ffb10 RSI: 00000000001be000 RDI: 00007bec765bc000
May 13 12:15:32 dom0 kernel: RBP: 00007bec765bc000 R08: 0000000000000008 R09: 0000000000000000
May 13 12:15:32 dom0 kernel: R10: 0000000000000017 R11: 0000000000000206 R12: 0000000000000009
May 13 12:15:32 dom0 kernel: R13: 00005669c536b128 R14: 0000000000000061 R15: 00005669c3840e00
May 13 12:15:32 dom0 kernel:  </TASK>
May 13 12:15:32 dom0 kernel: task:kworker/u8:1    state:D stack:    0 pid: 4865 ppid:     2 flags:0x00004000
May 13 12:15:32 dom0 kernel: Workqueue: i915 __i915_gem_free_work [i915]
May 13 12:15:32 dom0 kernel: Call Trace:
May 13 12:15:32 dom0 kernel:  <TASK>
May 13 12:15:32 dom0 kernel:  __schedule+0x222/0x620
May 13 12:15:32 dom0 kernel:  schedule+0x4e/0xc0
May 13 12:15:32 dom0 kernel:  mmu_interval_notifier_remove+0xec/0x1b0
May 13 12:15:32 dom0 kernel:  ? do_wait_intr_irq+0xa0/0xa0
May 13 12:15:32 dom0 kernel:  i915_gem_userptr_release+0x15/0x30 [i915]
May 13 12:15:32 dom0 kernel:  __i915_gem_free_object+0x4e/0x110 [i915]
May 13 12:15:32 dom0 kernel:  __i915_gem_free_objects+0xb7/0x120 [i915]
May 13 12:15:32 dom0 kernel:  process_one_work+0x1e5/0x3b0
May 13 12:15:32 dom0 kernel:  worker_thread+0x49/0x2e0
May 13 12:15:32 dom0 kernel:  ? rescuer_thread+0x3a0/0x3a0
May 13 12:15:32 dom0 kernel:  kthread+0xe7/0x110
May 13 12:15:32 dom0 kernel:  ? kthread_complete_and_exit+0x20/0x20
May 13 12:15:32 dom0 kernel:  ret_from_fork+0x22/0x30
May 13 12:15:32 dom0 kernel:  </TASK>

Moving the mouse some more, the system then stops responding to mouse movement entirely; a new blocked task, InputThread, appears:

May 13 12:16:06 dom0 kernel: task:InputThread     state:D stack:    0 pid: 3327 ppid:  3111 flags:0x00000000
May 13 12:16:06 dom0 kernel: Call Trace:
May 13 12:16:06 dom0 kernel:  <TASK>
May 13 12:16:06 dom0 kernel:  __schedule+0x222/0x620
May 13 12:16:06 dom0 kernel:  schedule+0x4e/0xc0
May 13 12:16:06 dom0 kernel:  rwsem_down_write_slowpath+0x20a/0x440
May 13 12:16:06 dom0 kernel:  down_write_killable+0x3b/0x50
May 13 12:16:06 dom0 kernel:  do_mprotect_pkey+0xc0/0x3b0
May 13 12:16:06 dom0 kernel:  __x64_sys_mprotect+0x1b/0x20
May 13 12:16:06 dom0 kernel:  do_syscall_64+0x3b/0x90
May 13 12:16:06 dom0 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
May 13 12:16:06 dom0 kernel: RIP: 0033:0x7bec959ee3ab
May 13 12:16:06 dom0 kernel: RSP: 002b:00007bec83ffdbe8 EFLAGS: 00000206 ORIG_RAX: 000000000000000a
May 13 12:16:06 dom0 kernel: RAX: ffffffffffffffda RBX: 00007bec78000020 RCX: 00007bec959ee3ab
May 13 12:16:06 dom0 kernel: RDX: 0000000000000003 RSI: 0000000000001000 RDI: 00007bec78021000
May 13 12:16:06 dom0 kernel: RBP: 0000000000000c20 R08: 0000000000021000 R09: 0000000000022000
May 13 12:16:06 dom0 kernel: R10: 0000000000000c40 R11: 0000000000000206 R12: 0000000000000bb0
May 13 12:16:06 dom0 kernel: R13: 0000000000001000 R14: 00007bec78020450 R15: fffffffffffff000
May 13 12:16:06 dom0 kernel:  </TASK>

The traces above are from dom0 kernel-latest-5.17.7-1 (current-testing), captured today.
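Long sysrq+w dumps like the ones above can be skimmed with a short script that pulls out just the blocked-task headers. A throwaway sketch; the regex assumes the `task:NAME state:D stack: N pid: N` format shown in these logs:

```python
import re

# Matches lines such as:
#   ... kernel: task:Xorg   state:D stack:    0 pid: 3147 ppid:  3111 ...
TASK_RE = re.compile(r"task:(\S+)\s+state:(\S)\s+stack:\s*\d+\s+pid:\s*(\d+)")

def blocked_from_dump(dump):
    """Yield (name, state, pid) for every task header in a sysrq-w dump."""
    for line in dump.splitlines():
        m = TASK_RE.search(line)
        if m:
            yield m.group(1), m.group(2), int(m.group(3))

# Two task headers copied from the dump in this issue:
sample = """\
May 13 12:15:32 dom0 kernel: task:Xorg            state:D stack:    0 pid: 3147 ppid:  3111 flags:0x00000004
May 13 12:15:32 dom0 kernel: task:kworker/u8:1    state:D stack:    0 pid: 4865 ppid:     2 flags:0x00004000
"""

if __name__ == "__main__":
    for name, state, pid in blocked_from_dump(sample):
        print(name, state, pid)
```

Feeding it the whole journal for a boot gives a quick list of which tasks were stuck without scrolling through every call trace.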

icequbes1 commented 2 years ago

Forgot to add: the lockup only occurs when resizing VM windows. Resizing dom0 windows doesn't cause a lockup.

DemiMarie commented 2 years ago

Looks like a kernel bug: if I understand the code and the stack traces correctly, i915 is waiting for xen_gntdev to finish its MMU notifier invalidation callback, but xen_gntdev is waiting for the unmapping of the pages to complete. That, in turn, cannot happen until i915 drops its references to the pages. Result: deadlock.
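The circular wait described above can be modeled as a wait-for graph, where a cycle is exactly a deadlock. This is only an illustrative model of the dependency chain as read from the traces; the node names are descriptive labels, not kernel objects:

```python
def find_cycle(wait_for):
    """DFS for a cycle in a wait-for graph {waiter: [waited-on, ...]}."""
    def dfs(node, path):
        if node in path:
            return path[path.index(node):] + [node]
        for nxt in wait_for.get(node, []):
            cycle = dfs(nxt, path + [node])
            if cycle:
                return cycle
        return None
    for start in wait_for:
        cycle = dfs(start, [])
        if cycle:
            return cycle
    return None

# The dependency chain from the stack traces, as described above:
wait_for = {
    "Xorg (munmap)":           ["xen_gntdev (invalidate)"],  # munmap waits on the notifier
    "xen_gntdev (invalidate)": ["i915 (page refs)"],         # unmap waits on refs being dropped
    "i915 (page refs)":        ["xen_gntdev (invalidate)"],  # free_work waits on notifier removal
}

print(" -> ".join(find_cycle(wait_for)))
```

The cycle between xen_gntdev and i915 is the hang; Xorg is merely the first victim that stumbles into it via munmap.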

I don’t see any reason for the invalidate callback to block on the pages actually being released back to Xen. I’ll send a patch to use the asynchronous version instead, which avoids waiting for the pages to stop being used. As a bonus, the notifier can then succeed even when called in a context where sleeping is not possible. Edit: this would apparently break AIO and Direct I/O to network filesystems, for reasons I do not understand. That situation is unlikely to arise in Qubes, but it still needs to be looked at before this can go upstream.

DemiMarie commented 2 years ago

@icequbes1 what name and email would you like me to use in the Reported-by?

DemiMarie commented 2 years ago

I am not marking this as “diagnosed” because there could be a deeper problem somewhere in the Linux kernel memory management system, or a bug in the i915 driver. However, QubesOS/qubes-linux-kernel#583 should at least avoid the hang.

icequbes1 commented 2 years ago

The name "icequbes1" is fine; otherwise I don't mind any attribution.

I'll give the PR a try sometime over the next 2 days.

I have only seen this issue on my T460s machine. The T430 doesn't experience this. Both have integrated GPUs and use the i915 driver.

DemiMarie commented 2 years ago


Please ignore the PR for now; I don’t think it even compiles.

DemiMarie commented 2 years ago

This has been fixed in the 5.18 branch by https://github.com/gregkh/linux/commit/d4a49d20cd7cdb6bd075cd04c2cd00a7eba907ed and in the 5.15 branch by https://github.com/gregkh/linux/commit/87a54feba68f5e47925c8e49100db9b2a8add761. Patches for other LTS kernels have been accepted and should appear in the next release of those kernels. Once all Qubes-provided kernels have been patched, this issue can be closed.

loztcf commented 2 years ago

This issue persists with kernel 5.18.9-1 in dom0. I don't have this issue when using the modesetting driver, which sadly produces graphical artifacts/glitches instead of the freezing behaviour.

DemiMarie commented 2 years ago

@loztcf Can you use sysrq+w to get a stack trace of all blocked tasks?

loztcf commented 1 year ago

@DemiMarie actually I can't. SysRq is fully enabled, but I'm not even able to use SysRq+r or SysRq+o. The system just doesn't respond after the lockup; the only way out is a hard reset.

DemiMarie commented 1 year ago

@loztcf I suspect the Intel driver and the kernel driver are not getting along. The graphics artifacts with the modesetting driver should be fixed separately. Can you try the modesetting driver and see if that makes the freezing go away?