marmarek opened this issue 9 years ago (status: Open)
Currently vchan rejects opening such a connection, so at least there is no kernel crash. But it would still be nice to fix this properly.
@marmarek Is this still an issue, considering both Xen and Qubes have had newer releases since then? You say it might need a fix in the kernel; do you mean the Xen microkernel?
@esote Yes, this is still an issue, exactly as originally described. Just checked on Xen 4.12-unstable and Linux 4.14.74. One possible fix would be in the Linux kernel, but I'm not sure that's the right thing to do.
BTW, many thanks @esote for reviewing and cleaning up old issues!
Yep, no problem. I usually don't have time to dive into code outside of college and work, so I figured cleaning up issues is the least I could do.
For patching the Linux kernel, how likely would it be to end up in a long-term release (4.14 or 4.19), or would it be a patch only for Qubes' kernel?
I haven't looked at the vchan code, so I flipped a coin, assigning "kernel" as heads and "vchan" as tails, and it landed on heads. If that helps, because otherwise my input would be essentially a coin flip.
Exact message from 4.14.74 kernel:
[1332916.029255] BUG: unable to handle kernel paging request at ffff880850d7b008
[1332916.029290] IP: __tlb_remove_page_size+0x29/0xc0
[1332916.029306] PGD 2a75067 P4D 2a75067 PUD 0
[1332916.029325] Oops: 0002 [#1] SMP PTI
[1332916.029339] Modules linked in: fuse ip6table_filter ip6_tables xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c xen_netfront intel_rapl crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcspkr intel_rapl_perf u2mfn(O) xen_gntdev xen_gntalloc xenfs xen_blkback xen_privcmd xen_evtchn xen_blkfront
[1332916.029457] CPU: 0 PID: 23450 Comm: strace Tainted: G O 4.14.74-1.pvops.qubes.x86_64 #1
[1332916.029484] task: ffff880033f3bc80 task.stack: ffffc90003688000
[1332916.029506] RIP: 0010:__tlb_remove_page_size+0x29/0xc0
[1332916.029523] RSP: 0018:ffffc9000368bca0 EFLAGS: 00010246
[1332916.029540] RAX: ffff880050d7b000 RBX: ffffc9000368bdd0 RCX: 0000000000000000
[1332916.029563] RDX: 00000000ffffffff RSI: ffffea0002ca7c00 RDI: ffffc9000368bdd0
[1332916.029587] RBP: ffffea0002ca7c00 R08: 00000000000247e0 R09: ffff88009a493898
[1332916.029610] R10: 00000000000fa000 R11: 0000000000000001 R12: 000058c11e2a0000
[1332916.029634] R13: 000058c11e2a1000 R14: ffffc9000368bdd0 R15: ffffea0002ca7c00
[1332916.029658] FS: 0000000000000000(0000) GS:ffff8800f9c00000(0000) knlGS:0000000000000000
[1332916.029682] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1332916.029702] CR2: ffff880850d7b008 CR3: 000000000220a006 CR4: 00000000001606f0
[1332916.029728] Call Trace:
[1332916.029742] unmap_page_range+0x86c/0xc50
[1332916.029757] unmap_vmas+0x4c/0xa0
[1332916.029772] exit_mmap+0xb5/0x1c0
[1332916.029787] mmput+0x5f/0x140
[1332916.029801] do_exit+0x288/0xbb0
[1332916.029816] ? __audit_syscall_entry+0xae/0x100
[1332916.029834] ? syscall_trace_enter+0x1ae/0x2c0
[1332916.029851] do_group_exit+0x3a/0xa0
[1332916.029865] SyS_exit_group+0x10/0x10
[1332916.029879] do_syscall_64+0x74/0x180
[1332916.035058] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[1332916.035077] RIP: 0033:0x7fa9a2c23a26
[1332916.035090] RSP: 002b:00007ffcbaf5dbe8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[1332916.035114] RAX: ffffffffffffffda RBX: 00007fa9a2d16740 RCX: 00007fa9a2c23a26
[1332916.035138] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[1332916.035161] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
[1332916.035185] R10: 00007ffcbaf5da74 R11: 0000000000000246 R12: 00007fa9a2d16740
[1332916.035210] R13: 0000000000000001 R14: 00007fa9a2d1f448 R15: 0000000000000000
[1332916.035234] Code: 00 00 0f 1f 44 00 00 48 83 7f 18 00 74 4a 55 53 39 97 84 00 00 00 75 42 48 8b 47 28 48 89 f5 48 89 fb 8b 50 08 8d 4a 01 89 48 08 <48> 89 74 d0 10 8b 50 0c 39 d1 74 09 31 c0 39 ca 72 21 5b 5d c3
[1332916.035318] RIP: __tlb_remove_page_size+0x29/0xc0 RSP: ffffc9000368bca0
[1332916.035338] CR2: ffff880850d7b008
[1332916.035353] ---[ end trace b871d7772ace7b61 ]---
[1332916.035370] Kernel panic - not syncing: Fatal exception
[1332916.035576] Kernel Offset: disabled
Should we see if we can get this patched upstream? This is clearly a kernel and/or Xen bug.
This issue is being closed because it was reported against Qubes OS 4.0, which has reached end of life (EOL).
If anyone believes that this issue should be reopened and reassigned to an active milestone, please leave a brief comment. (For example, if a bug still affects Qubes OS 4.1, then the comment "Affects 4.1" will suffice.)
This should probably be reopened, given this comment by marmarek. See also my comment below his in that issue for why it's useful to have the ability to copy from and to the same VM, requiring these "loopback" connections. Another case where an issue (fixable, but only with an additional policy) is caused by this can be found here.
The reason for closing was EOL of 4.0, but issue still occurs in 4.1, which is not EOL.
On R4.2 it doesn't crash anymore! This is with kernel 6.1.62-1.qubes.fc37.x86_64.
Update: libvchan_client_init_async does get called. qrexec-client-vm and the fork server both deadlock in xenevtchn_pending, which is ultimately called by libvchan_recv. This is despite libvchan_send having been called. I suspect a kernel or hypervisor bug is preventing the event channel from being signalled.

Nope, event channels aren't the problem. I think (but am not certain) that something related to grants is. I managed to trigger a crash by killing qrexec-fork-server with a loopback vchan open.
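For reference, here is a minimal sketch of the loopback scenario described above, assuming the qubes-core-vchan-xen API (libvchan_server_init / libvchan_client_init taking a domid and port, linked with -lvchan-xen); OWN_DOMID and PORT are placeholders, and the exact behaviour (connect failure, deadlock in recv, or the old-kernel oops on exit) depends on the kernel and Xen versions in use:

```c
/* Hypothetical loopback reproducer, untested; API names per
 * qubes-core-vchan-xen.  Both endpoints are opened against this VM's
 * own domain ID, which is the unsupported case. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>
#include <libvchan.h>

#define OWN_DOMID 5   /* placeholder: this VM's own domid */
#define PORT      1234

int main(void)
{
    /* Server side of the loopback connection, created first so the
     * xenstore entry exists before the client tries to connect. */
    libvchan_t *srv = libvchan_server_init(OWN_DOMID, PORT, 4096, 4096);
    if (!srv) { fprintf(stderr, "server_init failed\n"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {
        /* Client side, connecting back to our own domain.  In the
         * R4.2 behaviour described above, this connects but then
         * blocks forever in libvchan_recv() / xenevtchn_pending(). */
        char buf[4];
        libvchan_t *cli = libvchan_client_init(OWN_DOMID, PORT);
        if (!cli) { fprintf(stderr, "client_init failed\n"); _exit(1); }
        if (libvchan_recv(cli, buf, sizeof(buf)) < 0)
            fprintf(stderr, "recv failed\n");
        libvchan_close(cli);
        _exit(0);
    }

    /* Wait for the client, send one message, then clean up.  Per the
     * comment above, the send completes but the peer is never woken. */
    libvchan_wait(srv);
    libvchan_send(srv, "ping", 4);
    waitpid(pid, NULL, 0);
    libvchan_close(srv);
    return 0;
}
```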
What if loopback connections were automatically handled via libvchan-socket instead of libvchan-xen?
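If that direction were explored, the dispatch could in principle be as simple as comparing the target domid with our own. The following is purely a conceptual sketch, not actual libvchan code: the backend constructors and own_domid() are hypothetical stand-ins (the real libvchan-xen and libvchan-socket libraries both export the same libvchan_* symbols, so they could not simply be linked side by side like this), and the stubs only print which backend would be chosen:

```c
/* Conceptual sketch only; all names below are made up for illustration. */
#include <stdio.h>

typedef struct { const char *backend; } vchan_t;

static vchan_t xen_backend    = { "libvchan-xen (grant/event-channel based)" };
static vchan_t socket_backend = { "libvchan-socket (UNIX socket based)" };

static vchan_t *xen_vchan_client_init(int domid, int port)
{ (void)domid; (void)port; return &xen_backend; }

static vchan_t *socket_vchan_client_init(int domid, int port)
{ (void)domid; (void)port; return &socket_backend; }

/* Placeholder: a real implementation would read this from xenstore. */
static int own_domid(void) { return 5; }

static vchan_t *vchan_client_init_auto(int domid, int port)
{
    /* Loopback: mapping a grant back into the granting domain is the
     * case xen-gntalloc cannot handle, so avoid the Xen backend. */
    if (domid == own_domid())
        return socket_vchan_client_init(domid, port);
    return xen_vchan_client_init(domid, port);
}

int main(void)
{
    printf("peer domid 5 -> %s\n", vchan_client_init_auto(5, 1234)->backend);
    printf("peer domid 7 -> %s\n", vchan_client_init_auto(7, 1234)->backend);
    return 0;
}
```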
Marking as https://github.com/QubesOS/qubes-issues/labels/C%3A%20kernel because this is almost certainly a Linux kernel bug.
Reported by marmarek on 9 Feb 2015 21:15 UTC. Currently the Xen implementation of vchan in R3 crashes when a connection is made back to the source domain. This is apparently not supported by the xen-gntalloc driver.
The exact message is:
Needs either a fix in the kernel, or some special case in the vchan-xen code (use simple shm instead of Xen shared memory?).
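To illustrate the "simple shm instead of Xen shared memory" idea for the loopback case only: since both endpoints live in the same domain, a POSIX shared-memory ring plus an ordinary eventfd could stand in for the granted pages and the event channel. The single-process sketch below uses made-up names and omits error handling (compile with -lrt on older glibc); it is not a proposal for how vchan-xen is actually structured:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/mman.h>
#include <unistd.h>

#define RING_SIZE 4096

/* Toy ring living in POSIX shared memory. */
struct loop_ring {
    unsigned head;          /* producer position */
    unsigned tail;          /* consumer position */
    char data[RING_SIZE];
};

int main(void)
{
    /* Shared ring visible to any process in this (single) domain;
     * replaces the granted pages of the Xen-based vchan. */
    int fd = shm_open("/vchan-loopback-demo", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(struct loop_ring));
    struct loop_ring *ring = mmap(NULL, sizeof(*ring),
                                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* eventfd stands in for the Xen event channel used for wakeups. */
    int evt = eventfd(0, 0);

    /* "send": copy into the ring, then signal the peer. */
    memcpy(ring->data + (ring->head % RING_SIZE), "ping", 4);
    ring->head += 4;
    eventfd_write(evt, 1);

    /* "recv": wait for the signal, then consume from the ring. */
    eventfd_t n;
    eventfd_read(evt, &n);
    printf("got %u bytes: %.4s\n", ring->head - ring->tail, ring->data);
    ring->tail = ring->head;

    shm_unlink("/vchan-loopback-demo");
    munmap(ring, sizeof(*ring));
    close(evt);
    close(fd);
    return 0;
}
```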
Migrated-From: https://wiki.qubes-os.org/ticket/951