marmarek opened this issue 9 years ago (status: Open)
Currently vchan rejects opening such a connection, so at least there is no kernel crash. But it would still be nice to fix this properly.
@marmarek Is this still an issue, considering both Xen and Qubes have had newer releases since then? You say it might need a fix in the kernel; do you mean the Xen microkernel?
@esote Yes, this is still an issue, exactly as originally described. Just checked on Xen 4.12-unstable and Linux 4.14.74. One possible fix would be in the Linux kernel, but I'm not sure that's the right thing to do.
BTW, many thanks @esote for reviewing and cleaning up old issues!
Yep, no problem. I usually don't have time to dive into code outside of college and work, so I figured cleaning up issues is the least I could do.
For patching the Linux kernel, how likely would it be to end up in a long-term release (4.14 or 4.19), or would it be a patch only for Qubes' kernel?
I haven't looked at the vchan code, so I flipped a coin, assigning "kernel" as heads and "vchan" as tails, and it landed on heads. If that helps, because otherwise my input would be essentially a coin flip.
Exact message from 4.14.74 kernel:
[1332916.029255] BUG: unable to handle kernel paging request at ffff880850d7b008
[1332916.029290] IP: __tlb_remove_page_size+0x29/0xc0
[1332916.029306] PGD 2a75067 P4D 2a75067 PUD 0
[1332916.029325] Oops: 0002 [#1] SMP PTI
[1332916.029339] Modules linked in: fuse ip6table_filter ip6_tables xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c xen_netfront intel_rapl crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcspkr intel_rapl_perf u2mfn(O) xen_gntdev xen_gntalloc xenfs xen_blkback xen_privcmd xen_evtchn xen_blkfront
[1332916.029457] CPU: 0 PID: 23450 Comm: strace Tainted: G O 4.14.74-1.pvops.qubes.x86_64 #1
[1332916.029484] task: ffff880033f3bc80 task.stack: ffffc90003688000
[1332916.029506] RIP: 0010:__tlb_remove_page_size+0x29/0xc0
[1332916.029523] RSP: 0018:ffffc9000368bca0 EFLAGS: 00010246
[1332916.029540] RAX: ffff880050d7b000 RBX: ffffc9000368bdd0 RCX: 0000000000000000
[1332916.029563] RDX: 00000000ffffffff RSI: ffffea0002ca7c00 RDI: ffffc9000368bdd0
[1332916.029587] RBP: ffffea0002ca7c00 R08: 00000000000247e0 R09: ffff88009a493898
[1332916.029610] R10: 00000000000fa000 R11: 0000000000000001 R12: 000058c11e2a0000
[1332916.029634] R13: 000058c11e2a1000 R14: ffffc9000368bdd0 R15: ffffea0002ca7c00
[1332916.029658] FS: 0000000000000000(0000) GS:ffff8800f9c00000(0000) knlGS:0000000000000000
[1332916.029682] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1332916.029702] CR2: ffff880850d7b008 CR3: 000000000220a006 CR4: 00000000001606f0
[1332916.029728] Call Trace:
[1332916.029742] unmap_page_range+0x86c/0xc50
[1332916.029757] unmap_vmas+0x4c/0xa0
[1332916.029772] exit_mmap+0xb5/0x1c0
[1332916.029787] mmput+0x5f/0x140
[1332916.029801] do_exit+0x288/0xbb0
[1332916.029816] ? __audit_syscall_entry+0xae/0x100
[1332916.029834] ? syscall_trace_enter+0x1ae/0x2c0
[1332916.029851] do_group_exit+0x3a/0xa0
[1332916.029865] SyS_exit_group+0x10/0x10
[1332916.029879] do_syscall_64+0x74/0x180
[1332916.035058] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[1332916.035077] RIP: 0033:0x7fa9a2c23a26
[1332916.035090] RSP: 002b:00007ffcbaf5dbe8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[1332916.035114] RAX: ffffffffffffffda RBX: 00007fa9a2d16740 RCX: 00007fa9a2c23a26
[1332916.035138] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[1332916.035161] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
[1332916.035185] R10: 00007ffcbaf5da74 R11: 0000000000000246 R12: 00007fa9a2d16740
[1332916.035210] R13: 0000000000000001 R14: 00007fa9a2d1f448 R15: 0000000000000000
[1332916.035234] Code: 00 00 0f 1f 44 00 00 48 83 7f 18 00 74 4a 55 53 39 97 84 00 00 00 75 42 48 8b 47 28 48 89 f5 48 89 fb 8b 50 08 8d 4a 01 89 48 08 <48> 89 74 d0 10 8b 50 0c 39 d1 74 09 31 c0 39 ca 72 21 5b 5d c3
[1332916.035318] RIP: __tlb_remove_page_size+0x29/0xc0 RSP: ffffc9000368bca0
[1332916.035338] CR2: ffff880850d7b008
[1332916.035353] ---[ end trace b871d7772ace7b61 ]---
[1332916.035370] Kernel panic - not syncing: Fatal exception
[1332916.035576] Kernel Offset: disabled
Should we see if we can get this patched upstream? This is clearly a kernel and/or Xen bug.
This issue is being closed because it was reported against Qubes OS 4.0, which has reached end of life (EOL).
If anyone believes that this issue should be reopened and reassigned to an active milestone, please leave a brief comment. (For example, if a bug still affects Qubes OS 4.1, then the comment "Affects 4.1" will suffice.)
This should probably be reopened, given this comment by marmarek. See also my comment below his in that issue for why it's useful to have the ability to copy from and to the same VM, requiring these "loopback" connections. Another case where an issue (fixable, but only with an additional policy) is caused by this can be found here.
The reason for closing was EOL of 4.0, but issue still occurs in 4.1, which is not EOL.
On R4.2 it doesn't crash anymore! This is with kernel 6.1.62-1.qubes.fc37.x86_64.
Update: libvchan_client_init_async does get called. qrexec-client-vm and the fork server both deadlock in xenevtchn_pending, which is ultimately called by libvchan_recv. This is despite libvchan_send having been called. I suspect a kernel or hypervisor bug is preventing the event channel from being signalled.

Nope, event channels aren't the problem. I think (but am not certain) that something related to grants is. I managed to trigger a crash by killing qrexec-fork-server with a loopback vchan open.
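For reference, here is a minimal sketch of the loopback scenario described above, assuming the qubes-core-vchan-xen API (libvchan_server_init / libvchan_client_init taking a domid and port, linked with -lvchan-xen); OWN_DOMID and PORT are placeholders, and the exact behaviour (connect failure, deadlock in recv, or the old-kernel oops on exit) depends on the kernel and Xen versions in use:

```c
/* Hypothetical loopback reproducer, untested; API names per
 * qubes-core-vchan-xen.  Both endpoints are opened against this VM's
 * own domain ID, which is the unsupported case. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>
#include <libvchan.h>

#define OWN_DOMID 5   /* placeholder: this VM's own domid */
#define PORT      1234

int main(void)
{
    /* Server side of the loopback connection, created first so the
     * xenstore entry exists before the client tries to connect. */
    libvchan_t *srv = libvchan_server_init(OWN_DOMID, PORT, 4096, 4096);
    if (!srv) { fprintf(stderr, "server_init failed\n"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {
        /* Client side, connecting back to our own domain.  In the
         * R4.2 behaviour described above, this connects but then
         * blocks forever in libvchan_recv() / xenevtchn_pending(). */
        char buf[4];
        libvchan_t *cli = libvchan_client_init(OWN_DOMID, PORT);
        if (!cli) { fprintf(stderr, "client_init failed\n"); _exit(1); }
        if (libvchan_recv(cli, buf, sizeof(buf)) < 0)
            fprintf(stderr, "recv failed\n");
        libvchan_close(cli);
        _exit(0);
    }

    /* Wait for the client, send one message, then clean up.  Per the
     * comment above, the send completes but the peer is never woken. */
    libvchan_wait(srv);
    libvchan_send(srv, "ping", 4);
    waitpid(pid, NULL, 0);
    libvchan_close(srv);
    return 0;
}
```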
What if loopback connections were automatically handled via libvchan-socket instead of libvchan-xen?
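If that direction were explored, the dispatch could in principle be as simple as comparing the target domid with our own. The following is purely a conceptual sketch, not actual libvchan code: the backend constructors and own_domid() are hypothetical stand-ins (the real libvchan-xen and libvchan-socket libraries both export the same libvchan_* symbols, so they could not simply be linked side by side like this), and the stubs only print which backend would be chosen:

```c
/* Conceptual sketch only; all names below are made up for illustration. */
#include <stdio.h>

typedef struct { const char *backend; } vchan_t;

static vchan_t xen_backend    = { "libvchan-xen (grant/event-channel based)" };
static vchan_t socket_backend = { "libvchan-socket (UNIX socket based)" };

static vchan_t *xen_vchan_client_init(int domid, int port)
{ (void)domid; (void)port; return &xen_backend; }

static vchan_t *socket_vchan_client_init(int domid, int port)
{ (void)domid; (void)port; return &socket_backend; }

/* Placeholder: a real implementation would read this from xenstore. */
static int own_domid(void) { return 5; }

static vchan_t *vchan_client_init_auto(int domid, int port)
{
    /* Loopback: mapping a grant back into the granting domain is the
     * case xen-gntalloc cannot handle, so avoid the Xen backend. */
    if (domid == own_domid())
        return socket_vchan_client_init(domid, port);
    return xen_vchan_client_init(domid, port);
}

int main(void)
{
    printf("peer domid 5 -> %s\n", vchan_client_init_auto(5, 1234)->backend);
    printf("peer domid 7 -> %s\n", vchan_client_init_auto(7, 1234)->backend);
    return 0;
}
```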
Marking as https://github.com/QubesOS/qubes-issues/labels/C%3A%20kernel because this is almost certainly a Linux kernel bug.
Reported by marmarek on 9 Feb 2015 21:15 UTC. Currently the Xen implementation of vchan in R3 crashes when a connection is made back to the source domain. This is apparently not supported by the xen-gntalloc driver.
The exact message is:
Needs either a fix in the kernel, or some special case in the vchan-xen code (use simple shm instead of Xen shared memory?).
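To illustrate the "simple shm instead of Xen shared memory" idea for the loopback case only: since both endpoints live in the same domain, a POSIX shared-memory ring plus an ordinary eventfd could stand in for the granted pages and the event channel. The single-process sketch below uses made-up names and omits error handling (compile with -lrt on older glibc); it is not a proposal for how vchan-xen is actually structured:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/mman.h>
#include <unistd.h>

#define RING_SIZE 4096

/* Toy ring living in POSIX shared memory. */
struct loop_ring {
    unsigned head;          /* producer position */
    unsigned tail;          /* consumer position */
    char data[RING_SIZE];
};

int main(void)
{
    /* Shared ring visible to any process in this (single) domain;
     * replaces the granted pages of the Xen-based vchan. */
    int fd = shm_open("/vchan-loopback-demo", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(struct loop_ring));
    struct loop_ring *ring = mmap(NULL, sizeof(*ring),
                                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* eventfd stands in for the Xen event channel used for wakeups. */
    int evt = eventfd(0, 0);

    /* "send": copy into the ring, then signal the peer. */
    memcpy(ring->data + (ring->head % RING_SIZE), "ping", 4);
    ring->head += 4;
    eventfd_write(evt, 1);

    /* "recv": wait for the signal, then consume from the ring. */
    eventfd_t n;
    eventfd_read(evt, &n);
    printf("got %u bytes: %.4s\n", ring->head - ring->tail, ring->data);
    ring->tail = ring->head;

    shm_unlink("/vchan-loopback-demo");
    munmap(ring, sizeof(*ring));
    close(evt);
    close(fd);
    return 0;
}
```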
Migrated-From: https://wiki.qubes-os.org/ticket/951