KVM-VMI / kvm-vmi

KVM-based Virtual Machine Introspection
https://kvm-vmi.github.io/kvm-vmi/master/
301 stars 61 forks source link

Linux 4.9: unable to handle kernel NULL pointer dereference #5

Closed Wenzel closed 7 years ago

Wenzel commented 7 years ago

Currently, the branch linux-vmi is based on the latest stable linux kernel v4.9.

However running nitro on this kernel causes the kernel to crash because of a kernel NULL pointer dereference, triggered at the end of the execution.

The stack trace shown in dmesg is the following:

[  746.252527] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
[  746.254193] IP: [<ffffffff885f7aeb>] down_write+0x1b/0x40
[  746.255588] PGD 0 
[  746.256013] Oops: 0002 [#1] SMP
[  746.256013] Modules linked in: kvm_intel(OE) kvm(OE) irqbypass tun rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cirrus aesni_intel ttm aes_x86_64 lrw gf128mul snd_pcm glue_helper ablk_helper cryptd drm_kms_helper snd_timer snd ppdev soundcore pcspkr evdev drm acpi_cpufreq serio_raw virtio_balloon parport_pc parport tpm_tis tpm_tis_core tpm button i2c_piix4 autofs4 ext4 crc16 jbd2 fscrypto mbcache dm_mod ata_generic virtio_net virtio_blk ata_piix libata uhci_hcd floppy virtio_pci
[  746.256013]  virtio_ring ehci_hcd crc32c_intel psmouse virtio usbcore scsi_mod [last unloaded: irqbypass]
[  746.256013] CPU: 0 PID: 1685 Comm: python3 Tainted: G           OE   4.9.0-nitro+ #1
[  746.256013] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
[  746.256013] task: ffff973ff64900c0 task.stack: ffffbcca8098c000
[  746.256013] RIP: 0010:[<ffffffff885f7aeb>]  [<ffffffff885f7aeb>] down_write+0x1b/0x40
[  746.256013] RSP: 0018:ffffbcca8098fdf8  EFLAGS: 00010286
[  746.256013] RAX: 00000000000000a8 RBX: 00000000000000a8 RCX: 0000000000000001
[  746.256013] RDX: ffffffff00000001 RSI: ffff973ff6426d00 RDI: 00000000000000a8
[  746.256013] RBP: ffff973ff88b7cc0 R08: 0000000000000000 R09: 0000000000000000
[  746.256013] R10: ffff973ff6426d10 R11: 0000000000000000 R12: ffff973ff88b7d60
[  746.256013] R13: ffff973ff88b7cc0 R14: ffff973f485f9900 R15: ffff973ffa0cd490
[  746.256013] FS:  00007fa658c20700(0000) GS:ffff973fffc00000(0000) knlGS:0000000000000000
[  746.256013] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  746.256013] CR2: 00000000000000a8 CR3: 0000000126407000 CR4: 00000000001426f0
[  746.256013] Stack:
[  746.256013]  ffff973ff88b7d18 ffffffff8828b294 ffff973ff9c10040 0000000000000008
[  746.256013]  ffff973ffa0cd490 ffff973ffaae9c60 ffff973f485f9900 ffffffffc08f2279
[  746.256013]  ffff973ff6426d00 ffffffff8820535d ffff973ff6426d10 ffff973ff64900c0
[  746.256013] Call Trace:
[  746.256013]  [<ffffffff8828b294>] ? debugfs_remove_recursive+0x44/0x190
[  746.256013]  [<ffffffffc08f2279>] ? kvm_vcpu_release+0x19/0x30 [kvm]
[  746.256013]  [<ffffffff8820535d>] ? __fput+0xcd/0x1e0
[  746.256013]  [<ffffffff88095aa2>] ? task_work_run+0x72/0x90
[  746.256013]  [<ffffffff8807bd65>] ? do_exit+0x395/0xb50
[  746.256013]  [<ffffffff8807c599>] ? do_group_exit+0x39/0xb0
[  746.256013]  [<ffffffff8807c620>] ? SyS_exit_group+0x10/0x10
[  746.256013]  [<ffffffff885fa1bb>] ? entry_SYSCALL_64_fastpath+0x1e/0xad
[  746.256013] Code: 01 74 08 48 c7 43 20 01 00 00 00 5b c3 0f 1f 00 0f 1f 44 00 00 53 48 89 fb e8 62 df ff ff 48 ba 01 00 00 00 ff ff ff ff 48 89 d8 <f0> 48 0f c1 10 85 d2 74 05 e8 47 07 d4 ff 65 48 8b 04 25 40 d4 
[  746.256013] RIP  [<ffffffff885f7aeb>] down_write+0x1b/0x40
[  746.256013]  RSP <ffffbcca8098fdf8>
[  746.256013] CR2: 00000000000000a8
[  746.256013] ---[ end trace a9571e43f9635c6a ]---
[  746.256013] Fixing recursive fault but reboot is needed!

The problem comes from the new KVM_DEBUG_FS feature, which was absent of the latest stable kernel for nitro (v4.5.x)

This is the debugfs_remove_recursive function:

647 void debugfs_remove_recursive(struct dentry *dentry)
648 {
649         struct dentry *child, *parent;
650 
651         if (IS_ERR_OR_NULL(dentry))
652                 return;
653 
654         parent = dentry;
655  down:
656         inode_lock(d_inode(parent));
657  loop:
658         /*
659          * The parent->d_subdirs is protected by the d_lock. Outside that
660          * lock, the child can be unlinked and set to be freed which can
661          * use the d_u.d_child as the rcu head and corrupt this list.
662          */
663         spin_lock(&parent->d_lock);
664         list_for_each_entry(child, &parent->d_subdirs, d_child) {
665                 if (!simple_positive(child))
666                         continue;
667 
668                 /* perhaps simple_empty(child) makes more sense */
669                 if (!list_empty(&child->d_subdirs)) {
670                         spin_unlock(&parent->d_lock);
671                         inode_unlock(d_inode(parent));
672                         parent = child;
673                         goto down;
674                 }
675 
676                 spin_unlock(&parent->d_lock);
677 
678                 if (!__debugfs_remove(child, parent))
679                         simple_release_fs(&debugfs_mount, &debugfs_mount_count);
680 
681                 /*
682                  * The parent->d_lock protects agaist child from unlinking
683                  * from d_subdirs. When releasing the parent->d_lock we can
684                  * no longer trust that the next pointer is valid.
685                  * Restart the loop. We'll skip this one with the
686                  * simple_positive() check.
687                  */
688                 goto loop;
689         }
690         spin_unlock(&parent->d_lock);
691 
692         inode_unlock(d_inode(parent));
693         child = parent;
694         parent = parent->d_parent;
695         inode_lock(d_inode(parent));
696 
697         if (child != dentry)
698                 /* go up */
699                 goto loop;
700 
701         if (!__debugfs_remove(child, parent))
702                 simple_release_fs(&debugfs_mount, &debugfs_mount_count);
703         inode_unlock(d_inode(parent));
704 
705         synchronize_srcu(&debugfs_srcu);
706 }
707 EXPORT_SYMBOL_GPL(debugfs_remove_recursive);

And apparently, during a call to inode_lock, it makes a call to down_write with a bad pointer

739 static inline void inode_lock(struct inode *inode)
740 {
741         down_write(&inode->i_rwsem);
742 }

Of course, this problem appears only at the end of a nitro run, be it when main.py or test_nitro.py calls their at_exit cleanup.

One thing that troubles me is that i check my kernel configuration, and the symbol KVM_DEBUG_FS was not enabled N...

Any ideas ?

Wenzel commented 7 years ago

So i found a solution by disabling every call to debugfs in kvm/virt/kvm_main.c

For an unknown reason, the config option CONFIG_KVM_DEBUG_FS is available in the kernel, but never used anywhere in the code. the call to debugfs_* function in the kvm code are mandatory.

The solution was to make these calls optional. See this commit which fixes the problem: really disable KVM debug fs if CONFIG_KVM_DEBUG_FS is unset