Open chavafg opened 6 years ago
I can reproduce this issue in Clear Linux as well using kernel linux-kvm-4.13.1
. The vm hangs until you do hit enter by serial console.
[ 8243.712008] INFO: rcu_sched self-detected stall on CPU
[ 8243.714006] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 8243.714012] 2-...: (1 GPs behind) idle=e1e/140000000000000/0 softirq=416776/416777 fqs=27
[ 8243.714015] 3-...: (59999 ticks this GP) idle=fb6/140000000000001/0 softirq=365003/365003 fqs=27
[ 8243.714015] (detected by 0, t=60002 jiffies, g=198525, c=198524, q=13)
[ 8243.714020] Sending NMI from CPU 0 to CPUs 2:
[ 8243.714171] NMI backtrace for cpu 2
[ 8243.714175] CPU: 2 PID: 23292 Comm: qemu-lite-syste Not tainted 4.13.1-246.kvm #2
[ 8243.714176] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
[ 8243.714177] task: ffffa3882ff87180 task.stack: ffffa60684398000
[ 8243.714185] RIP: 0010:vmx_vcpu_run+0x1f9/0x420
[ 8243.714186] RSP: 0018:ffffa6068439bcf8 EFLAGS: 00000002
[ 8243.714187] RAX: 0000000000000001 RBX: 0000000000000246 RCX: 0000000000000000
[ 8243.714188] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88007c8c8168
[ 8243.714189] RBP: ffff88007cbbbdf8 R08: ffff88007fc152a8 R09: 0000000000000000
[ 8243.714190] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007c860280
[ 8243.714191] R13: ffffffff81854610 R14: ffffffff81989828 R15: ffff88007c8c8000
[ 8243.714192] FS: 00007f635b111700(0000) GS:ffffa38839d00000(0000) knlGS:0000000000000000
[ 8243.714193] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8243.714194] CR2: 00007f0596337e90 CR3: 00000001aeb70000 CR4: 00000000001426a0
[ 8243.714199] Call Trace:
[ 8243.714206] ? vcpu_enter_guest+0x4d2/0x1090
[ 8243.714212] ? __schedule+0x255/0x7a0
[ 8243.714214] ? vmx_deliver_posted_interrupt+0xe7/0x100
[ 8243.714217] ? kvm_arch_vcpu_ioctl_run+0xe9/0x3f0
[ 8243.714220] ? kvm_vcpu_ioctl+0x2a7/0x4f0
[ 8243.714225] ? do_futex+0x2e9/0x4b0
[ 8243.714229] ? do_vfs_ioctl+0x8d/0x5b0
[ 8243.714231] ? kvm_on_user_return+0x6a/0xa0
[ 8243.714233] ? SyS_ioctl+0x74/0x80
[ 8243.714236] ? entry_SYSCALL_64_fastpath+0x1a/0xa5
[ 8243.714237] Code: a1 a0 02 00 00 4c 8b a9 a8 02 00 00 4c 8b b1 b0 02 00 00 4c 8b b9 b8 02 00 00 48 8b 89 48 02 00 00 75 05 0f 01 c2 eb 03 0f 01 c3 <48> 89 4c 24 08 59 48 89 81 40 02 00 00 48 89 99 58 02 00 00 8f
[ 8243.715029] Sending NMI from CPU 0 to CPUs 3:
[ 8243.716035] NMI backtrace for cpu 3
[ 8243.716036] CPU: 3 PID: 32 Comm: ksmd Not tainted 4.13.1-246.kvm #2
[ 8243.716037] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
[ 8243.716038] task: ffffa388309d50c0 task.stack: ffffa60680118000
[ 8243.716038] RIP: 0010:io_serial_out+0x11/0x20
[ 8243.716039] RSP: 0018:ffffa38839d83c70 EFLAGS: 00000002
[ 8243.716040] RAX: 0000000000000034 RBX: ffffffffb1df4d00 RCX: 0000000000000000
[ 8243.716041] RDX: 00000000000003f8 RSI: 0000000000000000 RDI: ffffffffb1df4d00
[ 8243.716041] RBP: ffffa38839d83c88 R08: 00000000000ae516 R09: 0000000000000004
[ 8243.716042] R10: 0000000000000000 R11: ffffffffb1da190d R12: 0000000000000034
[ 8243.716042] R13: ffffffffb1da193e R14: ffffffffb1da1905 R15: ffffffffb1df4d00
[ 8243.716043] FS: 0000000000000000(0000) GS:ffffa38839d80000(0000) knlGS:0000000000000000
[ 8243.716043] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8243.716044] CR2: 00007f91e7adb490 CR3: 0000000107ca5000 CR4: 00000000001426a0
[ 8243.716044] Call Trace:
[ 8243.716045] <IRQ>
[ 8243.716045] ? serial8250_console_putchar+0x22/0x30
[ 8243.716046] ? wait_for_xmitr+0x90/0x90
[ 8243.716046] uart_console_write+0x43/0x60
[ 8243.716046] serial8250_console_write+0xeb/0x260
[ 8243.716047] ? msg_print_text+0x76/0x100
[ 8243.716047] univ8250_console_write+0x23/0x30
[ 8243.716048] console_unlock+0x2f0/0x450
[ 8243.716048] vprintk_emit+0x2f5/0x350
[ 8243.716048] vprintk_default+0x1a/0x20
[ 8243.716049] vprintk_func+0x22/0x60
[ 8243.716049] printk+0x43/0x45
[ 8243.716050] rcu_check_callbacks+0x466/0x840
[ 8243.716050] ? account_system_index_time+0x5e/0x70
[ 8243.716050] ? account_system_time+0x4b/0x60
[ 8243.716051] update_process_times+0x2a/0x50
[ 8243.716051] tick_sched_handle.isra.5+0x30/0x40
[ 8243.716052] tick_sched_timer+0x39/0x80
[ 8243.716052] __hrtimer_run_queues+0xaf/0x110
[ 8243.716052] hrtimer_interrupt+0x9b/0x1c0
[ 8243.716053] smp_apic_timer_interrupt+0x5e/0x90
[ 8243.716053] apic_timer_interrupt+0x8e/0xa0
[ 8243.716054] RIP: 0010:queued_spin_lock_slowpath+0x1b/0x190
[ 8243.716054] RSP: 0018:ffffa6068011bcd0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
[ 8243.716055] RAX: 0000000000000001 RBX: ffffa38707c190f0 RCX: 0000000000000000
[ 8243.716056] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa38707c10000
[ 8243.716056] RBP: ffffa6068011bcd8 R08: 0000000000000000 R09: ffffa38727953010
[ 8243.716057] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa38707c10000
[ 8243.716057] R13: ffffa38707c19140 R14: 0000000000000000 R15: 00007f63404b9000
[ 8243.716058] </IRQ>
[ 8243.716058] ? _raw_spin_lock+0x1b/0x20
[ 8243.716058] kvm_mmu_notifier_invalidate_range_start+0x39/0x80
[ 8243.716059] __mmu_notifier_invalidate_range_start+0x50/0x80
[ 8243.716059] try_to_merge_one_page+0x53f/0x750
[ 8243.716060] ? uncharge_list+0x10f/0x120
[ 8243.716060] try_to_merge_with_ksm_page+0x49/0x90
[ 8243.716061] ksm_do_scan+0x673/0xe10
[ 8243.716061] ksm_scan_thread+0x196/0x1d0
[ 8243.716062] ? wait_woken+0x80/0x80
[ 8243.716062] kthread+0x11a/0x130
[ 8243.716062] ? ksm_do_scan+0xe10/0xe10
[ 8243.716063] ? kthread_create_on_node+0x40/0x40
[ 8243.716064] ret_from_fork+0x25/0x30
[ 8243.716064] Code: 00 8b 57 08 d3 e6 01 f2 ec 0f b6 c0 c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f b6 8f a1 00 00 00 89 d0 8b 57 08 d3 e6 01 f2 ee <c3> 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f b6 87 a2 00 00
[ 8243.716078] rcu_sched kthread starved for 59846 jiffies! g198525 c198524 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0
[ 8243.716080] rcu_sched R running task 0 8 2 0x00000000
[ 8243.716082] Call Trace:
[ 8243.716090] __schedule+0x24d/0x7a0
[ 8243.716093] schedule+0x31/0x80
[ 8243.716095] schedule_timeout+0x14e/0x290
[ 8243.716098] ? del_timer_sync+0x40/0x40
[ 8243.716100] rcu_gp_kthread+0x52c/0x8d0
[ 8243.716104] kthread+0x11a/0x130
[ 8243.716106] ? note_gp_changes+0xa0/0xa0
[ 8243.716109] ? kthread_create_on_node+0x40/0x40
[ 8243.716111] ret_from_fork+0x25/0x30
[ 8243.726944] 3-...: (59999 ticks this GP) idle=fb6/140000000000001/0 softirq=365003/365003 fqs=28
[ 8243.728575] (t=60016 jiffies g=198525 c=198524 q=705)
[ 8243.729534] NMI backtrace for cpu 3
[ 8243.729538] CPU: 3 PID: 32 Comm: ksmd Not tainted 4.13.1-246.kvm #2
[ 8243.729540] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
[ 8243.729541] Call Trace:
[ 8243.729543] <IRQ>
[ 8243.729550] dump_stack+0x63/0x82
[ 8243.729554] nmi_cpu_backtrace+0xbe/0xc0
[ 8243.729571] ? irq_force_complete_move+0xe0/0xe0
[ 8243.729574] nmi_trigger_cpumask_backtrace+0x91/0xc0
[ 8243.729577] arch_trigger_cpumask_backtrace+0x14/0x20
[ 8243.729581] rcu_dump_cpu_stacks+0x9f/0xd8
[ 8243.729583] rcu_check_callbacks+0x6ed/0x840
[ 8243.729588] ? account_system_index_time+0x5e/0x70
[ 8243.729590] ? account_system_time+0x4b/0x60
[ 8243.729594] update_process_times+0x2a/0x50
[ 8243.729598] tick_sched_handle.isra.5+0x30/0x40
[ 8243.729600] tick_sched_timer+0x39/0x80
[ 8243.729604] __hrtimer_run_queues+0xaf/0x110
[ 8243.729607] hrtimer_interrupt+0x9b/0x1c0
[ 8243.729612] smp_apic_timer_interrupt+0x5e/0x90
[ 8243.729616] apic_timer_interrupt+0x8e/0xa0
[ 8243.729621] RIP: 0010:queued_spin_lock_slowpath+0x1b/0x190
[ 8243.729623] RSP: 0018:ffffa6068011bcd0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
[ 8243.729625] RAX: 0000000000000001 RBX: ffffa38707c190f0 RCX: 0000000000000000
[ 8243.729627] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa38707c10000
[ 8243.729628] RBP: ffffa6068011bcd8 R08: 0000000000000000 R09: ffffa38727953010
[ 8243.729629] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa38707c10000
[ 8243.729630] R13: ffffa38707c19140 R14: 0000000000000000 R15: 00007f63404b9000
[ 8243.729632] </IRQ>
[ 8243.729635] ? _raw_spin_lock+0x1b/0x20
[ 8243.729640] kvm_mmu_notifier_invalidate_range_start+0x39/0x80
[ 8243.729646] __mmu_notifier_invalidate_range_start+0x50/0x80
[ 8243.729648] try_to_merge_one_page+0x53f/0x750
[ 8243.729652] ? uncharge_list+0x10f/0x120
[ 8243.729654] try_to_merge_with_ksm_page+0x49/0x90
[ 8243.729656] ksm_do_scan+0x673/0xe10
[ 8243.729658] ksm_scan_thread+0x196/0x1d0
[ 8243.729662] ? wait_woken+0x80/0x80
[ 8243.729668] kthread+0x11a/0x130
[ 8243.729670] ? ksm_do_scan+0xe10/0xe10
[ 8243.729673] ? kthread_create_on_node+0x40/0x40
[ 8243.729676] ret_from_fork+0x25/0x30
Can we disable KSM and try this test again please.
@sameo @chavafg cc-proxy -ksm off
@chavafg @jcvenegas next time that happens can you also do the following
echo 1 >/sys/kernel/debug/tracing/events/kvm/enable
cat /sys/kernel/debug/tracing/trace_pipe
dump the pipe a few times
Also dump dmesg on the host
I will let the tests running again during the night with ksm off, will let you know what happens.
Ran the tests with ksm off
in a loop 100 times and the issue was not reproducible.
Also, seems like this only happens on a nested virtualization environment. I ran the tests also in Baremetal and couldn't reproduce even with ksm enabled.
Description of problem
After executing docker integration tests from https://github.com/clearcontainers/tests for around 2 hours on a Fedora 26 VM, I got kernel errors and the VM seems to be hanged, although the same errors (which appear below) keep being posted on the console.
Expected result
There shouldn't be any kernel error. The VM should not hang.
Actual result
There are some errors related to
ksm
andqemu-lite
Logs from
cc-collect-data.sh
collected_data.logRecent errors on cc-proxy journal log: