ememos / GiantVM

Kernel Oops report #13

Open YWHyuk opened 2 years ago

YWHyuk commented 2 years ago

When starting GiantVM, the two nodes failed to connect to each other and I got a kernel Oops on both hosts.

dell-3:~/Tutorial/guest_image$ ./run_gvm.sh -c 8 -m 8192 -s 4 -l 4 -i "10.10.20.53 10.10.20.54"
CPU Info
Total: 8
Local: 4 [4-7]
Remote: 4[ 0 1 2 3 ]
KVM API version[12], QEMU version[12]
WARNING: Image format was not specified for 'user-data.img' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
CPU 0 is remote CPU, pause
CPU 1 is remote CPU, pause
CPU 2 is remote CPU, pause
CPU 3 is remote CPU, pause
start kvm dsm server, total memory size: 8589934592
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
error: msi send vector in range 0-15
error: cannot find current apic
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
QEMU nums: 2, Total CPU nums: 8, CPU per QEMU: 4
connect_io_router...
QEMU 0 wait for RDMA connection on 10.10.20.54:40004
RDMA ERROR: Error: could not rdma_bind_addr!
RDMA init fail on 10.10.20.54:40004
RDMA ERROR: Error: could not rdma_bind_addr!
RDMA init fail on 10.10.20.54:40005
QEMU 0 wait for RDMA connection on 10.10.20.54:40005
source_resolve_host RDMA Device opened: kernel name mlx5_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx5_0, transport: (1) Infiniband
rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect: No such file or directory
RDMA ERROR: connecting to destination!
RDMA connect to 10.10.20.53:40002 fail, retrying...
source_resolve_host RDMA Device opened: kernel name mlx5_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx5_0, transport: (1) Infiniband
rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect: No such file or directory
RDMA ERROR: connecting to destination!
RDMA connect to 10.10.20.53:40002 fail, retrying...
source_resolve_host RDMA Device opened: kernel name mlx5_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx5_0, transport: (1) Infiniband
rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect: No such file or directory
RDMA ERROR: connecting to destination!
RDMA connect to 10.10.20.53:40002 fail, retrying...
source_resolve_host RDMA Device opened: kernel name mlx5_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx5_0, transport: (1) Infiniband
dell-4:~/Tutorial/guest_image$ ./run_gvm.sh -c 8 -m 8192 -s 0 -l 4 -i "10.10.20.53 10.10.20.54"
qemu-system-x86_64: -redir tcp:5556::22: The -redir option is deprecated. Please use '-netdev user,hostfwd=...' instead.
CPU Info
Total: 8
Local: 4 [0-3]
Remote: 4[ 4 5 6 7 ]
KVM API version[12], QEMU version[12]
WARNING: Image format was not specified for 'user-data.img' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
CPU 4 is remote CPU, pause
CPU 5 is remote CPU, pause
CPU 6 is remote CPU, pause
CPU 7 is remote CPU, pause
start kvm dsm server, total memory size: 8589934592
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
error: msi send vector in range 0-15
error: cannot find current apic
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
QEMU nums: 2, Total CPU nums: 8, CPU per QEMU: 4
connect_io_router...
QEMU 1 wait for RDMA connection on 10.10.20.53:40002
RDMA ERROR: Error: could not rdma_bind_addr!
RDMA init fail on 10.10.20.53:40002
RDMA ERROR: Error: could not rdma_bind_addr!
RDMA init fail on 10.10.20.53:40003
QEMU 1 wait for RDMA connection on 10.10.20.53:40003
source_resolve_host RDMA Device opened: kernel name mlx5_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx5_0, transport: (1) Infiniband
rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect: No such file or directory
RDMA ERROR: connecting to destination!
RDMA connect to 10.10.20.54:40004 fail, retrying...
source_resolve_host RDMA Device opened: kernel name mlx5_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx5_0, transport: (1) Infiniband
rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect: No such file or directory
RDMA ERROR: connecting to destination!
RDMA connect to 10.10.20.54:40004 fail, retrying...
source_resolve_host RDMA Device opened: kernel name mlx5_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx5_0, transport: (1) Infiniband
dell-3 $ dmesg
[13710.960715] kvm_dsm_init: Enable kvm dsm mode, this kvm instance will be node-1
[13710.960717] kvm_dsm_init: kvm_dsm_init: kvm 1 use RDMA connection
[13710.960834] rdma_bind_addr failed, ret -99
[13803.984311] kvm-dsm: node-1 stopping dsm server
[13803.984319] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[13804.033143] PGD 0 P4D 0
[13804.081451] Oops: 0002 [#1] SMP PTI
[13804.129920] CPU: 6 PID: 2229 Comm: qemu-vcpu/4 Not tainted 4.18.20-gvm1 #1
[13804.179828] Hardware name: Dell Inc. PowerEdge R530/0CN7X8, BIOS 2.7.1 01/26/2018
[13804.231376] RIP: 0010:wait_for_completion+0x8c/0x150
[13804.283866] Code: 00 48 c7 44 24 10 b0 45 eb 86 c7 04 24 01 00 00 00 49 89 54 24 18 48 bd ff ff ff ff ff ff ff 7f 48 89 4c 24 18 48 89 44 24 20 <48> 89 10 eb 05 48 85 ed 74 3d 65 48 8b 04 25 00 5c 01 00 48 89 df
[13804.400531] RSP: 0018:ffffa8164950bbc0 EFLAGS: 00010046
[13804.460394] RAX: 0000000000000000 RBX: ffff96a955673d00 RCX: ffff96a955673d08
[13804.522268] RDX: ffffa8164950bbd8 RSI: 0000000000000246 RDI: ffff96a955673d00
[13804.585257] RBP: 7fffffffffffffff R08: ffff969ec0a9cb00 R09: 0000000000000004
[13804.649471] R10: 0000000000000000 R11: 0000000000000001 R12: ffff96a955673cf8
[13804.714336] R13: ffff96a108db1540 R14: ffffa816494d19a0 R15: ffffa816494d19a0
[13804.779665] FS:  00007fbb8ddf9700(0000) GS:ffff96a15fcc0000(0000) knlGS:0000000000000000
[13804.846461] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13804.913393] CR2: 0000000000000000 CR3: 000000004780a002 CR4: 00000000001626e0
[13804.981141] Call Trace:
[13805.047998]  ? wake_up_q+0x70/0x70
[13805.114769]  kthread_stop+0x42/0xf0
[13805.180976]  kvm_dsm_free+0xf7/0x150 [kvm]
[13805.247716]  kvm_arch_destroy_vm+0x148/0x1a0 [kvm]
[13805.315126]  kvm_put_kvm+0x146/0x250 [kvm]
[13805.382700]  kvm_vm_release+0x1d/0x30 [kvm]
[13805.450205]  __fput+0xd8/0x210
[13805.518145]  task_work_run+0x8a/0xb0
[13805.586954]  do_exit+0x2e0/0xb30
[13805.656252]  ? get_futex_key+0x2ed/0x3d0
[13805.726392]  do_group_exit+0x3a/0xa0
[13805.796996]  get_signal+0x27a/0x5b0
[13805.867358]  do_signal+0x36/0x6d0
[13805.937525]  ? do_sigtimedwait+0xc6/0x230
[13806.008676]  exit_to_usermode_loop+0x89/0xf0
[13806.080775]  do_syscall_64+0xf3/0x110
[13806.153527]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[13806.226495] RIP: 0033:0x7fbb96324ad3
[13806.298945] Code: Bad RIP value.
[13806.371863] RSP: 002b:00007fbb8ddf8ac0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[13806.447905] RAX: fffffffffffffe00 RBX: 0000557f16ffebf0 RCX: 00007fbb96324ad3
[13806.525252] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000557f16ffec18
[13806.603414] RBP: 0000557f16ffec14 R08: 0000000000000000 R09: 0000000000000000
[13806.682121] R10: 0000000000000000 R11: 0000000000000246 R12: 0000557f16ffec18
[13806.760447] R13: 0000000000000000 R14: 0000557f15016240 R15: 0000000000000008
[13806.837714] Modules linked in: ib_umad ib_ipoib rpcrdma sunrpc rdma_ucm intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass intel_cstate ipmi_si mei_me ipmi_devintf dcdbas intel_rapl_perf input_leds lpc_ich mei ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel ib_iser rdma_cm configfs iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd hid_generic cryptd mlx5_core usbhid mxm_wmi glue_helper ahci mlxfw hid tg3 megaraid_sas devlink
[13807.362582]  libahci wmi
[13807.454423] CR2: 0000000000000000
[13807.546754] ---[ end trace 2029a70eb51272bd ]---
[13807.685127] RIP: 0010:wait_for_completion+0x8c/0x150
[13807.778873] Code: 00 48 c7 44 24 10 b0 45 eb 86 c7 04 24 01 00 00 00 49 89 54 24 18 48 bd ff ff ff ff ff ff ff 7f 48 89 4c 24 18 48 89 44 24 20 <48> 89 10 eb 05 48 85 ed 74 3d 65 48 8b 04 25 00 5c 01 00 48 89 df
[13807.977468] RSP: 0018:ffffa8164950bbc0 EFLAGS: 00010046
[13808.076352] RAX: 0000000000000000 RBX: ffff96a955673d00 RCX: ffff96a955673d08
[13808.177519] RDX: ffffa8164950bbd8 RSI: 0000000000000246 RDI: ffff96a955673d00
[13808.279722] RBP: 7fffffffffffffff R08: ffff969ec0a9cb00 R09: 0000000000000004
[13808.383363] R10: 0000000000000000 R11: 0000000000000001 R12: ffff96a955673cf8
[13808.488500] R13: ffff96a108db1540 R14: ffffa816494d19a0 R15: ffffa816494d19a0
[13808.594944] FS:  00007fbb8ddf9700(0000) GS:ffff96a15fcc0000(0000) knlGS:0000000000000000
[13808.703752] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13808.812343] CR2: 00007fbb96324aa9 CR3: 000000004780a002 CR4: 00000000001626e0
[13808.920989] Fixing recursive fault but reboot is needed!
dell-4 $ dmesg
[13139.622137] kvm_dsm_init: Enable kvm dsm mode, this kvm instance will be node-0
[13139.622142] kvm_dsm_init: kvm_dsm_init: kvm 0 use RDMA connection
[13139.622249] rdma_bind_addr failed, ret -99
[13485.424318] kvm-dsm: node-0 stopping dsm server
[13485.424328] general protection fault: 0000 [#1] SMP PTI
[13485.490123] CPU: 9 PID: 15591 Comm: qemu-vcpu/1 Not tainted 4.18.20-gvm1 #1
[13485.558142] Hardware name: Dell Inc. PowerEdge R530/0CN7X8, BIOS 2.7.1 01/26/2018
[13485.628262] RIP: 0010:native_queued_spin_lock_slowpath+0x174/0x1c0
[13485.699911] Code: ff 0f 84 e6 fe ff ff e9 1c ff ff ff c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 04 48 63 f6 48 05 00 39 02 00 48 03 04 f5 00 57 96 9d <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32
[13485.855008] RSP: 0018:ffffac1ee8f3bbb0 EFLAGS: 00010002
[13485.934340] RAX: 655f636970639864 RBX: ffff8f2134f5a5e0 RCX: 0000000000280000
[13486.014489] RDX: ffff8f217fd23900 RSI: 0000000000000d3b RDI: ffff8f2134f5a5e0
[13486.094801] RBP: ffff8f2134f5a5a0 R08: ffff8f19493d4b00 R09: 0000000000000004
[13486.175357] R10: ffff8f18ffb58000 R11: 0000000000000001 R12: ffff8f2134f5a5d8
[13486.257301] R13: ffff8f1956703300 R14: ffffac1ee8f219a0 R15: ffffac1ee8f219a0
[13486.339979] FS:  00007fefe3fff700(0000) GS:ffff8f217fd00000(0000) knlGS:0000000000000000
[13486.423255] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13486.505578] CR2: 00007fefe8b139e0 CR3: 0000000f19e0a005 CR4: 00000000001626e0
[13486.588065] Call Trace:
[13486.668682]  _raw_spin_lock_irq+0x24/0x27
[13486.748686]  wait_for_completion+0x32/0x150
[13486.828076]  kthread_stop+0x42/0xf0
[13486.906720]  kvm_dsm_free+0xf7/0x150 [kvm]
[13486.985207]  kvm_arch_destroy_vm+0x148/0x1a0 [kvm]
[13487.063958]  kvm_put_kvm+0x146/0x250 [kvm]
[13487.142681]  kvm_vm_release+0x1d/0x30 [kvm]
[13487.221281]  __fput+0xd8/0x210
[13487.299228]  task_work_run+0x8a/0xb0
[13487.376875]  do_exit+0x2e0/0xb30
[13487.454414]  ? get_futex_key+0x2ed/0x3d0
[13487.531897]  do_group_exit+0x3a/0xa0
[13487.608484]  get_signal+0x27a/0x5b0
[13487.685019]  do_signal+0x36/0x6d0
[13487.761178]  ? do_sigtimedwait+0xc6/0x230
[13487.836909]  exit_to_usermode_loop+0x89/0xf0
[13487.912082]  do_syscall_64+0xf3/0x110
[13487.986656]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[13488.061447] RIP: 0033:0x7fefeb028ad3
[13488.135665] Code: Bad RIP value.
[13488.209347] RSP: 002b:00007fefe3ffeac0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[13488.284627] RAX: fffffffffffffe00 RBX: 000056402a720310 RCX: 00007fefeb028ad3
[13488.359882] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000056402a720338
[13488.435598] RBP: 000056402a720334 R08: 0000000000000000 R09: 0000000000000000
[13488.511586] R10: 0000000000000000 R11: 0000000000000246 R12: 000056402a720338
[13488.587710] R13: 0000000000000000 R14: 0000564029e34240 R15: 0000000000000008
[13488.663808] Modules linked in: cpuid ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs rpcrdma sunrpc rdma_ucm ib_umad ib_ipoib intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass intel_cstate mei_me dcdbas intel_rapl_perf input_leds lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel ib_iser rdma_cm configfs iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc hid_generic aesni_intel usbhid aes_x86_64 crypto_simd cryptd mlx5_core mxm_wmi
[13489.195165]  glue_helper ahci tg3 hid mlxfw megaraid_sas devlink libahci wmi
[13489.289787] ---[ end trace ecf1afaf252a6817 ]---
[13489.430667] RIP: 0010:native_queued_spin_lock_slowpath+0x174/0x1c0
[13489.527375] Code: ff 0f 84 e6 fe ff ff e9 1c ff ff ff c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 04 48 63 f6 48 05 00 39 02 00 48 03 04 f5 00 57 96 9d <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32
[13489.731112] RSP: 0018:ffffac1ee8f3bbb0 EFLAGS: 00010002
[13489.832850] RAX: 655f636970639864 RBX: ffff8f2134f5a5e0 RCX: 0000000000280000
[13489.935296] RDX: ffff8f217fd23900 RSI: 0000000000000d3b RDI: ffff8f2134f5a5e0
[13490.039156] RBP: ffff8f2134f5a5a0 R08: ffff8f19493d4b00 R09: 0000000000000004
[13490.144322] R10: ffff8f18ffb58000 R11: 0000000000000001 R12: ffff8f2134f5a5d8
[13490.250821] R13: ffff8f1956703300 R14: ffffac1ee8f219a0 R15: ffffac1ee8f219a0
[13490.358434] FS:  00007fefe3fff700(0000) GS:ffff8f217fd00000(0000) knlGS:0000000000000000
[13490.468331] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13490.579143] CR2: 00007fefeb028aa9 CR3: 0000000f19e0a005 CR4: 00000000001626e0
[13490.690580] Fixing recursive fault but reboot is needed!
YWHyuk commented 2 years ago

I built and installed QEMU-gvm-vcpupin like this:

$ sudo make install

So when I run it with the -h option, I can see the -local-cpu option:

dell-4 $ qemu-system-x86_64 -h | grep local-cpu
-local-cpu [cpus=]n[,start=start][,iplist="ip1[ ip2]..."]

After a reboot, I can no longer reproduce this bug 😢
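
For anyone who hits the same `rdma_bind_addr failed, ret -99` in dmesg: on Linux, errno 99 is EADDRNOTAVAIL ("Cannot assign requested address"), which usually means the address being bound is not configured on the local host. In the logs above, dell-3 tries to bind 10.10.20.54 and dell-4 tries to bind 10.10.20.53. The report does not say which IP belongs to which machine, so whether the iplist order matters here is only a guess. Below is a minimal sketch for checking whether an IP is bindable on the current host. The helper name `can_bind_locally` is hypothetical, and it uses a plain TCP socket as a stand-in, not the RDMA CM, so it only shows whether the IP is local, not whether the RDMA device can use it.

```python
import errno
import socket


def can_bind_locally(ip: str, port: int = 0) -> bool:
    """Return True if `ip` can be bound on this host.

    Hypothetical helper: uses a TCP socket as a proxy for the RDMA bind.
    Port 0 lets the kernel pick a free port, so only the address is tested.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, port))
            return True
        except OSError as exc:
            # EADDRNOTAVAIL is the same condition reported as "ret -99" on Linux.
            if exc.errno == errno.EADDRNOTAVAIL:
                return False
            raise  # any other failure is unrelated to address locality


if __name__ == "__main__":
    # The two addresses passed via -i in the report.
    for addr in ("10.10.20.53", "10.10.20.54"):
        print(addr, "bindable here:", can_bind_locally(addr))
```

Running this on each host shows which of the two addresses that host actually owns, which can then be compared against the address printed in its "wait for RDMA connection on ..." line.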