ememos / GiantVM


QEMU-gvm-vcpupin: Segmentation fault in QEMU #18

Closed ChoKyuWon closed 3 years ago

ChoKyuWon commented 3 years ago

I ran this command on server1:

/home/memos/GiantVM/QEMU-gvm-vcpupin/x86_64-softmmu/qemu-system-x86_64 \
    -enable-kvm -hda bionic-server-cloudimg-amd64.img -hdb user-data.img \
    -cpu host,-kvm-asyncpf -machine kernel-irqchip=off \
    -smp 8 -m 32164 -nographic -serial mon:stdio -monitor telnet:127.0.0.1:1234,server,nowait \
    -redir tcp:5556::22 \
    -local-cpu 4,start=0,iplist="10.10.20.51 10.10.20.52"

Here is the QEMU output:

CPU Info
Total: 8
Local: 4 [0-3]
Remote: 4[ 4 5 6 7 ]
KVM API version[12], QEMU version[12]
WARNING: Image format was not specified for 'user-data.img' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
CPU 4 is remote CPU, pause
CPU 5 is remote CPU, pause
CPU 6 is remote CPU, pause
CPU 7 is remote CPU, pause
start kvm dsm server, total memory size: 33726398464
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
error: msi send vector in range 0-15
error: cannot find current apic
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
run_on_cpu: not local CPU, ignore here. This could be a latent bug.
QEMU nums: 2, Total CPU nums: 8, CPU per QEMU: 4
connect_io_router...
QEMU 1 wait for RDMA connection on 10.10.20.51:40002
RDMA listen to 10.10.20.51:40002
RDMA ERROR: result not equal to event_addr_resolved RDMA_CM_EVENT_ADDR_ERROR
rdma_resolve_addr: No such file or directory
Segmentation fault (core dumped)

And here is the kernel log:

[ 8693.391278] kvm_dsm_init: Enable kvm dsm mode, this kvm instance will be node-0
[ 8693.391281] kvm_dsm_init: kvm_dsm_init: kvm 0 use RDMA connection
[ 8705.836058] qemu-system-x86[6768]: segfault at 158 ip 00007fda96a276d2 sp 00007ffe1039d238 error 4 in librdmacm.so.1.1.17.1[7fda96a23000+15000]
[ 8705.836068] Code: ff 0f 1f 80 00 00 00 00 8b 55 28 89 57 50 89 c2 8b 44 16 fc 89 44 11 fc e9 55 ff ff ff 90 66 2e 0f 1f 84 00 00 00 00 00 31 c0 <81> bf 58 01 00 00 06 01 00 00 74 02 f3 c3 48 83 7f 18 00 74 49 48
[ 8705.957526] kvm-dsm: node-0 stopping dsm server
[ 8708.518504] kvm-dsm: node-0 most frequently read 10 pages
[ 8708.518505]  vfn     gfn     read    write
[ 8708.518506]  ffffffff96e001b0        [18446744071945847204,176]      4294967295      2531262884
[ 8708.518507]  ffffffff96e001b0        [18446744071945847204,176]      4294967295      2531262884
[ 8708.518507]  ffffffff96e001b0        [18446744071945847204,176]      4294967295      2531262884
[ 8708.518508]  ffffffff9642ac22        [18446637830709968896,176]      4294967295      2531262884
[ 8708.518508]  ffffffff96e001b0        [1800485635461445632,64]        4294942559      2880060416
[ 8708.518509]  ffffffff97613740        [0,0]   4294942567      3342925040
[ 8708.518510]  ffffffff96d45f9f        [18446744071944986363,0]        0       1606561088
[ 8708.518510]  17      [18446744069414584324,0]        419208229       4294967295
[ 8708.518511]  ffff9f5f570c2bc0        [18446637830563965888,70]       0       3231988400
[ 8708.518511]  ffff9f5f56fe9488        [18446744072644599240,0]        0       2530399920
[ 8708.518511] kvm-dsm: node-0 most frequently written 10 pages
[ 8708.518512]  vfn     gfn     read    write
[ 8708.518512]  0       [18446744072646572720,50]       4294967295      1459524744
[ 8708.518513]  ffffffffc0861dc8        [18446637830563075432,23]       0       582
[ 8708.518513]  ffffffff96d2dc58        [0,32]  4294942559      0
[ 8708.518514]  ffffffff96d2d6b0        [1,176] 4294967295      2530400677
[ 8708.518514]  ffff9f5f56fe9b60        [18446637830584940320,248]      4294942559      14
[ 8708.518515]  dead000000000100        [18446744071944985688,213]      4294967295      2530504226
[ 8708.518515]  ffff9f5eeba3aa40        [642,0] 3735879680      256
[ 8708.518516]  ffffffffc0a43ab0        [1800485635461445632,192]       4294942567      3953371136
[ 8708.518516]  ffffffff9771cbc0        [18446744072644799985,0]        4294942559      4294966296
[ 8708.518517]  ffffac68c7395000        [18446637830563962880,24]       0       3342422016
[ 8708.518517] kvm-dsm: node-0 total page faults 0
[ 8708.518518] kvm-dsm: node-0 average bytes 0
[ 8708.518518] kvm-dsm: node-0 average tx latency 0us
[ 8708.518603] kvm_dsm_free: node-0 dsm root server exited with -1000

I looked through the QEMU code and found that the error message is printed by the qemu_rdma_resolve_host() function. To get more information I attached a debugger and ran the same command, but under the debugger it runs without a segfault. What a weird bug!
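For reference, the "segfault at ..." line in the kernel log above can be decoded mechanically. The following is a minimal sketch, not part of GiantVM or QEMU; the function name `decode_segfault` and the returned field names are made up for illustration. It assumes the usual x86 Linux log layout, in which all numeric fields (including the error code) are hexadecimal and the bracketed pair after the object name is the mapping's start address and size.

```python
import re

# Layout of the x86 Linux "segfault at ..." log line (all numbers are hex).
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "object": m["obj"],
        "fault_addr": int(m["addr"], 16),
        # Offset of the faulting instruction from the start of the mapping.
        "offset": ip - base,
        # x86 page-fault error code bits: 0 = page present, 1 = write, 2 = user mode.
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


line = ("qemu-system-x86[6768]: segfault at 158 ip 00007fda96a276d2 "
        "sp 00007ffe1039d238 error 4 in librdmacm.so.1.1.17.1[7fda96a23000+15000]")
info = decode_segfault(line)
print(info["object"], hex(info["offset"]), hex(info["fault_addr"]))
```

For the log line in this report, the faulting instruction sits at offset 0x46d2 from the start of the librdmacm mapping, and error code 4 means a user-mode read of a page that is not mapped. A fault address as small as 0x158 is consistent with reading a field through a NULL pointer, although the log alone does not prove that. Note that the offset is relative to the mapping shown in the log; depending on how the library is mapped, the mapping's file offset may have to be added before the value can be looked up in the library's symbols.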

ememos commented 3 years ago

[Precautions Before Starting Giant VM]

  1. Make sure the RDMA kernel module is loaded and the IB interface's IP address is up and reachable.
  2. Before you start a Giant VM, make sure no QEMU processes are left over from a previous test, e.g. run ps -ef | grep qemu, then kill -9 ....
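
The two precautions above can be checked with a small script before launching QEMU. This is only a sketch under stated assumptions and not part of GiantVM: the helper names are hypothetical, the module names `rdma_cm` and `ib_core` are a guess at what "the rdma module" refers to on a given system, and a successful bind only shows that the address is assigned to a local interface, not that the remote node is reachable over IB. Modules built into the kernel do not appear in /proc/modules, so a "missing" result may be a false alarm there.

```python
import os
import socket


def loaded_modules(proc_modules_text):
    """Return the module names listed in /proc/modules-style text."""
    return {ln.split()[0] for ln in proc_modules_text.splitlines() if ln.strip()}


def missing_modules(proc_modules_text, required=("rdma_cm", "ib_core")):
    """Return the required module names (assumed, see above) that are not loaded."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in required if name not in loaded]


def ip_is_local(ip):
    """True if a TCP socket can bind to ip, i.e. the address is assigned locally."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((ip, 0))
            return True
        except OSError:
            return False


def leftover_qemu_pids(proc_root="/proc"):
    """Return PIDs whose command name starts with 'qemu-system'."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # process exited or is not readable
        if comm.startswith("qemu-system"):
            pids.append(int(entry))
    return sorted(pids)


def preflight(local_ips, proc_root="/proc"):
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    try:
        with open(os.path.join(proc_root, "modules")) as f:
            modules_text = f.read()
    except OSError:
        modules_text = ""
    for name in missing_modules(modules_text):
        problems.append(f"kernel module not loaded: {name}")
    for ip in local_ips:
        if not ip_is_local(ip):
            problems.append(f"address not assigned locally: {ip}")
    for pid in leftover_qemu_pids(proc_root):
        problems.append(f"leftover QEMU process: pid {pid}")
    return problems
```

On server1 in the report above, this would be called as `preflight(["10.10.20.51"])`; any returned entry points at one of the two precautions. The script only reports leftover processes and leaves killing them to the operator.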