Erlang/OTP
http://erlang.org
Apache License 2.0

JIT related crash when Erlang is running on ESXi 8.02 #8904

Closed lhoguin closed 1 week ago

lhoguin commented 2 weeks ago

Describe the bug

We have a reproducible way of triggering a crash of the VM, but only in an ESXi 8.02 environment. The environment is running the RabbitMQ tile with RabbitMQ 3.13.6, but this might not matter. We can make the crash happen within minutes of pushing messages through RabbitMQ, sometimes within 30 seconds.

The crashes don't seem to leave an erl_crash.dump. We have investigated the core files produced by the crashes, but they're hard to make sense of as the thread where the crash occurs has something that looks more like a virtual function call table than what gdb expects:

(gdb) bt
#0  0x00007efe81c9a021 in ?? ()
#1  0x00007efe82579b2c in ?? ()
#2  0x000000000000038b in ?? ()
#3  0x000000000000004b in ?? ()
#4  0x000000000000ae0b in ?? ()
#5  0x00000000000804cb in ?? ()
#6  0xffffffffffffffff in ?? ()
#7  0xffffffffffffff9f in ?? ()
#8  0x00007efdda9b13c1 in ?? ()
#9  0x00007efe8257c1f4 in ?? ()
#10 0x000000000000038b in ?? ()
#11 0x000000000000004b in ?? ()
#12 0x000000000000ae0b in ?? ()
#13 0x00000000000804cb in ?? ()
#14 0xffffffffffffffff in ?? ()
#15 0xffffffffffffffaf in ?? ()
#16 0x000000000000003f in ?? ()
#17 0x00007efdda9a52da in ?? ()
#18 0x00007efe8257c2c8 in ?? ()
#19 0x00007efdde479c1a in ?? ()
#20 0x00007efe8257c2c8 in ?? ()
#21 0x00007efdde479b72 in ?? ()
#22 0x00007efe8257bdb4 in ?? ()
#23 0x000000000000003b in ?? ()
#24 0x00007efdda9a52da in ?? ()
#25 0x00007efe8257d2d8 in ?? ()
#26 0x00007efdda9a5301 in ?? ()
#27 0x000000000000038b in ?? ()
#28 0x000000000000004b in ?? ()
#29 0x000000000000ae0b in ?? ()
#30 0x00000000000804cb in ?? ()
#31 0xffffffffffffffff in ?? ()
...

On one occasion the backtrace did make sense, though it may only be a symptom:

(gdb) bt
#0  sweep (src_size=0, src=0x0, ohsz=140335560460856, oh=0x7fa26b3b09f0 "\270\226\275\325\031V",
    type=ErtsSweepNewHeap, n_htop=<optimized out>, n_hp=0x7fa1e659b530) at beam/erl_gc.c:2269
#1  sweep_new_heap (n_hp=<optimized out>, n_htop=<optimized out>,
    old_heap=old_heap@entry=0x7fa1e613f028 "\200", old_heap_size=old_heap_size@entry=4561192)
    at beam/erl_gc.c:2315
#2  0x00005619d5372b46 in do_minor (p=p@entry=0x7fa269be7678,
    live_hf_end=live_hf_end@entry=0xfffffffffffffff8,
    mature=mature@entry=0x7fa1e9357028 "R\211Y\346\241\177", mature_size=mature_size@entry=137032,
    new_sz=46422, objv=objv@entry=0x5619d5b2ff80, nobj=<optimized out>) at beam/erl_gc.c:1760
#3  0x00005619d537635f in minor_collection (recl=<synthetic pointer>, ygen_usage=36411, nobj=<optimized out>,
    objv=<optimized out>, need=<optimized out>, live_hf_end=<optimized out>, p=0x7fa269be7678)
    at beam/erl_gc.c:1450
#4  garbage_collect (p=p@entry=0x7fa269be7678, live_hf_end=<optimized out>,
    live_hf_end@entry=0xfffffffffffffff8, need=need@entry=0, objv=objv@entry=0x5619d5b2ff80,
    nobj=nobj@entry=0, fcalls=fcalls@entry=4000, max_young_gen_usage=<optimized out>) at beam/erl_gc.c:763
#5  0x00005619d537792c in erts_garbage_collect_nobump (p=0x7fa269be7678, need=<optimized out>,
    objv=0x5619d5b2ff80, nobj=0, fcalls=4000) at beam/erl_gc.c:902
#6  0x00005619d511e9a8 in erts_schedule (esdp=<optimized out>, p=0x7fa269be7678, calls=<optimized out>)
    at beam/erl_process.c:10276
#7  0x00007fa26d600b35 in ?? ()
#8  0x0000000000000000 in ?? ()

We have tried the debug JIT and that didn't give us any useful information. The core file looked the same as the first one, and we got nothing more than that even with the debug VM.

We have tried with the debug emu flavor, as well as the normal emu flavor, and were not able to reproduce using these.

We are looking for assistance in figuring out this crash, and appreciate any tips that may help confirm whether the problem indeed comes from the JIT.

My plan is to try to run RabbitMQ through gdb on Monday but I am not certain that will be easily possible.

To Reproduce

Reproducing requires ESXi 8.02. We can screen share and give you access to an environment where you could investigate.

We can of course share core files from our environment if you think that will be enough to figure things out.

The problem does not happen on ESXi 7 or earlier, nor have we been able to reproduce it in any other environment.

Expected behavior

No crash.

Affected versions

Both OTP-26.2.5 and OTP-27.1 have been tested, and both crash.

michaelklishin commented 2 weeks ago

Providing direct access to the environment may prove to be more complicated than it sounds, but screen sharing and providing any details the Erlang/OTP team needs goes without saying.

jhogberg commented 2 weeks ago

Thanks for your report! You get call stacks like that when crashing in JITted code. It’ll make more sense if you run (in gdb) source $OTP_REPO/erts/etc/unix/etp-commands followed by bt (or etp-stackdump-jit if you wish to see variables on the stack, too).
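When several core files need the same treatment, the steps above can be scripted. The following is a minimal sketch, assuming gdb's standard `--batch` and `-ex` options; the helper name `gdb_backtrace_argv` and all paths are hypothetical and only illustrate the shape of the command line.

```python
from pathlib import PurePosixPath


def gdb_backtrace_argv(otp_repo, beam_binary, core_file, dump_stack=False):
    """Build a gdb command line that loads etp-commands and prints a backtrace."""
    # Location of etp-commands inside an OTP source checkout.
    etp = PurePosixPath(otp_repo) / "erts" / "etc" / "unix" / "etp-commands"
    argv = ["gdb", "--batch", "-ex", f"source {etp}", "-ex", "bt"]
    if dump_stack:
        # etp-stackdump-jit additionally prints the Erlang terms on the stack.
        argv += ["-ex", "etp-stackdump-jit"]
    # gdb expects the executable first, then the core file.
    return argv + [str(beam_binary), str(core_file)]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(" ".join(gdb_backtrace_argv("/path/to/otp", "/path/to/beam.smp", "/path/to/core")))
```

The returned list can be handed to `subprocess.run` unchanged; the executable passed in should be the same beam binary that produced the core file.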

I’ll look deeper into it on Monday, I can probably join a screen sharing session in the afternoon. My e-mail is in the commit log. Try to install rr if you can, it’ll greatly cut down the time required to debug this.

lhoguin commented 2 weeks ago

Output is incomplete due to some issues I have but here's some more data. The function it crashes in is https://github.com/rabbitmq/rabbitmq-server/blob/v3.13.x/deps/rabbit/src/rabbit_variable_queue.erl#L1251-L1280

Core was generated by `/home/bitnami/otp_install/erts-14.2.5.3/bin/beam.debug.smp -W w -MBas ageffcbf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f0434c3df6c in ?? ()
[Current thread is 1 (Thread 0x7f043cbfe6c0 (LWP 61412))]
(gdb) source /home/bitnami/otp/erts/etc/unix/etp-commands
%---------------------------------------------------------------------------
% Use etp-help for a command overview and general help.
%
% To use the Erlang support module, the environment variable ROOTDIR
% must be set to the toplevel installation directory of Erlang/OTP,
% so the etp-commands file becomes:
%     $ROOTDIR/erts/etc/unix/etp-commands
% Also, erl and erlc must be in the path.
%---------------------------------------------------------------------------
etp-set-max-depth 20
etp-set-max-string-length 100
--------------- System Information ---------------
OTP release: 26
ERTS version: 14.2.5.3
Arch: x86_64-pc-linux-gnu
Endianness: Little
Word size: 64-bit
BeamAsm support: yes
SMP support: yes
Thread support: yes
Kernel poll: Supported and used
Debug compiled: yes
Lock checking: yes
Lock counting: no
Node name: rabbit@photon
Number of schedulers: 4
Number of async-threads: 1
--------------------------------------------------
(gdb) bt
#0  0x00007f0434c3df6c in rabbit_variable_queue:a/1 ()
#1  0x00007f0434b5d4d8 in rabbit_priority_queue:publish_delivered/5 () at rabbit_priority_queue.erl:232
#2  0x00007f0434898b04 in rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11 ()
    at rabbit_amqqueue_process.erl:682
#3  0x00007f0434b71450 in rabbit_queue_consumers:deliver_to_consumer/4 () at rabbit_queue_consumers.erl:275
#4  0x00007f0434b71278 in rabbit_queue_consumers:deliver_to_consumer/3 () at rabbit_queue_consumers.erl:262
#5  0x00007f0434b70b14 in rabbit_queue_consumers:deliver/6 () at rabbit_queue_consumers.erl:238
#6  0x00007f0434886a64 in rabbit_amqqueue_process:attempt_delivery/4 () at rabbit_amqqueue_process.erl:680
#7  0x00007f04348875fc in rabbit_amqqueue_process:deliver_or_enqueue/3 () at rabbit_amqqueue_process.erl:749
#8  0x00007f0434891dc8 in rabbit_amqqueue_process:handle_cast/2 () at rabbit_amqqueue_process.erl:1576
#9  0x00007f0434cadf80 in gen_server2:handle_msg/2 () at gen_server2.erl:1056
#10 0x00007f0434204f04 in proc_lib:wake_up/3 () at proc_lib.erl:251
#11 0x00007f0434002490 in erts_beamasm:normal_exit/0-CodeInfoPrologue ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) etp-stackdump-jit
% Stacktrace (48)
I: #Cp<rabbit_variable_queue:a/1+0xb4> @ "rabbit_variable_queue.erl":1251.
0: #Cp<rabbit_variable_queue:publish_delivered/5+0x108> @ "rabbit_variable_queue.erl":581.
1: 57138.
2: #Cp<rabbit_priority_queue:publish_delivered/5+0x2f0> @ "rabbit_priority_queue.erl":232.
3: {passthrough,rabbit_variable_queue,{vqstate,{0,[],[]},{0,[],[]},{delta,undefined,0,0,undefined},{0,[],[]},{
0,[],[]},57138,57138,#<3>{#<101>{[57135|{msg_status,57135,#HeapBinary<0x10,0xf20118b3277378bd,
0xac0053b1d635335e>,{mc,mc_amqpl,{content,60,none,#HeapBinary<0x3,0xbababababa020010>,rabbit_framing_amqp_0_9_
1,[#RefcBinary<0x1004,0x7f03c716a4e0,0x7f03b981f038,0x7f03b9a01513,(nil)>]},#<3>{#<1>{#<4001>{[rts|
1728050359701],[x|#HeapBinary<0x6,0xbaba746365726964>]}},#<101>{[rk,#HeapBinary<0x4,0xbabababa35716370>],[id|
#HeapBinary<0x10,0xf20118b3277378bd,0xac0053b1d635335e>]}}},true,true,rabbit_msg_store,true,msg_store,{message
_properties,undefined,false,4100}}],[57134|{msg_status,57134,#HeapBinary<0x10,0x8b732b0c095f21c4,
0x7a3560ef5e014b02>,{mc,mc_amqpl,{content,60,none,#HeapBinary<0x3,0xbababababa020010>,rabbit_framing_amqp_0_9_
1,[#RefcBinary<0x1004,0x7f03c716ab60,0x7f03b981f038,0x7f03b9a004d5,(nil)>]},#<3>{#<1>{#<4001>{[rts|
1728050359698],[x|#HeapBinary<0x6,0xbaba746365726964>]}},#<101>{[rk,#HeapBinary<0x4,0xbabababa35716370>],[id|
#HeapBinary<0x10,0x8b732b0c095f21c4,0x7a3560ef5e014b02>]}}},true,true,rabbit_msg_store,true,msg_store,{message
_properties,undefined,false,4100}}]},#<100>{#<44>{[57136|{msg_status,57136,#HeapBinary<0x10,
0x49fa0b346dce20d9,0x9fcf386a33cf6bdb>,{mc,mc_amqpl,{content,60,none,#HeapBinary<0x3,0xbababababa020010>,rabbi
t_framing_amqp_0_9_1,[#RefcBinary<0x1004,0x7f03c717b848,0x7f03b981f038,0x7f03b9a02551,(nil)>]},#<3>{#<1>{
#<4001>{[rts|1728050359704],[x|#HeapBinary<0x6,0xbaba746365726964>]}},#<101>{[rk,#HeapBinary<0x4,
0xbabababa35716370>],[id|#HeapBinary<0x10,0x49fa0b346dce20d9,0x9fcf386a33cf6bdb>]}}},true,true,rabbit_msg_stor
e,true,msg_store,{message_properties,undefined,false,4100}}],[57137|{msg_status,57137,#HeapBinary<0x10,
0x830862e7f93d0208,0xb0c7093c2f083156>,{mc,mc_amqpl,{content,60,none,#HeapBinary<0x3,0xbababababa020010>,rabbi
t_framing_amqp_0_9_1,[#RefcBinary<0x1004,0x7f03c717be30,0x7f03b981f038,0x7f03b9a0358f,(nil)>]},#<3>{#<1>{
#<4001>{[rts|1728050359707],[x|#HeapBinary<0x6,0xbaba746365726964>]}},#<101>{[rk,#HeapBinary<0x4,
0xbabababa35716370>],[id|#HeapBinary<0x10,0x830862e7f93d0208,0xb0c7093c2f083156>]}}},true,true,rabbit_msg_stor
e,true,msg_store,{message_properties,undefined,false,4100}}]}}},#{Keys:{} Values:{}},undefined,rabbit_queue_in
dex,{qistate,"/var/lib/rabbitmq/mnesia/rabbit@photon/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/169NVA
G94O"++[...],{#{Keys:{25} Values:{{segment,25,"/var/lib/rabbitmq/mnesia/rabbit@photon/msg_stores/vhosts/628WB7
9CIFDYO9LJI6DKMI09L/queues/169NVAG94O"++[...],{array,2048,0,undefined,{{100,100,100,100,100,100,100,100,{10,10
,10,10,10,10,10,{undefined,undefined,undefined,undefined,undefined,undefined,undefined,{no_pub,no_del,ack},{no
_pub,no_del,ack},{no_pub,no_del,ack}},{{no_pub,no_del,ack},{no_pub,no_del,ack},{no_pub,no_del,ack},{no_pub,no_
del,ack},{no_pub,no_del,ack},{no_pub,no_del,ack},{no_pub,no_del,ack},{no_pub,no_del,ack},{no_pub,no_del,ack},{
no_pub,no_del,ack}},{{no_pub,no_del,ack},{no_pub,no_del,ack},{no_pub,no_del,ack},{no_pub,no_del,ack},{no_pub,n
o_del,ack},{no_pub,no_del,ack},{no_pub,del,ack},undefined,undefined,undefined},10},{{undefined,undefined,undef
ined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,und
efined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,u
ndefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined
,undefined,undefined,undefined,undefined,undefined},{undef

...
jhogberg commented 2 weeks ago

Wonderful. x/5i $pc ?

lhoguin commented 2 weeks ago

(gdb) x/5i $pc
=> 0x7f0434c3df6c <rabbit_variable_queue:a/1+100>:  cmp    DWORD PTR [rsi-0x2],0x140
   0x7f0434c3df73 <rabbit_variable_queue:a/1+107>:  jne    0x7f0434c3e573 <rabbit_variable_queue:a/1+23>
   0x7f0434c3df79 <rabbit_variable_queue:a/1+113>:  cmp    QWORD PTR [rsi+0x6],0x1479cb
   0x7f0434c3df81 <rabbit_variable_queue:a/1+121>:  jne    0x7f0434c3e573 <rabbit_variable_queue:a/1+23>
   0x7f0434c3df87 <rabbit_variable_queue:a/1+127>:  movabs rcx,0xdeadbeaf0000001b
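The constants in this disassembly can be read with a small sketch, assuming the 64-bit ERTS term layout from erts/emulator/beam/erl_term.h: boxed pointers carry the primary tag 0x2, and a tuple header word is the arity shifted left by 6. Those two constants and the helper names are assumptions for illustration and are not stated in this thread.

```python
# Assumed ERTS constants (64-bit build); not taken from this thread.
TAG_PRIMARY_BOXED = 0x2   # low two bits of a tagged boxed pointer
HEADER_ARITY_OFFS = 6     # tuple header word = arity << 6


def tuple_arity_from_header(header_word):
    """Return the arity encoded in a tuple header word, or None if it is not one."""
    if header_word & ((1 << HEADER_ARITY_OFFS) - 1) != 0:
        return None  # low six bits set: some other kind of header
    return header_word >> HEADER_ARITY_OFFS


def untagged_offset(displacement):
    """Byte offset inside the object for a displacement off a tagged boxed pointer."""
    return displacement + TAG_PRIMARY_BOXED


if __name__ == "__main__":
    print(tuple_arity_from_header(0x140))  # constant in cmp DWORD PTR [rsi-0x2],0x140
    print(untagged_offset(-0x2))           # which word [rsi-0x2] addresses
    print(untagged_offset(0x6))            # which word [rsi+0x6] addresses
```

Under these assumptions the first `cmp` tests for a five-element tuple header at offset 0, and the second compares the first element at offset 8 against an immediate. That would fit a term shaped like the `{delta,undefined,0,0,undefined}` tuple visible in the stack dump, although the thread does not confirm which term is being matched.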

jhogberg commented 2 weeks ago

info registers

p $_siginfo

lhoguin commented 2 weeks ago

(gdb) info registers
rax            0x7f03d40884ba      139654418957498
rbx            0x7f043cbf9d00      139656175787264
rcx            0xdeadbeaf0000001b  -2401053367490052069
rdx            0x7f0434c3df18      139656041848600
rsi            0x0                 0
rdi            0x0                 0
rbp            0x7f03d40b7918      0x7f03d40b7918
rsp            0x7f03d40b7968      0x7f03d40b7968
r8             0x179               377
r9             0x7f043cbf99e8      139656175786472
r10            0x7fffad3d3080      140736099856512
r11            0x7                 7
r12            0x1                 1
r13            0x7f042d4a2270      139655916429936
r14            0x112               274
r15            0x7f03d4088648      139654418957896
rip            0x7f0434c3df6c      0x7f0434c3df6c <rabbit_variable_queue:a/1+100>
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0xf8fe0100          4177395968
k1             0x1fffffff          536870911
k2             0xf                 15
k3             0x0                 0
k4             0x0                 0
k5             0x0                 0
k6             0x0                 0
k7             0x0                 0
(gdb) p $_siginfo
$1 = {si_signo = 11, si_errno = 0, si_code = 1, _sifields = {_pad = {-2, -1, 0 <repeats 26 times>}, _kill = {
      si_pid = -2, si_uid = 4294967295}, _timer = {si_tid = -2, si_overrun = -1, si_sigval = {sival_int = 0,
        sival_ptr = 0x0}}, _rt = {si_pid = -2, si_uid = 4294967295, si_sigval = {sival_int = 0,
        sival_ptr = 0x0}}, _sigchld = {si_pid = -2, si_uid = 4294967295, si_status = 0, si_utime = 0,
      si_stime = 0}, _sigfault = {si_addr = 0xfffffffffffffffe, _addr_lsb = 0, _addr_bnd = {_lower = 0x0,
        _upper = 0x0}}, _sigpoll = {si_band = -2, si_fd = 0}, _sigsys = {_call_addr = 0xfffffffffffffffe,
      _syscall = 0, _arch = 0}}}
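The register dump and the signal information can be cross-checked with a little arithmetic. This sketch assumes only standard x86-64 address wrap-around and the Linux value `SEGV_MAPERR = 1`; the helper names are made up for illustration.

```python
import struct

MASK64 = (1 << 64) - 1
SEGV_MAPERR = 1  # Linux si_code: address not mapped to any object


def effective_address(base, displacement):
    """x86-64 address arithmetic wraps modulo 2**64."""
    return (base + displacement) & MASK64


def pad_to_address(low, high):
    """Reassemble a 64-bit value from two 32-bit _pad entries (little-endian)."""
    return struct.unpack("<Q", struct.pack("<ii", low, high))[0]


if __name__ == "__main__":
    rsi = 0x0  # from `info registers`
    print(hex(effective_address(rsi, -0x2)))  # operand of cmp DWORD PTR [rsi-0x2],0x140
    print(hex(pad_to_address(-2, -1)))        # _pad = {-2, -1, ...} in $_siginfo
```

Both expressions evaluate to 0xfffffffffffffffe, the `si_addr` gdb reports, so the fault is consistent with the faulting instruction dereferencing `rsi - 2` while `rsi` holds 0. It does not by itself explain why `rsi` is 0.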

jhogberg commented 2 weeks ago

Thanks, that's about as far as we can get outside of a shared session. Did you have any luck installing and getting rr working?

lhoguin commented 2 weeks ago

Not yet. I can't seem to install the Fedora package (maybe I can force it?), and building from source requires a few packages that are not available.

jhogberg commented 2 weeks ago

Okay, in the meantime, can you run cat /proc/cpuinfo under both versions of ESXi?

lhoguin commented 2 weeks ago

I only have access to ESXi 8.02 right now, but I will ask around for a 7.x output as well. So far, for ESXi 8.02 we have:


$ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 106
model name  : Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
stepping    : 6
microcode   : 0xd0003a5
cpu MHz     : 1995.312
cache size  : 43008 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 27
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities
bugs        : spectre_v1 spectre_v2 spec_store_bypass swapgs itlb_multihit mmio_stale_data eibrs_pbrsb gds bhi
bogomips    : 3990.62
clflush size    : 64
cache_alignment : 64
address sizes   : 45 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 106
model name  : Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
stepping    : 6
microcode   : 0xd0003a5
cpu MHz     : 1995.312
cache size  : 43008 KB
physical id : 0
siblings    : 4
core id     : 1
cpu cores   : 4
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 27
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities
bugs        : spectre_v1 spectre_v2 spec_store_bypass swapgs itlb_multihit mmio_stale_data eibrs_pbrsb gds bhi
bogomips    : 3990.62
clflush size    : 64
cache_alignment : 64
address sizes   : 45 bits physical, 48 bits virtual
power management:

processor   : 2
vendor_id   : GenuineIntel
cpu family  : 6
model       : 106
model name  : Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
stepping    : 6
microcode   : 0xd0003a5
cpu MHz     : 1995.312
cache size  : 43008 KB
physical id : 0
siblings    : 4
core id     : 2
cpu cores   : 4
apicid      : 2
initial apicid  : 2
fpu     : yes
fpu_exception   : yes
cpuid level : 27
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities
bugs        : spectre_v1 spectre_v2 spec_store_bypass swapgs itlb_multihit mmio_stale_data eibrs_pbrsb gds bhi
bogomips    : 3990.62
clflush size    : 64
cache_alignment : 64
address sizes   : 45 bits physical, 48 bits virtual
power management:

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 6
model       : 106
model name  : Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
stepping    : 6
microcode   : 0xd0003a5
cpu MHz     : 1995.312
cache size  : 43008 KB
physical id : 0
siblings    : 4
core id     : 3
cpu cores   : 4
apicid      : 3
initial apicid  : 3
fpu     : yes
fpu_exception   : yes
cpuid level : 27
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities
bugs        : spectre_v1 spectre_v2 spec_store_bypass swapgs itlb_multihit mmio_stale_data eibrs_pbrsb gds bhi
bogomips    : 3990.62
clflush size    : 64
cache_alignment : 64
address sizes   : 45 bits physical, 48 bits virtual
power management:

lhoguin commented 2 weeks ago

I was sent the following output from a different environment running ESXi 7.0.3:


processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 106
model name  : Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
stepping    : 6
microcode   : 0xd000311
cpu MHz     : 1995.312
cache size  : 43008 KB
physical id : 0
siblings    : 1
core id     : 0
cpu cores   : 1
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 27
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
bugs        : apic_c1e spectre_v1 spectre_v2 spec_store_bypass swapgs itlb_multihit mmio_stale_data eibrs_pbrsb gds bhi
bogomips    : 3990.62
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 106
model name  : Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
stepping    : 6
microcode   : 0xd000311
cpu MHz     : 1995.312
cache size  : 43008 KB
physical id : 2
siblings    : 1
core id     : 0
cpu cores   : 1
apicid      : 2
initial apicid  : 2
fpu     : yes
fpu_exception   : yes
cpuid level : 27
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
bugs        : apic_c1e spectre_v1 spectre_v2 spec_store_bypass swapgs itlb_multihit mmio_stale_data eibrs_pbrsb gds bhi
bogomips    : 3990.62
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management:
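The two `flags` lines are long enough that differences are easy to miss by eye. A minimal sketch of a set comparison follows, with the flag lists copied from processor 0 in each output above; the function name `diff_flags` is hypothetical.

```python
ESXI_8_FLAGS = """
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon
rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni
pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm
3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase
tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap
avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec
xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes
vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear
flush_l1d arch_capabilities
"""

ESXI_7_FLAGS = """
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon
nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq
ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes
xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault
invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2
smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd
avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d
arch_capabilities
"""


def diff_flags(a, b):
    """Return (flags only in a, flags only in b) for two whitespace-separated lists."""
    set_a, set_b = set(a.split()), set(b.split())
    return set_a - set_b, set_b - set_a


if __name__ == "__main__":
    only_8, only_7 = diff_flags(ESXI_8_FLAGS, ESXI_7_FLAGS)
    print("only on ESXi 8.02 :", " ".join(sorted(only_8)))
    print("only on ESXi 7.0.3:", " ".join(sorted(only_7)))
```

The flags exposed only on the ESXi 8.02 guest include several additional AVX-512 and vector extensions (for example `avx512ifma`, `avx512vbmi`, `gfni`, `vaes`, `vpclmulqdq`). The two guests also differ in microcode revision and vCPU topology, so the flag difference is one variable among several and does not by itself identify a cause.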

lhoguin commented 2 weeks ago

As for rr, the environment is too constrained for me to install or run it (missing packages and other issues). Rather than rebuild the world from source, I have been asking colleagues whether there is a simpler way to make it available.

jhogberg commented 2 weeks ago

Hm, that's very interesting. Can you x/200i 0x00007f0434898b04?

# That is, this function.
#2  0x00007f0434898b04 in rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11 ()
    at rabbit_amqqueue_process.erl:682

lhoguin commented 2 weeks ago

(gdb) x/200i 0x00007f0434898b04
   0x7f0434898b04 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+100>:
    mov    rsi,QWORD PTR [rbx]
   0x7f0434898b07 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+103>:    test   rsi,rsi
   0x7f0434898b0a <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+106>:    test   sil,0x1
   0x7f0434898b0e <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+110>:
    jne    0x7f0434898d3d <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11>
   0x7f0434898b14 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+116>:
    and    rsi,0xfffffffffffffffb
   0x7f0434898b18 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+120>:
    cmp    DWORD PTR [rsi-0x2],0x80
   0x7f0434898b1f <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+127>:
    jne    0x7f0434898d3d <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11>
   0x7f0434898b25 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+133>:
    movabs rcx,0xdeadbeaf0000001b
   0x7f0434898b2f <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+143>:
    mov    QWORD PTR [rbx+0x8],rcx
   0x7f0434898b33 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+147>:
    mov    QWORD PTR [rbx+0x10],rcx
   0x7f0434898b37 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+151>:    lea    rdx,[r15+0x70]
   0x7f0434898b3b <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+155>:    cmp    rdx,rsp
   0x7f0434898b3e <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+158>:
    jbe    0x7f0434898b4c <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+172>
   0x7f0434898b40 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+160>:    mov    ecx,0x1
   0x7f0434898b45 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+165>:    nop
   0x7f0434898b46 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+166>:
    rex call 0x7f0434000c50 <global::garbage_collect>
   0x7f0434898b4c <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+172>:
    mov    rsi,QWORD PTR [rbx]
   0x7f0434898b4f <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+175>:    test   rsi,rsi
   0x7f0434898b52 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+178>:
    and    rsi,0xfffffffffffffffb
   0x7f0434898b56 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+182>:
    mov    rax,QWORD PTR [rbx]
   0x7f0434898b59 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+185>:    test   rax,rax
   0x7f0434898b5c <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+188>:    test   al,0x1
   0x7f0434898b5e <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+190>:
    jne    0x7f0434898b69 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+201>
   0x7f0434898b60 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+192>:
    and    rax,0xfffffffffffffffb
   0x7f0434898b64 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+196>:    cmp    rax,rsi
   0x7f0434898b67 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+199>:
    je     0x7f0434898b6b <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+203>
   0x7f0434898b69 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+201>:    ud2
   0x7f0434898b6b <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+203>:
    vpermilpd xmm0,XMMWORD PTR [rsi+0x6],0x1
   0x7f0434898b72 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+210>:
    vmovups XMMWORD PTR [rbx],xmm0
   0x7f0434898b76 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+214>:
    mov    QWORD PTR [r15],0xc0
   0x7f0434898b7d <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+221>:
    vmovups xmm0,XMMWORD PTR [rsp+0x8]
   0x7f0434898b83 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+227>:
    vmovups XMMWORD PTR [r15+0x8],xmm0
   0x7f0434898b89 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+233>:
    mov    rdi,QWORD PTR [rbx+0x8]
   0x7f0434898b8d <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+237>:    test   rdi,rdi
   0x7f0434898b90 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+240>:
    mov    QWORD PTR [r15+0x18],rdi
   0x7f0434898b94 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+244>:    lea    rdi,[r15+0x2]
   0x7f0434898b98 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+248>:    add    r15,0x20
   0x7f0434898b9c <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+252>:
    mov    QWORD PTR [rbx+0x8],rdi
   0x7f0434898ba0 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+256>:
    mov    QWORD PTR [r15],0x80
   0x7f0434898ba7 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+263>:
    mov    rdi,QWORD PTR [rbx]
   0x7f0434898baa <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+266>:    test   rdi,rdi
   0x7f0434898bad <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+269>:
    mov    QWORD PTR [r15+0x8],rdi
   0x7f0434898bb1 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+273>:
    mov    rdi,QWORD PTR [rsp]
   0x7f0434898bb5 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+277>:    test   rdi,rdi
   0x7f0434898bb8 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+280>:
    mov    QWORD PTR [r15+0x10],rdi
   0x7f0434898bbc <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+284>:    lea    rdi,[r15+0x2]
   0x7f0434898bc0 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+288>:    add    r15,0x18
   0x7f0434898bc4 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+292>:
    mov    QWORD PTR [rbx],rdi
   0x7f0434898bc7 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+295>:
    mov    QWORD PTR [r15],0x80
   0x7f0434898bce <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+302>:
    vpermilpd xmm0,XMMWORD PTR [rbx],0x1
   0x7f0434898bd4 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+308>:
    vmovups XMMWORD PTR [r15+0x8],xmm0
   0x7f0434898bda <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+314>:    lea    rdi,[r15+0x2]
   0x7f0434898bde <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+318>:    add    r15,0x18
   0x7f0434898be2 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+322>:
    mov    QWORD PTR [rbx],rdi
   0x7f0434898be5 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+325>:    add    rsp,0x18
   0x7f0434898be9 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+329>:    dec    r14
   0x7f0434898bec <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+332>:
    jl     0x7f043489a2ec <rabbit_amqqueue_process::codeFooter+106>
   0x7f0434898bf2 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+338>:    ret
   0x7f0434898bf3 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+339>:
    movabs rcx,0xdeadbeaf0000001b
   0x7f0434898bfd <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+349>:
    mov    QWORD PTR [rbx+0x58],rcx
   0x7f0434898c01 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+353>:
    mov    QWORD PTR [rbx+0x60],rcx
   0x7f0434898c05 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+357>:    lea    rdx,[r15+0x50]
   0x7f0434898c09 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+361>:    cmp    rdx,rsp
   0x7f0434898c0c <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+364>:
    jbe    0x7f0434898c1c <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+380>
   0x7f0434898c0e <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+366>:    mov    ecx,0xb
   0x7f0434898c13 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+371>:    nop    DWORD PTR [rax]
   0x7f0434898c16 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+374>:
    rex call 0x7f0434000c50 <global::garbage_collect>
   0x7f0434898c1c <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+380>:    sub    rsp,0x30
   0x7f0434898c20 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+384>:
    vpermilpd xmm0,XMMWORD PTR [rbx+0x40],0x1
   0x7f0434898c27 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+391>:
    vmovups XMMWORD PTR [rsp],xmm0
   0x7f0434898c2c <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+396>:
    mov    rdi,QWORD PTR [rbx+0x38]
   0x7f0434898c30 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+400>:    test   rdi,rdi
   0x7f0434898c33 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+403>:
    mov    QWORD PTR [rsp+0x10],rdi
   0x7f0434898c38 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+408>:
    mov    rdi,QWORD PTR [rbx+0x28]
   0x7f0434898c3c <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+412>:    test   rdi,rdi
   0x7f0434898c3f <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+415>:
    mov    QWORD PTR [rsp+0x18],rdi
   0x7f0434898c44 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+420>:
    mov    rdi,QWORD PTR [rbx+0x18]
   0x7f0434898c48 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+424>:    test   rdi,rdi
   0x7f0434898c4b <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+427>:
    mov    QWORD PTR [rsp+0x20],rdi
   0x7f0434898c50 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+432>:
    mov    rdi,QWORD PTR [rbx+0x8]
   0x7f0434898c54 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+436>:    test   rdi,rdi
   0x7f0434898c57 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+439>:
    mov    QWORD PTR [rsp+0x28],rdi
   0x7f0434898c5c <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+444>:
    mov    rdi,QWORD PTR [rbx+0x50]
   0x7f0434898c60 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+448>:    test   rdi,rdi
   0x7f0434898c63 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+451>:
    mov    QWORD PTR [rbx],rdi
   0x7f0434898c66 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11>:    movabs rax,0x7f042cded600
   0x7f0434898c70 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+10>: test   rax,rax
   0x7f0434898c73 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+13>: nop
   0x7f0434898c74 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+14>: call   QWORD PTR [rax+r12*8]
   0x7f0434898c78 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+18>: vmovups xmm0,XMMWORD PTR [rsp]
   0x7f0434898c7d <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+23>:
    vmovups XMMWORD PTR [rbx+0x8],xmm0
   0x7f0434898c82 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+28>:
    mov    rdi,QWORD PTR [rsp+0x10]
   0x7f0434898c87 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+33>: test   rdi,rdi
   0x7f0434898c8a <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+36>:
    mov    QWORD PTR [rbx+0x18],rdi
   0x7f0434898c8e <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+40>: mov    rdi,QWORD PTR [rbx]
   0x7f0434898c91 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+43>: test   rdi,rdi
   0x7f0434898c94 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+46>:
    mov    QWORD PTR [rbx+0x20],rdi
   0x7f0434898c98 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+50>:
    mov    rdi,QWORD PTR [rsp+0x28]
   0x7f0434898c9d <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+55>: test   rdi,rdi
   0x7f0434898ca0 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+58>: mov    QWORD PTR [rbx],rdi
   0x7f0434898ca3 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+61>:
    mov    rdi,QWORD PTR [rsp+0x18]
   0x7f0434898ca8 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+66>: test   rdi,rdi
   0x7f0434898cab <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+69>:
    mov    QWORD PTR [rsp+0x28],rdi
   0x7f0434898cb0 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+74>: add    rsp,0x20
   0x7f0434898cb4 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+78>: nop    DWORD PTR [rax]
   0x7f0434898cb7 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+81>:
    call   0x7f0434885ec0 <rabbit_amqqueue_process:discard/5-CodeInfoPrologue+40>
   0x7f0434898cbc <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+86>: movabs rcx,0xdeadbeaf0000001b
   0x7f0434898cc6 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+96>: mov    QWORD PTR [rbx+0x8],rcx
   0x7f0434898cca <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+100>:
    mov    QWORD PTR [rbx+0x10],rcx
   0x7f0434898cce <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+104>:    lea    rdx,[r15+0x58]
   0x7f0434898cd2 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+108>:    cmp    rdx,rsp
   0x7f0434898cd5 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+111>:
    jbe    0x7f0434898ce4 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+126>
   0x7f0434898cd7 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+113>:    mov    ecx,0x1
   0x7f0434898cdc <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+118>:    xchg   ax,ax
   0x7f0434898cde <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+120>:
    rex call 0x7f0434000c50 <global::garbage_collect>
   0x7f0434898ce4 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+126>:
    mov    QWORD PTR [r15],0xc0
   0x7f0434898ceb <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+133>:
    vpermilpd xmm0,XMMWORD PTR [rsp],0x1
   0x7f0434898cf2 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+140>:
    vmovups XMMWORD PTR [r15+0x8],xmm0
   0x7f0434898cf8 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+146>:
    mov    QWORD PTR [r15+0x18],0x38b
   0x7f0434898d00 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+154>:    lea    rdi,[r15+0x2]
   0x7f0434898d04 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+158>:    add    r15,0x20
   0x7f0434898d08 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+162>:
    mov    QWORD PTR [rbx+0x8],rdi
   0x7f0434898d0c <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+166>:
    mov    QWORD PTR [r15],0x80
   0x7f0434898d13 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+173>:
    vpermilpd xmm0,XMMWORD PTR [rbx],0x1
   0x7f0434898d19 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+179>:
    vmovups XMMWORD PTR [r15+0x8],xmm0
   0x7f0434898d1f <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+185>:    lea    rdi,[r15+0x2]
   0x7f0434898d23 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+189>:    add    r15,0x18
   0x7f0434898d27 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+193>:
    mov    QWORD PTR [rbx],rdi
   0x7f0434898d2a <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+196>:    add    rsp,0x10
   0x7f0434898d2e <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+200>:    dec    r14
   0x7f0434898d31 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+203>:
    jl     0x7f043489a2ec <rabbit_amqqueue_process::codeFooter+106>
   0x7f0434898d37 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+209>:    ret
   0x7f0434898d38 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+210>:
    jmp    0x7f043489a1d8 <rabbit_amqqueue_process:'-attempt_delivery/4-inlined-0-'/1-CodeInfoPrologue+40>
   0x7f0434898d3d <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11>:    mov    rdi,QWORD PTR [rbx]
   0x7f0434898d40 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+3>:  test   rdi,rdi
   0x7f0434898d43 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+6>:
    mov    QWORD PTR [r13+0x78],rdi
   0x7f0434898d47 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+10>:
    mov    QWORD PTR [r13+0x70],0x1450
   0x7f0434898d4f <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+18>: xor    ecx,ecx
   0x7f0434898d51 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+20>:
    call   0x7f043489a2da <rabbit_amqqueue_process::codeFooter+88>
   0x7f0434898d56 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+25>: and    rcx,0xfffffffffffffffb
   0x7f0434898d5a <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+29>:
    vmovups zmm0,ZMMWORD PTR [rcx+0x26]
   0x7f0434898d64 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+39>:
    vmovups ZMMWORD PTR [rbx+0x8],zmm0
   0x7f0434898d6e <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+49>:
    vmovups xmm0,XMMWORD PTR [rcx+0x66]
   0x7f0434898d73 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+54>:
    vmovups XMMWORD PTR [rbx+0x48],xmm0
   0x7f0434898d78 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+59>:
    jmp    0x7f04348989d0 <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11-CodeInfoPrologue+40>
   0x7f0434898d7d <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11>:    nop
   0x7f0434898d7e <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+1>:  nop
   0x7f0434898d7f <rabbit_amqqueue_process:'-attempt_delivery/4-fun-0-'/11+2>:  nop
   0x7f0434898d80 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue>:
    call   0x7f043489a2e6 <rabbit_amqqueue_process::codeFooter+100>
   0x7f0434898d85 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+5>: nop
   0x7f0434898d86 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+6>: nop
   0x7f0434898d87 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+7>:
    add    BYTE PTR [rax],al
   0x7f0434898d89 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+9>:
    add    BYTE PTR [rax],al
   0x7f0434898d8b <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+11>:
    add    BYTE PTR [rax],al
   0x7f0434898d8d <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+13>:
    add    BYTE PTR [rax],al
   0x7f0434898d8f <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+15>:
    add    BYTE PTR [rbx+0xd2f],cl
   0x7f0434898d95 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+21>:
    add    BYTE PTR [rax],al
   0x7f0434898d97 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+23>:
    add    BYTE PTR [rbx+0x10b1],cl
   0x7f0434898d9d <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+29>:
    add    BYTE PTR [rax],al
   0x7f0434898d9f <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+31>:
    add    BYTE PTR [rdx],al
   0x7f0434898da1 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+33>:
    add    BYTE PTR [rax],al
   0x7f0434898da3 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+35>:
    add    BYTE PTR [rax],al
   0x7f0434898da5 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+37>:
    add    BYTE PTR [rax],al
   0x7f0434898da7 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+39>:
    add    bl,ch
   0x7f0434898da9 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+41>:    (bad)
   0x7f0434898daa <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+42>:    nop
   0x7f0434898dab <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+43>:
    call   0x7f043489a2d4 <rabbit_amqqueue_process::codeFooter+82>
   0x7f0434898db0 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2>:
    lea    rdx,[rip+0x9]        # 0x7f0434898dc0 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2+16>
   0x7f0434898db7 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2+7>:  dec    r14
   0x7f0434898dba <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2+10>:
    jle    0x7f043489a2bc <rabbit_amqqueue_process::codeFooter+58>
   0x7f0434898dc0 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2+16>:
    jmp    0x7f0434888e48 <rabbit_amqqueue_process:fetch/2-CodeInfoPrologue+40>
   0x7f0434898dc5 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2+21>: and    rcx,0xfffffffffffffffb
   0x7f0434898dc9 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2+25>:
    mov    rax,QWORD PTR [rcx+0x26]
   0x7f0434898dcd <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2+29>: mov    QWORD PTR [rbx+0x8],rax
   0x7f0434898dd1 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2+33>:
    jmp    0x7f0434898da8 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2-CodeInfoPrologue+40>
   0x7f0434898dd3 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2>:    nop
   0x7f0434898dd4 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2+1>:  nop
   0x7f0434898dd5 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2+2>:  nop
   0x7f0434898dd6 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2+3>:  nop
   0x7f0434898dd7 <rabbit_amqqueue_process:'-run_message_queue/2-fun-0-'/2+4>:  nop
   0x7f0434898dd8 <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue>:
    call   0x7f043489a2e6 <rabbit_amqqueue_process::codeFooter+100>
   0x7f0434898ddd <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+5>:  nop
   0x7f0434898dde <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+6>:  nop
   0x7f0434898ddf <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+7>:
    add    BYTE PTR [rax],al
   0x7f0434898de1 <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+9>:
    add    BYTE PTR [rax],al
   0x7f0434898de3 <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+11>:
    add    BYTE PTR [rax],al
   0x7f0434898de5 <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+13>:
    add    BYTE PTR [rax],al
   0x7f0434898de7 <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+15>:
    add    BYTE PTR [rbx+0xd2f],cl
   0x7f0434898ded <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+21>:
    add    BYTE PTR [rax],al
   0x7f0434898def <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+23>: add    bl,cl
   0x7f0434898df1 <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+25>: mov    cl,0x10
   0x7f0434898df3 <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+27>:
    add    BYTE PTR [rax],al
   0x7f0434898df5 <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+29>:
    add    BYTE PTR [rax],al
   0x7f0434898df7 <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+31>:
    add    BYTE PTR [rax+rax*1],al
   0x7f0434898dfa <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+34>:
    add    BYTE PTR [rax],al
   0x7f0434898dfc <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+36>:
    add    BYTE PTR [rax],al
   0x7f0434898dfe <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+38>:
    add    BYTE PTR [rax],al
   0x7f0434898e00 <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+40>:
    jmp    0x7f0434898e08 <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4>
   0x7f0434898e02 <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+42>: nop
   0x7f0434898e03 <rabbit_amqqueue_process:'-confirm_messages/3-fun-2-'/4-CodeInfoPrologue+43>:
    call   0x7f043489a2d4 <rabbit_amqqueue_process::codeFooter+82>
jhogberg commented 2 weeks ago

Thanks. I saw that the ESXi 8.02 variant supports more AVX512 instructions and figured that perhaps the optimized vector copy of fun environment variables could be broken, but the disassembly looks okay. :-|
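
One way to compare what the two ESXi versions expose to the guest is to list the AVX512 feature flags the guest kernel reports. This is a minimal sketch, assuming a Linux guest whose /proc/cpuinfo contains a `flags` line; the helper name `avx512_flags` is made up for illustration and is not part of OTP or RabbitMQ.

```python
def avx512_flags(cpuinfo_text):
    """Return the sorted AVX512 feature flags found in /proc/cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only the "flags" lines list CPU features; every core repeats the line.
        if sep and key.strip() == "flags":
            found.update(f for f in value.split() if f.startswith("avx512"))
    return sorted(found)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            print(avx512_flags(fh.read()))
    except OSError as exc:
        print(f"could not read /proc/cpuinfo: {exc}")
```

Running it inside a guest on each hypervisor version and diffing the two lists would show which AVX512 subsets differ.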

lhoguin commented 2 weeks ago

We have managed to get rr to run. Since this is a virtualised environment, could you give me a command to try, to make sure it is working, before we schedule anything?

jhogberg commented 2 weeks ago

Try cerl -rr. You may get an error mentioning kernel.perf_event_paranoid or similar; if that happens, do as it suggests and try again.
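
The sysctl mentioned above can be checked ahead of time. This is a minimal sketch, assuming a Linux guest where the setting is exposed at /proc/sys/kernel/perf_event_paranoid. The threshold of 1 is an assumption based on what rr usually asks for, and both function names are illustrative. It only checks the sysctl, not whether the hypervisor exposes hardware performance counters.

```python
from pathlib import Path

PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid_level(path=PARANOID_PATH):
    """Return the integer perf_event_paranoid level, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def rr_likely_allowed(level, max_level=1):
    """Assumption: rr wants a level of max_level or lower; adjust for your rr build."""
    return level is not None and level <= max_level


if __name__ == "__main__":
    level = read_paranoid_level()
    print(f"perf_event_paranoid={level}, ok for rr: {rr_likely_allowed(level)}")
```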

lhoguin commented 2 weeks ago

Thanks. I don't think I need to try that as I tried this earlier and it didn't work:

$ ./rr-x86-64 record -n ls
rr: Saving execution to trace directory `/home/bitnami/.local/share/rr/ls-1'.
[FATAL src/PerfCounters.cc:331:start_counter()] Unable to open performance counter with 'perf_event_open'; are hardware perf events available? See https://github.com/rr-debugger/rr/wiki/Will-rr-work-on-my-system

So now I am asking around to see if this is something we can enable or if we're stuck.

jhogberg commented 2 weeks ago

ESXi is supposed to support virtualizing CPU performance counters since at least a major version back, so that should be possible.

lhoguin commented 2 weeks ago

Yes, thanks. We enabled them last night and are now a step further, but it is not working yet. I will keep you informed when we are successful!

lhoguin commented 2 weeks ago

We now have rr working, but cerl -rr does not. So far we have done this as far as OTP is concerned:

That means I have ~/otp/bin/cerl whereas I currently run RabbitMQ through ~/otp_install/bin/erl. I tried replacing this in the start script with ~/otp/bin/cerl but it fails to open a port (driver) at the start. I tried moving cerl to ~/otp_install/bin but that didn't work either. Of course rr record erl records the wrong process (if it records anything at all as I don't have output in .local/share/rr somehow). What are the steps I need to take to get cerl working properly in a way that I can start it from the RabbitMQ start script?

We're almost there, thanks for the assistance!

lhoguin commented 2 weeks ago

If you want to just take it from there, say so and we'll send an email for the screen sharing session.

jhogberg commented 2 weeks ago

That's fine, cerl is just a wrapper script for convenience. Let's go ahead with the session :-)

jhogberg commented 1 week ago

We've determined that this is likely caused by a hypervisor bug relating to AVX512, which we use in one single spot: the emit_copy_words helper routine used for copying fun environments and tuples.

The AVX512 variant worked fine on ESXi 7, but the exact same (generated) code caused crashes on ESXi 8. Disabling AVX512 makes it work on ESXi 8. The nature of the corruption -- always 4 successive elements being inexplicably zero, and never the first 3 elements -- suggests that the upper 256 bits of a 512-bit register are not properly saved and/or restored.
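
The symptom can be illustrated with a toy model of a 512-bit register as eight 64-bit words. This is only a sketch of the reasoning above, not the actual emit_copy_words code. The function name `copy_via_zmm` and the `upper_half_lost` switch are hypothetical and exist only to show why losing bits 256..511 appears as four consecutive zero words.

```python
ZMM_WORDS = 8          # 512 bits / 64 bits per word
UPPER_HALF_START = 4   # words 4..7 live in bits 256..511


def copy_via_zmm(src, upper_half_lost=False):
    """Copy eight words through a modeled 512-bit register.

    upper_half_lost simulates the register's upper 256 bits being cleared
    between the load and the store (for example by a missing save/restore).
    """
    if len(src) != ZMM_WORDS:
        raise ValueError("expected exactly 8 words")
    reg = list(src)  # models: vmovups zmm0, [src]
    if upper_half_lost:
        # Bits 256..511 are zeroed; the lower four words survive.
        reg[UPPER_HALF_START:] = [0] * (ZMM_WORDS - UPPER_HALF_START)
    return reg       # models: vmovups [dst], zmm0


if __name__ == "__main__":
    words = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]
    print([hex(w) for w in copy_via_zmm(words)])
    print([hex(w) for w in copy_via_zmm(words, upper_half_lost=True)])
```

In this model the lower four words always survive and the upper four all become zero together, which matches the pattern of successive zeroed elements described above.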

We'll close this ticket for now as it'll be tossed over the fence to the VMware folks, hopefully they can figure out what's going on.

lhoguin commented 1 week ago

Thank you very much for your assistance!! We will keep you informed.

lhoguin commented 1 week ago

The analysis was correct!

On the guest side, a fast switch may occur while AVX512 is in use. The fast switch does not save/restore the FPU state. The use of 256-bit AVX on the hypervisor side may then reset the upper half of the AVX512 register zmm0, leading to corruption in the guest when it resumes.

In our case the fast switch was for vSAN-related operations, and the vSAN memory code was the one using 256-bit AVX and corrupting the memory.

For what it's worth, the fix is already available in ESXi 8.0.3. We weren't the first to run into this.

Thanks again for all the help! It was a fun one. Cheers.