DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform

[jdk8] SPECjvm 2008 tests won't run #3733

Open rkgithubs opened 5 years ago

rkgithubs commented 5 years ago

We are seeing that SPECjvm 2008 runs won't even start the warm-up phase when launched with drrun. A typical SPECjvm run looks like this:

/home/rahul/jdk1.8.0_201/bin/java -jar SPECjvm2008.jar -ikv -wt 15 -it 30 -bt 2 scimark.sparse.small

SPECjvm2008 Peak
  Properties file:   none
  Benchmarks:        scimark.sparse.small

With drrun we never get to this first message. I do see two threads running for a short period, but I am not convinced the run is successful since it never gets to the warm-up and execution phases of the test. Memory utilization is also roughly 11GB, which is quite high for sparse.small.

/root/rahul/DynamoRIO-x86_64-Linux-7.90.18019-0/bin64/drrun -s 60 -debug -loglevel 3 -vm_size 1G -no_enable_reset -disable_traces -- ~/rahul/jdk1.8.0_201/bin/java -jar SPECjvm2008.jar -ikv -wt 15 -it 30 -bt 2 scimark.sparse.small

<log dir=/root/rahul/DynamoRIO-x86_64-Linux-7.90.18019-0/bin64/../logs/java.59563.00000000>

<Starting application /root/rahul/jdk1.8.0_201/bin/java (59563)>
<Initial options = -no_dynamic_options -loglevel 3 -code_api -stack_size 56K -signal_stack_size 32K -disable_traces -no_enable_traces -max_elide_jmp 0 -max_elide_call 0 -no_shared_traces -bb_ibl_targets -no_shared_trace_ibl_routine -no_enable_reset -no_reset_at_switch_to_os_at_vmm_limit -reset_at_vmm_percent_free_limit 0 -no_reset_at_vmm_full -reset_at_commit_free_limit 0K -reset_every_nth_pending 0 -vm_size 1048576K -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<Paste into GDB to debug DynamoRIO clients:
set confirm off
add-symbol-file '/root/rahul/DynamoRIO-x86_64-Linux-7.90.18019-0/lib64/debug/libdynamorio.so' 0x00007f2e11bd7580
>
<curiosity: rex.w on OPSZ_6_irex10_short4!>
<(1+x) Handling our fault in a TRY at 0x00007f2e11e20d7c>
<spurious rep/repne prefix @0x00007f2e11994f96 (f2 41 ff e3): >
<writing to executable region.>
<get_memory_info mismatch! (can happen if os combines entries in /proc/pid/maps)
        os says: 0x00000000491dc000-0x0000000089042000 prot=0x00000000
        cache says: 0x000000004904e000-0x0000000089042000 prot=0x00000000
>

Attached the debug log (loglevel 3) for the java pid: java.log.zip

java.0.59824.zip

derekbruening commented 3 years ago

translate_mcontext fails, we then try recreate_app_pc, and it returns 0x0 here. Looks like the issue is similar to #307

Looks like #307 is specific to trace building: but I thought you were running with -disable_traces? #307 should be impossible with -disable_traces.

kuhanov commented 3 years ago

-disable_traces

yes, we use '-disable_traces' in all our runs under DynamoRIO.

kuhanov commented 3 years ago

To summarize: is it possible to skip setting FAKE_TAG or not? Is the workflow correct after that? Do we just spend more time on searching?

I believe removing that FAKE_TAG needs a corresponding change to add tag clearing once all threads have exited the cache in the later stages of shared fragment deletion. Without that it will keep missing in the table, causing potentially severe performance problems.

OK. So we disable it for our experiments to exclude crashes for now. It's sad that performance suffers. Kirill

derekbruening commented 3 years ago

-disable_traces

yes, we use '-disable_traces' in all our runs under DynamoRIO.

OK so the translation failure would be different from #307. Would it be possible to get further information on the failure? Is it inside selfmod mangling?

derekbruening commented 3 years ago

I wonder if you could disable this line in fragment_prepare_for_removal_from_table():

        ftable->table[hindex].start_pc_fragment = pending_delete_pc;

And keep the line that disables the lookup (ftable->table[hindex].tag_fragment = FAKE_TAG;). Then lookups are disabled and the only issue is a thread that just did the lookup and is about to jump to the target fragment in the cache. But that thread could already be inside the target anyway. I don't recall a scheme where a synch is done and a thread left alone at that point in the IBL while the fragment is truly deleted from the cache: the thread would be translated somewhere fresh.

Does that solve the crashes without affecting performance?

derekbruening commented 3 years ago

I filed the target_delete bug separately as #5061 to better track its fix

kuhanov commented 3 years ago

I wonder if you could disable this line in fragment_prepare_for_removal_from_table():

        ftable->table[hindex].start_pc_fragment = pending_delete_pc;

And keep the line that disables the lookup (ftable->table[hindex].tag_fragment = FAKE_TAG;). Then lookups are disabled and the only issue is a thread that just did the lookup and is about to jump to the target fragment in the cache. But that thread could already be inside the target anyway. I don't recall a scheme where a synch is done and a thread left alone at that point in the IBL while the fragment is truly deleted from the cache: the thread would be translated somewhere fresh.

Does that solve the crashes without affecting performance?

No crashes and no performance degradation with:

//        ftable->table[hindex].start_pc_fragment = pending_delete_pc;
        ftable->table[hindex].tag_fragment = FAKE_TAG;

Thanks a lot
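
For readers following the FAKE_TAG discussion, here is a minimal sketch of the variant that was tested. It is a toy model only: `toy_entry_t`, `toy_lookup`, `toy_prepare_for_removal`, `TOY_FAKE_TAG` and the linear scan are hypothetical stand-ins chosen for illustration, not DynamoRIO's real table code. The point it shows is that overwriting the tag alone is enough to make lookups miss, while the stored entry point stays as it was.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not DynamoRIO's real definitions. */
#define TOY_FAKE_TAG ((uintptr_t)-1)
#define TOY_TABLE_SIZE 8

typedef struct {
    uintptr_t tag_fragment;      /* application address used as the lookup key */
    uintptr_t start_pc_fragment; /* code cache entry point for that tag */
} toy_entry_t;

/* Returns the cache entry point for tag, or 0 on a miss.
 * A linear scan stands in for the real hashed lookup. */
uintptr_t
toy_lookup(const toy_entry_t *table, uintptr_t tag)
{
    for (size_t i = 0; i < TOY_TABLE_SIZE; i++) {
        if (table[i].tag_fragment == tag)
            return table[i].start_pc_fragment;
    }
    return 0;
}

/* Mirrors the tested variant: hide the entry from lookups but leave
 * start_pc_fragment untouched. */
void
toy_prepare_for_removal(toy_entry_t *table, size_t hindex)
{
    /* table[hindex].start_pc_fragment = pending_delete_pc;   (disabled) */
    table[hindex].tag_fragment = TOY_FAKE_TAG;
}
```

After `toy_prepare_for_removal(table, i)`, a lookup of the old tag misses, and a thread that had already read `start_pc_fragment` before the removal still holds the original entry point.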

kuhanov commented 3 years ago

further

master_signal_handler_C()->record_pending_signal()-> translate_sigcontext()->translate_mcontext()->recreate_app_state()->recreate_app_state_internal()->recreate_app_state_from_ilist

In the bad case we walk through the instruction list of the fragment and miss the target pc:

recreate_app : looking for 0x00007f6f9d6fc327 in frag @ 0x00007f6f9d6fc31d (tag 0x00007f6d951c757c)
cache pc 0x00007f6f9d6fc31d vs 0x00007f6f9d6fc327
cache pc 0x00007f6f9d6fc326 vs 0x00007f6f9d6fc327
cache pc 0x00007f6f9d6fc330 vs 0x00007f6f9d6fc327
recreate_app -- WARNING: cache pc 0x00007f6f9d6fc330 != 0x00007f6f9d6fc327, probably prefix instruction
recreate_app -- found valid state pc 0x00007f6d951c757c
recreate_app -- found ok pc 0x00007f6d951c757c

BUT if we print the instructions from gdb for both the frag @ 0x00007f6f9d6fc31d and the tag 0x00007f6d951c757c, we can see that the step between instructions is not the same as what we have inside DynamoRIO (probably an incorrect instruction length calculation: cache pc 0x00007f6f9d6fc31d is 'movabs $0x7f6fe08a1000,%r10' and DRIO calculates a size of 9 bytes, BUT in reality it takes 10 bytes). And we see 0x00007f6f9d6fc327 in the debugger:

(gdb) x /5i 0x00007f6f9d6fc31d
   0x7f6f9d6fc31d:      movabs $0x7f6fe08a1000,%r10
   0x7f6f9d6fc327:      test   %eax,(%r10)
   0x7f6f9d6fc32a:      cmp    %ebx,%r8d
   0x7f6f9d6fc32d:      jge    0x7f6f9d9a04cc
   0x7f6f9d6fc333:      jmpq   0x7f6f9d9a04cc
(gdb) x /5i 0x00007f6d951c757c
   0x7f6d951c757c:      movabs $0x7f6fe08a1000,%r10
   0x7f6d951c7586:      test   %eax,(%r10)
   0x7f6d951c7589:      cmp    %ebx,%r8d
   0x7f6d951c758c:      jge    0x7f6d951c75a9
   0x7f6d951c758e:      mov    0x10(%rdi),%r9

Kirill
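
The mismatch above can be reproduced on paper. The following is a minimal sketch, assuming a simplified model of the walk: the helper `walk_to_target` is hypothetical and is not DynamoRIO code. It adds up per-instruction lengths from the fragment start and reports whether the target pc falls exactly on an instruction boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Walk instruction lengths from frag_start.
 * Returns the index of the instruction that starts exactly at target,
 * or -1 if the walk steps past target (or runs out of instructions)
 * without landing on it. */
int
walk_to_target(uint64_t frag_start, const int *lens, size_t n, uint64_t target)
{
    uint64_t pc = frag_start;
    for (size_t i = 0; i < n; i++) {
        if (pc == target)
            return (int)i;
        if (pc > target)
            return -1;
        pc += (uint64_t)lens[i];
    }
    return -1;
}
```

With the lengths gdb implies for the cache copy (10, 3, 3, 6) starting at 0x7f6f9d6fc31d, the target 0x7f6f9d6fc327 is the second instruction (index 1). With the lengths 9 and 10 from the DynamoRIO log, the walk visits 0x...326 and 0x...330 and never lands on 0x...327, which is the "probably prefix instruction" warning.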

derekbruening commented 3 years ago

If you could print the raw bytes for both of the movabs instructions (sthg I wish gdb would do by default...disas/r does it for a function) -- is there a non-standard prefix on the app version?

derekbruening commented 3 years ago

And if possible print the instr_t for the movabs instr in the recreated instrlist. Need to find what is causing the different encodings.

derekbruening commented 3 years ago

Is this fragment using stored translations, or did it recreate the list? (More log output should show this.) If it's stored translations this may be a bug in those and have nothing to do with the decoder/encoder.

kuhanov commented 3 years ago

is there a non-standard prefix on the app version?


(gdb) x /10i 0x00007fbf3678fbf1
   0x7fbf3678fbf1:      mov    %rbp,0x10(%rbx)
   0x7fbf3678fbf5:      mov    %rbx,%r10
   0x7fbf3678fbf8:      shr    $0x9,%r10
   0x7fbf3678fbfc:      movabs $0x7f7f985af000,%r11
   0x7fbf3678fc06:      movb   $0x0,(%r11,%r10,1)
   0x7fbf3678fc0b:      mov    %rbx,%rax
   0x7fbf3678fc0e:      add    $0x40,%rsp
   0x7fbf3678fc12:      pop    %rbp
   0x7fbf3678fc13:      movabs $0x7fbf794af000,%r10
   0x7fbf3678fc1d:      test   %eax,(%r10)
(gdb) x /44b  0x00007fbf3678fbf1
0x7fbf3678fbf1: 0x48    0x89    0x6b    0x10    0x4c    0x8b    0xd3    0x49
0x7fbf3678fbf9: 0xc1    0xea    0x09    0x49    0xbb    0x00    0xf0    0x5a
0x7fbf3678fc01: 0x98    0x7f    0x7f    0x00    0x00    0x43    0xc6    0x04
0x7fbf3678fc09: 0x13    0x00    0x48    0x8b    0xc3    0x48    0x83    0xc4
0x7fbf3678fc11: 0x40    0x5d    0x49    0xba    0x00    0xf0    0x4a    0x79
0x7fbf3678fc19: 0xbf    0x7f    0x00    0x00

(gdb) x /10i 0x00007fbd2d1c8e04
   0x7fbd2d1c8e04:      mov    %rbp,0x10(%rbx)
   0x7fbd2d1c8e08:      mov    %rbx,%r10
   0x7fbd2d1c8e0b:      shr    $0x9,%r10
   0x7fbd2d1c8e0f:      movabs $0x7f7f985af000,%r11
   0x7fbd2d1c8e19:      movb   $0x0,(%r11,%r10,1)
   0x7fbd2d1c8e1e:      mov    %rbx,%rax
   0x7fbd2d1c8e21:      add    $0x40,%rsp
   0x7fbd2d1c8e25:      pop    %rbp
   0x7fbd2d1c8e26:      movabs $0x7fbf794af000,%r10
   0x7fbd2d1c8e30:      test   %eax,(%r10)
(gdb)  x /44b 0x7fbd2d1c8e04
0x7fbd2d1c8e04: 0x48    0x89    0x6b    0x10    0x4c    0x8b    0xd3    0x49
0x7fbd2d1c8e0c: 0xc1    0xea    0x09    0x49    0xbb    0x00    0xf0    0x5a
0x7fbd2d1c8e14: 0x98    0x7f    0x7f    0x00    0x00    0x43    0xc6    0x04
0x7fbd2d1c8e1c: 0x13    0x00    0x48    0x8b    0xc3    0x48    0x83    0xc4
0x7fbd2d1c8e24: 0x40    0x5d    0x49    0xba    0x00    0xf0    0x4a    0x79
0x7fbd2d1c8e2c: 0xbf    0x7f    0x00    0x00
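
Both dumps hold the same bytes for the two movabs instructions (49 bb ... and 49 ba ...), so there is no extra prefix on either copy. As a cross-check of the 10-byte length, here is a minimal sketch of decoding this one encoding form by hand (REX.W prefix, opcode B8+r, 8-byte little-endian immediate). The function name `decode_mov_imm64` is made up for illustration, and it deliberately handles only this form.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode "REX.W + (B8+r) imm64", which gdb prints as movabs $imm,%reg.
 * On success stores the register number (0-15) and immediate and
 * returns the instruction length (always 10); returns 0 otherwise. */
size_t
decode_mov_imm64(const uint8_t *b, size_t avail, int *reg, uint64_t *imm)
{
    if (avail < 10)
        return 0;
    /* REX prefix is 0100WRXB; W must be 1 for a 64-bit immediate. */
    if ((b[0] & 0xf8) != 0x48)
        return 0;
    /* Opcode B8..BF: low 3 bits select the register. */
    if ((b[1] & 0xf8) != 0xb8)
        return 0;
    /* REX.B supplies the fourth register bit (r8-r15). */
    *reg = (b[1] & 0x7) | ((b[0] & 0x1) << 3);
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | b[2 + i]; /* little-endian immediate */
    *imm = v;
    return 10;
}
```

Feeding it the bytes at 0x7fbf3678fbfc (49 bb 00 f0 5a 98 7f 7f 00 00) yields register 11 and immediate 0x7f7f985af000, matching gdb's `movabs $0x7f7f985af000,%r11`.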

kuhanov commented 3 years ago

instr in the recreated instrlist.

inst->bytes is a NULL pointer in the bad case - no bytes

kuhanov commented 3 years ago

there are cases when we calculate size of movabs correctly (inst->bytes is NULL too)

cache pc 0x00007fbb58cf79dd vs 0x00007fbb58cf79fd 2 0x00007fbb9b2788e7
cache pc 0x00007fbb58cf79df vs 0x00007fbb58cf79fd 3 0x00007fbb9b2788e9
cache pc 0x00007fbb58cf79e2 vs 0x00007fbb58cf79fd 4 0x00007fbb9b2788ec
**_cache pc 0x00007fbb58cf79e6 vs 0x00007fbb58cf79fd size 10 inst->bytes 0x0000000000000000_**
cache pc 0x00007fbb58cf79f0 vs 0x00007fbb58cf79fd 3 0x00007fbb9b2788f7
cache pc 0x00007fbb58cf79f3 vs 0x00007fbb58cf79fd 3 0x00007fbb9b2788fa
cache pc 0x00007fbb58cf79f6 vs 0x00007fbb58cf79fd 4 0x00007fbb9b2788fd
cache pc 0x00007fbb58cf79fa vs 0x00007fbb58cf79fd 3 0x00007fbb9b278901
cache pc 0x00007fbb58cf79fd vs 0x00007fbb58cf79fd 6 0x00007fbb9b278904
2 recreate_app -- found valid state pc 0x00007fbb9b278904
1 recreate_app -- found ok pc 0x00007fbb9b278904

(gdb) x /10i 0x00007fbb58cf79dd
   0x7fbb58cf79dd:      mov    %eax,%eax
   0x7fbb58cf79df:      and    %rbx,%rax
   0x7fbb58cf79e2:      mov    %rax,-0x18(%rbp)
   0x7fbb58cf79e6:      movabs $0x7fbb9c59bf90,%rax
   0x7fbb58cf79f0:      mov    (%rax),%rax
   0x7fbb58cf79f3:      mov    %rax,%rdx
   0x7fbb58cf79f6:      mov    -0x18(%rbp),%rax
   0x7fbb58cf79fa:      add    %rdx,%rax
   0x7fbb58cf79fd:      movl   $0x1,(%rax)
   0x7fbb58cf7a03:      add    $0x28,%rsp

The problem happens when all instructions have a NULL pointer for bytes:

recreate_app : looking for 0x00007fbb598e42d8 in frag @ 0x00007fbb598e42b9 (tag 0x00007fb9511ce270)
                                                                                         size           inst->bytes
cache pc 0x00007fbb598e42b9 vs 0x00007fbb598e42d8 9 0x0000000000000000
cache pc 0x00007fbb598e42c2 vs 0x00007fbb598e42d8 10 0x0000000000000000
cache pc 0x00007fbb598e42cc vs 0x00007fbb598e42d8 2 0x0000000000000000
cache pc 0x00007fbb598e42ce vs 0x00007fbb598e42d8 3 0x0000000000000000
cache pc 0x00007fbb598e42d1 vs 0x00007fbb598e42d8 2 0x0000000000000000
cache pc 0x00007fbb598e42d3 vs 0x00007fbb598e42d8 9 0x0000000000000000
cache pc 0x00007fbb598e42dc vs 0x00007fbb598e42d8 5 0x0000000000000000
recreate_app -- WARNING: cache pc 0x00007fbb598e42dc != 0x00007fbb598e42d8, probably prefix instruction
recreate_app -- invalid state: unsup=1 in-mangle=1 xl8=0x00007fb9511ce270 walk=0x00007fb9511ce270
recreate_app -- not able to fully recreate context, pc is in added instruction from mangling
1 recreate_app -- found ok pc 0x00007fb9511ce270

derekbruening commented 3 years ago

Is this fragment using stored translations, or did it recreate the list? (More log output should show this.) If it's stored translations this may be a bug in those and have nothing to do with the decoder/encoder.

Answering my own question: since you have an instrlist, it must not be using stored info, and you even listed recreate_app_state_from_ilist above.

Re: instr_t.bytes being NULL: I don't think that means much: for re-created-ilist app instrs that is probably what would be expected. It's instr_t.translation that would point to the original app encoding. For synthetic instrs there are cases that do not cache the encoding so again bytes being NULL is not necessarily an indication that something is wrong.

I was asking to dump all the fields of the instr_t for the one that has the length of 9.

kuhanov commented 3 years ago

I was asking to dump all the fields of the instr_t for the one that has the length of 9.

3 examples

(gdb) print *inst
$1 = {flags = 2149646336, encoding_hints = 0, length = 0, {bytes = 0x0, label_cb = 0x0}, translation = 0x7ffdb0df34cc "\305\370wALJ\b\003", opcode = 56, rip_rel_pos = 0 '\000', num_dsts = 1 '\001', num_srcs = 1 '\001', {{src0 = {
        kind = 5 '\005', size = 0 '\000', aux = {far_pc_seg_selector = 0, segment = 0, disp = 0, shift = 0, flags = 0}, value = {immed_int = 2, immed_int_multi_part = {low = 2, high = 0}, immed_float = 2.80259693e-45,
          immed_double = 9.8813129168249309e-324, pc = 0x2 <error: Cannot access memory at address 0x2>, instr = 0x2, reg = 2, base_disp = {disp = 2, base_reg = 0, index_reg = 0, scale = 0 '\000', encode_zero_disp = 0 '\000',
            force_full_disp = 0 '\000', disp_short_addr = 0 '\000', index_reg_is_zmm = 0 '\000'}, addr = 0x2}}, srcs = 0x0, dsts = 0x7ffdb943da28}, label_data = {data = {5, 2, 0, 140727711685160}}}, prefixes = 0, eflags = 0,
  note = 0x0, prev = 0x0, next = 0x7ffdb943ca70}
(gdb) x /i 0x7ffdb0df34cc
   0x7ffdb0df34cc:      vzeroupper
(gdb) x /i 0x00007fffb51fc781
   0x7fffb51fc781:      vzeroupper
(gdb) print len
$2 = 9

(gdb) x /3i 0x7fffb51fc781
   0x7fffb51fc781:      vzeroupper
   0x7fffb51fc784:      movl   $0x5,0x308(%r15)
   0x7fffb51fc78f:      mov    %r15d,%ecx
(gdb) x /4i 0x00007fffb4cc0ac1
   0x7fffb4cc0ac1:      mov    %eax,-0x16000(%rsp)
   0x7fffb4cc0ac8:      push   %rbp
   0x7fffb4cc0ac9:      mov    %rsp,%rbp
   0x7fffb4cc0acc:      sub    $0x10,%rsp
(gdb) x /4i 0x7ffdb0dd82b0
   0x7ffdb0dd82b0:      mov    %eax,-0x16000(%rsp)
   0x7ffdb0dd82b7:      push   %rbp
   0x7ffdb0dd82b8:      mov    %rsp,%rbp
   0x7ffdb0dd82bb:      sub    $0x10,%rsp

(gdb) print *inst
$3 = {flags = 2149646336, encoding_hints = 0, length = 0, {bytes = 0x0, label_cb = 0x0}, translation = 0x7ffdb0dd82b0 "\211\204$", opcode = 56, rip_rel_pos = 0 '\000',
  num_dsts = 1 '\001', num_srcs = 1 '\001', {{src0 = {kind = 5 '\005', size = 0 '\000', aux = {far_pc_seg_selector = 0, segment = 0, disp = 0, shift = 0, flags = 0},
        value = {immed_int = 2, immed_int_multi_part = {low = 2, high = 0}, immed_float = 2.80259693e-45, immed_double = 9.8813129168249309e-324,
          pc = 0x2 <error: Cannot access memory at address 0x2>, instr = 0x2, reg = 2, base_disp = {disp = 2, base_reg = 0, index_reg = 0, scale = 0 '\000',
            encode_zero_disp = 0 '\000', force_full_disp = 0 '\000', disp_short_addr = 0 '\000', index_reg_is_zmm = 0 '\000'}, addr = 0x2}}, srcs = 0x0,
      dsts = 0x7ffdb87183c0}, label_data = {data = {5, 2, 0, 140727697900480}}}, prefixes = 0, eflags = 0, note = 0x0, prev = 0x0, next = 0x7ffdba712a68}
(gdb) print len
$4 = 9
(gdb) print *inst
$1 = {flags = 2149646336, encoding_hints = 0, length = 0, {bytes = 0x0, label_cb = 0x0}, translation = 0x7ffff6d78bbb "H\215\005\246ƺ", opcode = 57,
  rip_rel_pos = 0 '\000', num_dsts = 1 '\001', num_srcs = 1 '\001', {{src0 = {kind = 1 '\001', size = 6 '\006', aux = {far_pc_seg_selector = 0, segment = 0, disp = 0,
          shift = 0, flags = 0}, value = {immed_int = 140737346949736, immed_int_multi_part = {low = -141405592, high = 32767}, immed_float = -5.9355214e+33,
          immed_double = 6.9533488214704906e-310, pc = 0x7ffff7925268 "", instr = 0x7ffff7925268, reg = 21096, base_disp = {disp = -141405592, base_reg = 255,
            index_reg = 127, scale = 0 '\000', encode_zero_disp = 0 '\000', force_full_disp = 0 '\000', disp_short_addr = 0 '\000', index_reg_is_zmm = 0 '\000'},
          addr = 0x7ffff7925268}}, srcs = 0x0, dsts = 0x7ffdb45e2bf8}, label_data = {data = {1537, 140737346949736, 0, 140727629523960}}}, prefixes = 0, eflags = 0,
  note = 0x0, prev = 0x0, next = 0x7ffdb45e21e0}
(gdb) x /3i 0x7ffff6d78bbb
   0x7ffff6d78bbb:      lea    0xbac6a6(%rip),%rax        # 0x7ffff7925268
   0x7ffff6d78bc2:      mov    (%rax),%rax
   0x7ffff6d78bc5:      test   %rax,%rax
(gdb) print len
$3 = 10

RabbitPowerr commented 3 years ago

In addition, it looks like a bug in the third case: the opcode of the instruction is 57 (OP_mov_imm), but why not 61 (OP_lea)? And what is the difference between OP_mov_ld and OP_mov_st? In our cases the opcode is frequently OP_mov_st or OP_mov_ld when we see DR crash.

derekbruening commented 3 years ago

I don't understand the output in the prior comment: the cases above where the size is wrong (9 instead of 10 bytes) involve movabs instructions like movabs $0x7fbb9c59bf90,%rax. But the 3 cases at https://github.com/DynamoRIO/dynamorio/issues/3733#issuecomment-909013554 are vzeroupper, a store mov %eax,-0x16000(%rsp), and a lea: none of which seem related to the problem we're trying to debug?

kuhanov commented 3 years ago

I don't understand the output in the prior comment: the cases above where the size is wrong (9 instead of 10 bytes) involve movabs instructions like movabs $0x7fbb9c59bf90,%rax. But the 3 cases at #3733 (comment) are vzeroupper, a store mov %eax,-0x16000(%rsp), and a lea: none of which seem related to the problem we're trying to debug?

These are the same issue. We have the same crash, the same incorrect size calculation, the same NULL bytes for instructions. Kirill

kuhanov commented 3 years ago

For example, in the lea case gdb shows that the instruction takes 7 bytes, but the DRIO calculation gives len=10. Kirill

derekbruening commented 3 years ago

What is this len variable -- what is the callstack? What are the raw instruction bytes for these cases?

kuhanov commented 3 years ago

What is this len variable -- what is the callstack? What are the raw instruction bytes for these cases?

len is inside recreate_app_state_from_ilist

    for (inst = instrlist_first(ilist); inst; inst = instr_get_next(inst)) {
        int len = instr_length(tdcontext, inst);

kuhanov commented 3 years ago

What are the raw instruction bytes for these cases?

Or we missed something there. Let us rerun the benchmark and prepare another sample. Kirill

RabbitPowerr commented 3 years ago

DynamoRIO output

cache pc 0x00007fffb4f44385 vs 0x00007fffb4f44394 INST_LEN = 9  ORIGINAL = 0x0000000000000000  THREAD = 0x00000000000052c0
cache pc 0x00007fffb4f4438e vs 0x00007fffb4f44394 INST_LEN = 10  ORIGINAL = 0x0000000000000000  THREAD = 0x00000000000052c0
cache pc 0x00007fffb4f44398 vs 0x00007fffb4f44394 INST_LEN = 2  ORIGINAL = 0x0000000000000000  THREAD = 0x00000000000052c0

Gdb output

   0x7fffb4f44385:      add    $0x40,%rsp
   0x7fffb4f44389:      pop    %rbp
   0x7fffb4f4438a:      movabs $0x7ffff7da8000,%r10
   0x7fffb4f44394:      test   %eax,(%r10)

Raw bytes

0x7fffb4f44385: 0x48    0x83    0xc4    0x40    0x5d    0x49    0xba    0x00
0x7fffb4f4438d: 0x80    0xda    0xf7    0xff    0x7f    0x00    0x00    0x41
0x7fffb4f44395: 0x85    0x02

First suspicious instr

$1 = {flags = 2149646336, encoding_hints = 0, length = 0, {bytes = 0x0, label_cb = 0x0}, translation = 0x7ffdb0d8cf69 "H\203\304@]I\272", opcode = 56, rip_rel_pos = 0 '\000', num_dsts = 1 '\001', num_srcs = 1 '\001', {{src0 = {
        kind = 5 '\005', size = 0 '\000', aux = {far_pc_seg_selector = 0, segment = 0, disp = 0, shift = 0, flags = 0}, value = {immed_int = 2, immed_int_multi_part = {low = 2, high = 0}, immed_float = 2.80259693e-45,
          immed_double = 9.8813129168249309e-324, pc = 0x2 <error: Cannot access memory at address 0x2>, instr = 0x2, reg = 2, base_disp = {disp = 2, base_reg = 0, index_reg = 0, scale = 0 '\000', encode_zero_disp = 0 '\000',
            force_full_disp = 0 '\000', disp_short_addr = 0 '\000', index_reg_is_zmm = 0 '\000'}, addr = 0x2}}, srcs = 0x0, dsts = 0x7ffdb967a498}, label_data = {data = {5, 2, 0, 140727714030744}}}, prefixes = 0, eflags = 0,
  note = 0x0, prev = 0x0, next = 0x7ffdb9679ac0}

derekbruening commented 3 years ago

operation code of instruction = 57 (OP_mov_imm) , but why not 61(OP_lea)?

A rip-rel lea is mangled into mov_imm when the rip-rel doesn't reach from the code cache. The issue may involve inconsistent mangling.

The other cases are all DR-inserted mangling (hence why no bytes value):

(gdb) p /x 2149646336
$1 = 0x80210000

=> INSTR_OUR_MANGLING

56 = OP_mov_st, 2 == DR_REG_RCX

So a spill. Dest is out-of-line but presumably TLS.
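
The flag arithmetic above can be checked mechanically. This is a minimal sketch, assuming that INSTR_OUR_MANGLING is the top bit (0x80000000). That value is an assumption here and should be confirmed against instr.h in the DynamoRIO tree being debugged. The macro and function names are hypothetical.

```c
#include <stdint.h>

/* Assumed value of the mangling flag; verify against your DynamoRIO sources. */
#define ASSUMED_INSTR_OUR_MANGLING 0x80000000u

/* Returns nonzero if the (assumed) our-mangling bit is set in instr_t.flags. */
int
flags_have_our_mangling(uint32_t flags)
{
    return (flags & ASSUMED_INSTR_OUR_MANGLING) != 0;
}
```

Under that assumption, `flags_have_our_mangling(2149646336u)` is true for all three dumped instr_t values, since 2149646336 is 0x80210000.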

So the recreated instrlist has extra/different instructions and that's why sizes do not match b/c it's looking at different instruction sequences? So it's not anything with decoder/encoder sizes, it's different instructions. Is it inconsistent rip-rel mangling, or is it more than that -- Is this jitted app code and sthg changed but wasn't detected properly?

Dumping the full recreated instrlist would be helpful: instrlist_disassemble(), and comparing to the full block's original app code and full fragment in the cache.

kuhanov commented 3 years ago

Dumping the full recreated instrlist would be helpful: instrlist_disassemble(), and comparing to the full block's original app code and full fragment in the cache.

One strange thing is that instrlist_disassemble doesn't stop at the jmp: +35 L4 @0x00007f531de07958 e9 1f e2 3b ef jmp $0x00007f530d1da134

recreate_app : looking for 0x00007f5517e73d58 in frag @ 0x00007f5517e73d49 (tag 0x00007f530d1da134)
TAG  0x00007f530d1da134
 +0    m4 @0x00007f531de09cd0  65 48 89 0c 25 10 00 mov    %rcx -> %gs:0x10[8byte]
                               00 00
 +9    m4 @0x00007f531de08d30  48 b9 00 00 00 00 00 mov    $0x0000000000000000 -> %rcx
                               00 00 00
 +19   m4 @0x00007f531de08ac8  ff 01                inc    (%rcx)[4byte] -> (%rcx)[4byte]
 +21   m4 @0x00007f531de05378  83 39 14             cmp    (%rcx)[4byte] $0x00000014
 +24   m4 @0x00007f531de07048  7c fe                jl     @0x00007f531de075f0[8byte]
 +26   m4 @0x00007f531de08cb0  65 48 8b 0c 25 10 00 mov    %gs:0x10[8byte] -> %rcx
                               00 00
 +35   L4 @0x00007f531de07958  e9 1f e2 3b ef       jmp    $0x00007f530d1da134
 +40   m4 @0x00007f531de075f0                       <label>
 +40   m4 @0x00007f531de07d28  65 48 89 34 25 08 00 mov    %rsi -> %gs:0x08[8byte]
                               00 00
 +49   m4 @0x00007f5320906040  65 48 89 3c 25 18 00 mov    %rdi -> %gs:0x18[8byte]
                               00 00
 +58   m4 @0x00007f531de04b58  48 be 34 a1 1d 0d 53 mov    $0x00007f530d1da134 -> %rsi
                               7f 00 00
 +68   m4 @0x00007f53209060a8  48 bf 34 a1 1d 0d 53 mov    $0x00007f530d1da134 -> %rdi
                               7f 00 00
 +78   m4 @0x00007f531de08010  a6                   cmps   %ds:(%rsi)[1byte] %es:(%rdi)[1byte] %rsi %rdi -> %rsi %rdi
 +79   m4 @0x00007f531de04e28  0f 85 fa ff ff ff    jnz    @0x00007f531de05ce8[8byte]
 +85   m4 @0x00007f531de08578  48 b9 34 a1 1d 0d 53 mov    $0x00007f530d1da134 -> %rcx
                               7f 00 00
 +95   m4 @0x00007f531de08960  48 3b f1             cmp    %rsi %rcx
 +98   m4 @0x00007f531de061a0  48 b9 12 00 00 00 00 mov    $0x0000000000000012 -> %rcx
                               00 00 00
 +108  m4 @0x00007f531de05ed0  0f 8d fa ff ff ff    jnl    @0x00007f531de09ad0[8byte]
 +114  m4 @0x00007f531de04d58  48 bf 46 a1 1d 0d 53 mov    $0x00007f530d1da146 -> %rdi
                               7f 00 00
 +124  m4 @0x00007f531de098e8  48 be 46 a1 1d 0d 53 mov    $0x00007f530d1da146 -> %rsi
                               7f 00 00
 +134  m4 @0x00007f531de09ad0                       <label>
 +134  m4 @0x00007f531de05b00  f3 a6                rep cmps %ds:(%rsi)[1byte] %es:(%rdi)[1byte] %rsi %rdi %rcx -> %rsi %rdi %rcx
 +136  m4 @0x00007f531de05ce8                       <label>
 +136  m4 @0x00007f531de06fc8  65 48 8b 0c 25 10 00 mov    %gs:0x10[8byte] -> %rcx
                               00 00
 +145  m4 @0x00007f531de084f8  65 48 8b 34 25 08 00 mov    %gs:0x08[8byte] -> %rsi
                               00 00
 +154  m4 @0x00007f531de086e0  65 48 8b 3c 25 18 00 mov    %gs:0x18[8byte] -> %rdi
                               00 00
 +163  L4 @0x00007f531de07288  0f 85 1e e2 3b ef    jnz    $0x00007f530d1da134
 +169  L3 @0x00007f531de09e38  48 83 c4 30          add    $0x0000000000000030 %rsp -> %rsp
 +173  L3 @0x00007f531de05560  5d                   pop    %rsp (%rsp)[8byte] -> %rbp %rsp
 +174  L3 @0x00007f531de054f8  49 ba 00 90 cb 5a 55 mov    $0x00007f555acb9000 -> %r10
                               7f 00 00
 +184  L3 @0x00007f53209062a8  41 85 02             test   (%r10)[4byte] %eax
 +187  m4 @0x00007f531de07188  65 48 89 0c 25 10 00 mov    %rcx -> %gs:0x10[8byte]
                               00 00
 +196  m4 @0x00007f531de07ad8  59                   pop    %rsp (%rsp)[8byte] -> %rcx %rsp
 +197  L4 @0x00007f531de088e0  e9 eb 08 ce f8       jmp    $0x00007f5516afc800 <shared_bb_ibl_ret>
END 0x00007f530d1da134

cache pc 0x00007f5517e73d49 vs 0x00007f5517e73d58 9 0x0000000000000000
cache pc 0x00007f5517e73d52 vs 0x00007f5517e73d58 10 0x0000000000000000
cache pc 0x00007f5517e73d5c vs 0x00007f5517e73d58 2 0x0000000000000000
recreate_app -- WARNING: cache pc 0x00007f5517e73d5c != 0x00007f5517e73d58, probably prefix instruction
2 recreate_app -- found valid state pc 0x00007f530d1da134
1 recreate_app -- found ok pc 0x00007f530d1da134
(gdb) x /34i 0x00007f5517e73d49
   0x7f5517e73d49:      add    $0x30,%rsp
   0x7f5517e73d4d:      pop    %rbp
   0x7f5517e73d4e:      movabs $0x7f555acb9000,%r10
   0x7f5517e73d58:      test   %eax,(%r10)
   0x7f5517e73d5b:      mov    %rcx,%gs:0x10
   0x7f5517e73d64:      pop    %rcx
   0x7f5517e73d65:      jmpq   0x7f5516afc8ab
   0x7f5517e73d6a:      and    %bh,%ah
   0x7f5517e73d6c:      rex.W stos %al,%es:(%rdi)
   0x7f5517e73d6e:      xchg   %ah,(%rax)
   0x7f5517e73d70:      push   %rbx
   0x7f5517e73d71:      jg     0x7f5517e73d73
   0x7f5517e73d73:      add    %ah,0x65(%rdi)
   0x7f5517e73d76:      movabs 0xc8b486500000000,%rax
   0x7f5517e73d80:      and    $0x10,%eax
   0x7f5517e73d85:      pop    %r11
   0x7f5517e73d87:      pop    %rsp
   0x7f5517e73d88:      pop    %r10
   0x7f5517e73d8a:      pop    %r9
   0x7f5517e73d8c:      pop    %r8
   0x7f5517e73d8e:      pop    %rcx
   0x7f5517e73d8f:      pop    %rdx
   0x7f5517e73d90:      pop    %rsi
   0x7f5517e73d91:      pop    %rdi
   0x7f5517e73d92:      cmp    %rax,%r15
   0x7f5517e73d95:      je     0x7f5517ac64c1
   0x7f5517e73d9b:      jmpq   0x7f5517ac64c1
   0x7f5517e73da0:      shrb   $0x0,0x7f532086(%rdx)
   0x7f5517e73da7:      add    %ah,0x65(%rdi)
   0x7f5517e73daa:      movabs 0xc8b486500000000,%rax
   0x7f5517e73db4:      and    $0x10,%eax
   0x7f5517e73db9:      pop    %rax
   0x7f5517e73dba:      movabs $0x0,%r10
   0x7f5517e73dc4:      mov    %r10,0x270(%r15)
(gdb)  x /200b 0x00007f5517e73d49
0x7f5517e73d49: 0x48    0x83    0xc4    0x30    0x5d    0x49    0xba    0x00
0x7f5517e73d51: 0x90    0xcb    0x5a    0x55    0x7f    0x00    0x00    0x41
0x7f5517e73d59: 0x85    0x02    0x65    0x48    0x89    0x0c    0x25    0x10
0x7f5517e73d61: 0x00    0x00    0x00    0x59    0xe9    0x41    0x8b    0xc8
0x7f5517e73d69: 0xfe    0x20    0xfc    0x48    0xaa    0x86    0x20    0x53
0x7f5517e73d71: 0x7f    0x00    0x00    0x67    0x65    0x48    0xa1    0x00
0x7f5517e73d79: 0x00    0x00    0x00    0x65    0x48    0x8b    0x0c    0x25
0x7f5517e73d81: 0x10    0x00    0x00    0x00    0x41    0x5b    0x5c    0x41
0x7f5517e73d89: 0x5a    0x41    0x59    0x41    0x58    0x59    0x5a    0x5e
0x7f5517e73d91: 0x5f    0x4c    0x3b    0xf8    0x0f    0x84    0x26    0x27
0x7f5517e73d99: 0xc5    0xff    0xe9    0x21    0x27    0xc5    0xff    0xc0
0x7f5517e73da1: 0xaa    0x86    0x20    0x53    0x7f    0x00    0x00    0x67
0x7f5517e73da9: 0x65    0x48    0xa1    0x00    0x00    0x00    0x00    0x65
0x7f5517e73db1: 0x48    0x8b    0x0c    0x25    0x10    0x00    0x00    0x00
0x7f5517e73db9: 0x58    0x49    0xba    0x00    0x00    0x00    0x00    0x00
0x7f5517e73dc1: 0x00    0x00    0x00    0x4d    0x89    0x97    0x70    0x02
0x7f5517e73dc9: 0x00    0x00    0x49    0xba    0x00    0x00    0x00    0x00
0x7f5517e73dd1: 0x00    0x00    0x00    0x00    0x4d    0x89    0x97    0x80
0x7f5517e73dd9: 0x02    0x00    0x00    0x49    0xba    0x00    0x00    0x00
0x7f5517e73de1: 0x00    0x00    0x00    0x00    0x00    0x4d    0x89    0x97
0x7f5517e73de9: 0x78    0x02    0x00    0x00    0x49    0x81    0x7f    0x08
0x7f5517e73df1: 0x00    0x00    0x00    0x00    0x0f    0x84    0xf4    0x26
0x7f5517e73df9: 0xc5    0xff    0xe9    0xef    0x26    0xc5    0xff    0x28
0x7f5517e73e01: 0x0e    0x6f    0x1a    0x53    0x7f    0x00    0x00    0x04
0x7f5517e73e09: 0x7f    0x9e    0x67    0x65    0x48    0xa1    0x00    0x00
(gdb) x /34i 0x00007f530d1da134
   0x7f530d1da134:      add    $0x30,%rsp
   0x7f530d1da138:      pop    %rbp
   0x7f530d1da139:      movabs $0x7f555acb9000,%r10
   0x7f530d1da143:      test   %eax,(%r10)
   0x7f530d1da146:      retq
   0x7f530d1da147:      test   %rdx,%rdx
   0x7f530d1da14a:      je     0x7f530d1da230
   0x7f530d1da150:      mov    (%rdx),%r10
   0x7f530d1da153:      mov    %r10,%r11
   0x7f530d1da156:      and    $0x7,%r11
   0x7f530d1da15a:      cmp    $0x1,%r11
   0x7f530d1da15e:      jne    0x7f530d1da237
   0x7f530d1da164:      shr    $0x8,%r10
   0x7f530d1da168:      mov    %r10d,%eax
   0x7f530d1da16b:      and    $0x7fffffff,%eax
   0x7f530d1da171:      test   %eax,%eax
   0x7f530d1da173:      je     0x7f530d1da237
   0x7f530d1da179:      mov    0x8(%rsp),%r10
   0x7f530d1da17e:      mov    0x20(%r10),%r10
   0x7f530d1da182:      mov    0x10(%r10),%r11d
   0x7f530d1da186:      and    $0x7fffffff,%eax
   0x7f530d1da18c:      test   %r11d,%r11d
   0x7f530d1da18f:      je     0x7f530d1da2a1
   0x7f530d1da195:      cmp    $0x80000000,%eax
   0x7f530d1da19a:      jne    0x7f530d1da1a4
   0x7f530d1da19c:      xor    %edx,%edx
   0x7f530d1da19e:      cmp    $0xffffffff,%r11d
   0x7f530d1da1a2:      je     0x7f530d1da1a8
   0x7f530d1da1a4:      cltd
   0x7f530d1da1a5:      idiv   %r11d
   0x7f530d1da1a8:      mov    0x18(%r10,%rdx,4),%eax
   0x7f530d1da1ad:      test   %eax,%eax
   0x7f530d1da1af:      jl     0x7f530d1da226
   0x7f530d1da1b5:      mov    0x8(%rsp),%r10
(gdb) x /200b 0x00007f530d1da134
0x7f530d1da134: 0x48    0x83    0xc4    0x30    0x5d    0x49    0xba    0x00
0x7f530d1da13c: 0x90    0xcb    0x5a    0x55    0x7f    0x00    0x00    0x41
0x7f530d1da144: 0x85    0x02    0xc3    0x48    0x85    0xd2    0x0f    0x84
0x7f530d1da14c: 0xe0    0x00    0x00    0x00    0x4c    0x8b    0x12    0x4d
0x7f530d1da154: 0x8b    0xda    0x49    0x83    0xe3    0x07    0x49    0x83
0x7f530d1da15c: 0xfb    0x01    0x0f    0x85    0xd3    0x00    0x00    0x00
0x7f530d1da164: 0x49    0xc1    0xea    0x08    0x41    0x8b    0xc2    0x81
0x7f530d1da16c: 0xe0    0xff    0xff    0xff    0x7f    0x85    0xc0    0x0f
0x7f530d1da174: 0x84    0xbe    0x00    0x00    0x00    0x4c    0x8b    0x54
0x7f530d1da17c: 0x24    0x08    0x4d    0x8b    0x52    0x20    0x45    0x8b
0x7f530d1da184: 0x5a    0x10    0x81    0xe0    0xff    0xff    0xff    0x7f
0x7f530d1da18c: 0x45    0x85    0xdb    0x0f    0x84    0x0c    0x01    0x00
0x7f530d1da194: 0x00    0x3d    0x00    0x00    0x00    0x80    0x75    0x08
0x7f530d1da19c: 0x33    0xd2    0x41    0x83    0xfb    0xff    0x74    0x04
0x7f530d1da1a4: 0x99    0x41    0xf7    0xfb    0x41    0x8b    0x44    0x92
0x7f530d1da1ac: 0x18    0x85    0xc0    0x0f    0x8c    0x71    0x00    0x00
0x7f530d1da1b4: 0x00    0x4c    0x8b    0x54    0x24    0x08    0x4d    0x8b
0x7f530d1da1bc: 0x4a    0x30    0x41    0x8b    0x49    0x10    0x3b    0xc1
0x7f530d1da1c4: 0x0f    0x83    0x83    0x00    0x00    0x00    0x4d    0x8b
0x7f530d1da1cc: 0x54    0xc1    0x18    0x4c    0x3b    0x54    0x24    0x10
0x7f530d1da1d4: 0x0f    0x84    0x5a    0xff    0xff    0xff    0x4c    0x8b
0x7f530d1da1dc: 0x54    0x24    0x08    0x49    0x8b    0x6a    0x28    0x44
0x7f530d1da1e4: 0x8b    0x55    0x10    0x41    0x3b    0xc2    0x0f    0x83
0x7f530d1da1ec: 0x96    0x00    0x00    0x00    0x8b    0x44    0x85    0x18
0x7f530d1da1f4: 0x85    0xc0    0x7c    0x2e    0x49    0xbb    0x00    0x90
kuhanov commented 3 years ago

It looks like the ilist has a set of incorrect instructions prepended, because the instrlist_disassemble() output only matches the real instructions starting from

+169  L3 @0x00007f531de09e38  48 83 c4 30          add    $0x0000000000000030 %rsp -> %rsp
 +173  L3 @0x00007f531de05560  5d                   pop    %rsp (%rsp)[8byte] -> %rbp %rsp
 +174  L3 @0x00007f531de054f8  49 ba 00 90 cb 5a 55 mov    $0x00007f555acb9000 -> %r10
                               7f 00 00
 +184  L3 @0x00007f53209062a8  41 85 02             test   (%r10)[4byte] %eax
 +187  m4 @0x00007f531de07188  65 48 89 0c 25 10 00 mov    %rcx -> %gs:0x10[8byte]
                               00 00
 +196  m4 @0x00007f531de07ad8  59                   pop    %rsp (%rsp)[8byte] -> %rcx %rsp
 +197  L4 @0x00007f531de088e0  e9 eb 08 ce f8       jmp    $0x00007f5516afc800 <shared_bb_ibl_ret>

It looks like mangle_bb_ilist added an extra 169 bytes of instructions at the front. So when we walk the ilist, we start from the old fragment address (frag @ 0x00007f5517e73d49) but count the lengths of the mangling instructions.

Kirill
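The 169-byte shift can be checked with plain arithmetic on the numbers in the logs. The following is only an illustrative Python sketch: `find_instr_at` is a made-up helper, and it assumes the translation walk simply adds up instruction lengths from the fragment start. The addresses come from the recreate_app line for tag 0x00007f530d1da134 and the instruction lengths from the instrlist_disassemble() output for that tag.

```python
# Numbers copied from the logs for tag 0x00007f530d1da134.
TAG = 0x00007F530D1DA134
FRAG_START = 0x00007F5517E73D49       # "frag @" in the recreate_app line
TARGET_CACHE_PC = 0x00007F5517E73D58  # "looking for" in the recreate_app line
SANDBOX_PREFIX_LEN = 169              # offset of the first L3 (app) instr in the recreated ilist

# The app instructions of this block: (length in bytes, text).
APP_INSTRS = [
    (4, "add $0x30 %rsp"),
    (1, "pop %rbp"),
    (10, "mov $0x00007f555acb9000 -> %r10"),
    (3, "test (%r10) %eax"),
]


def find_instr_at(instrs, offset):
    """Hypothetical helper: return (start_offset, text) of the instruction
    that begins exactly at `offset`, or None if no instruction starts there."""
    pos = 0
    for length, text in instrs:
        if pos == offset:
            return pos, text
        pos += length
    return None


offset_in_frag = TARGET_CACHE_PC - FRAG_START        # 15 bytes into the cached fragment
in_cache = find_instr_at(APP_INSTRS, offset_in_frag)  # the test instruction
recreated_offset = SANDBOX_PREFIX_LEN + offset_in_frag  # where the same instr sits in the recreated ilist
app_pc = TAG + offset_in_frag

print(offset_in_frag, in_cache, recreated_offset, hex(app_pc))
```

Under that assumption, the target is the `test` instruction: 15 bytes into the cached fragment, but at +184 in the recreated ilist. The corresponding app pc, 0x7f530d1da143, is the `test %eax,(%r10)` in the gdb disassembly.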

derekbruening commented 3 years ago

This:

recreate_app : looking for 0x00007f5517e73d58 in frag @ 0x00007f5517e73d49 (tag 0x00007f530d1da134)
TAG  0x00007f530d1da134
 +0    m4 @0x00007f531de09cd0  65 48 89 0c 25 10 00 mov    %rcx -> %gs:0x10[8byte]
                               00 00
 +9    m4 @0x00007f531de08d30  48 b9 00 00 00 00 00 mov    $0x0000000000000000 -> %rcx
                               00 00 00
 +19   m4 @0x00007f531de08ac8  ff 01                inc    (%rcx)[4byte] -> (%rcx)[4byte]
 +21   m4 @0x00007f531de05378  83 39 14             cmp    (%rcx)[4byte] $0x00000014
 +24   m4 @0x00007f531de07048  7c fe                jl     @0x00007f531de075f0[8byte]
 +26   m4 @0x00007f531de08cb0  65 48 8b 0c 25 10 00 mov    %gs:0x10[8byte] -> %rcx
                               00 00
 +35   L4 @0x00007f531de07958  e9 1f e2 3b ef       jmp    $0x00007f530d1da134
 +40   m4 @0x00007f531de075f0                       <label>
 +40   m4 @0x00007f531de07d28  65 48 89 34 25 08 00 mov    %rsi -> %gs:0x08[8byte]
                               00 00
...

Looks like the mangling added for selfmod sandboxing. -sandbox2ro_threshold is 20==0x14. (Note all the "m4": that's a meta level 4 instr so not coming from the app but from DR's added mangling.)

So the problem is the recreate thinks there should be sandboxing mangling while the actual fragment in the code cache does not have such mangling? Look at the logs around this sandbox2ro threshold and swapping between sandboxing code from writable app pages and marking app pages read-only. What are the page protections on this app code? What happens w/ -sandbox2ro_threshold 0 -ro2sandbox_threshold 0?

derekbruening commented 3 years ago

The one strange thing is that instrlist_disassemble doesn't stop at the jmp: +35 L4 @0x00007f531de07958 e9 1f e2 3b ef jmp $0x00007f530d1da134

I believe that's marked as an app instr to create an exit from the fragment if the threshold is reached: so it's really a synthetic jump that's not part of the original app code.
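As a reading aid for the first few m4 instructions of the recreated ilist (inc, cmp against 0x14, jl, then the synthetic jmp to the tag), here is a rough Python model of that counter check. It is an interpretation of the disassembly only, not DynamoRIO's implementation; the class name `SandboxCounter` and the returned strings are invented for illustration.

```python
SANDBOX2RO_THRESHOLD = 0x14  # 20, the immediate in "cmp (%rcx)[4byte] $0x00000014"


class SandboxCounter:
    """Toy model of the per-fragment execution counter in the sandboxing prefix."""

    def __init__(self, threshold=SANDBOX2RO_THRESHOLD):
        self.threshold = threshold
        self.count = 0

    def enter_fragment(self):
        self.count += 1                   # inc (%rcx)
        if self.count < self.threshold:   # cmp (%rcx), $0x14 ; jl <label>
            return "run_sandboxed_body"   # fall into the selfmod check and the app code
        return "exit_to_tag"              # synthetic "jmp $tag" exit from the fragment


counter = SandboxCounter()
results = [counter.enter_fragment() for _ in range(20)]
print(results.count("run_sandboxed_body"), results[-1])
```

In this model the first 19 entries run the sandboxed body and the 20th takes the exit, which matches the description of the jmp as an exit taken once the threshold is reached.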

kuhanov commented 3 years ago

So the problem is the recreate thinks there should be sandboxing mangling while the actual fragment in the code cache does not have such mangling?

Yes. For example, for this one the recreated ilist has sandboxing instructions, but if I dump the bytes before code cache address 0x00007f809098a5ed they are not the same.

(gdb) x /300b (0x00007f809098a5ed-176)
0x7f809098a53d: 0x00    0x65    0x48    0x89    0x0c    0x25    0x10    0x00
0x7f809098a545: 0x00    0x00    0x49    0x8b    0xca    0x68    0xbd    0x71
0x7f809098a54d: 0x67    0x8c    0xc7    0x44    0x24    0x04    0x7e    0x7f
0x7f809098a555: 0x00    0x00    0xe9    0x83    0x74    0xcb    0xfe    0x30
0x7f809098a55d: 0x67    0x1c    0x93    0x7e    0x7f    0x00    0x00    0x67
0x7f809098a565: 0x65    0x48    0xa1    0x00    0x00    0x00    0x00    0x65
0x7f809098a56d: 0x48    0x8b    0x0c    0x25    0x10    0x00    0x00    0x00
0x7f809098a575: 0x41    0x5b    0x5c    0x41    0x5a    0x41    0x59    0x41
0x7f809098a57d: 0x58    0x59    0x5a    0x5e    0x5f    0x4c    0x3b    0xf8
0x7f809098a585: 0x0f    0x84    0x5a    0xda    0xf6    0xff    0xe9    0x55
0x7f809098a58d: 0xda    0xf6    0xff    0x70    0xc1    0x22    0x95    0x7e
0x7f809098a595: 0x7f    0x00    0x00    0x67    0x65    0x48    0xa1    0x00
0x7f809098a59d: 0x00    0x00    0x00    0x65    0x48    0x8b    0x0c    0x25
0x7f809098a5a5: 0x10    0x00    0x00    0x00    0x66    0x8b    0x47    0x08
0x7f809098a5ad: 0x66    0x89    0x46    0x08    0x49    0xba    0x98    0x5d
0x7f809098a5b5: 0x37    0xd3    0x80    0x7f    0x00    0x00    0x41    0xff
0x7f809098a5bd: 0x02    0x48    0x33    0xc0    0xc9    0x65    0x48    0x89
0x7f809098a5c5: 0x0c    0x25    0x10    0x00    0x00    0x00    0x59    0xe9
0x7f809098a5cd: 0xda    0x72    0xcb    0xfe    0x30    0x00    0x00    0xb0
0x7f809098a5d5: 0xc1    0x22    0x95    0x7e    0x7f    0x00    0x00    0x67
0x7f809098a5dd: 0x65    0x48    0xa1    0x00    0x00    0x00    0x00    0x65
0x7f809098a5e5: 0x48    0x8b    0x0c    0x25    0x10    0x00    0x00    0x00
0x7f809098a5ed: 0x48    0x83    0xc4    0x30    0x5d    0x49    0xba    0x00
0x7f809098a5f5: 0xe0    0x7f    0xd3    0x80    0x7f    0x00    0x00    0x41
0x7f809098a5fd: 0x85    0x02    0x65    0x48    0x89    0x0c    0x25    0x10
0x7f809098a605: 0x00    0x00    0x00    0x59    0xe9    0x9d    0x72    0xcb
0x7f809098a60d: 0xfe    0x6d    0x90    0x90    0x2d    0x47    0x99    0x7e
recreate_app : looking for 0x00007f809098a5fc in frag @ 0x00007f809098a5ed (tag 0x00007f7e8c7f1938)
recreate_app : pc is in F(0x00007f7e8c7f1938)
TAG  0x00007f7e8c7f1938
 +0    m4 @0x00007f7e95cb0518 opcode=56 65 48 89 0c 25 10 00 mov    %rcx -> %gs:0x10[8byte]
                               00 00
 +9    m4 @0x00007f7e95cac158 opcode=57 48 b9 00 00 00 00 00 mov    $0x0000000000000000 -> %rcx
                               00 00 00
 +19   m4 @0x00007f7e95cafd38 opcode=16 ff 01                inc    (%rcx)[4byte] -> (%rcx)[4byte]
 +21   m4 @0x00007f7e994ef0c0 opcode=14 83 39 14             cmp    (%rcx)[4byte] $0x00000014
 +24   m4 @0x00007f7e95cae2a8 opcode=38 7c fe                jl     @0x00007f7e95cad870[8byte]
 +26   m4 @0x00007f7e95cad708 opcode=55 65 48 8b 0c 25 10 00 mov    %gs:0x10[8byte] -> %rcx
                               00 00
 +35   L4 @0x00007f7e95cabff0 opcode=46 e9 33 ea b2 f6       jmp    $0x00007f7e8c7f1938
 +40   m4 @0x00007f7e95cad870 opcode=3                      <label>
 +40   m4 @0x00007f7e95cabdc0 opcode=56 65 48 89 34 25 08 00 mov    %rsi -> %gs:0x08[8byte]
                               00 00
 +49   m4 @0x00007f7e95cac1c0 opcode=56 65 48 89 3c 25 18 00 mov    %rdi -> %gs:0x18[8byte]
                               00 00
 +58   m4 @0x00007f7e95cb0dd0 opcode=57 48 be 38 19 7f 8c 7e mov    $0x00007f7e8c7f1938 -> %rsi
                               7f 00 00
 +68   m4 @0x00007f7e95cac0d8 opcode=57 48 bf 38 19 7f 8c 7e mov    $0x00007f7e8c7f1938 -> %rdi
                               7f 00 00
 +78   m4 @0x00007f7e95cac7c0 opcode=393 a6                   cmps   %ds:(%rsi)[1byte] %es:(%rdi)[1byte] %rsi %rdi -> %rsi %rdi
 +79   m4 @0x00007f7e95cad2e8 opcode=157 0f 85 fa ff ff ff    jnz    @0x00007f7e95cabe28[8byte]
 +85   m4 @0x00007f7e95cae1d8 opcode=57 48 b9 38 19 7f 8c 7e mov    $0x00007f7e8c7f1938 -> %rcx
                               7f 00 00
 +95   m4 @0x00007f7e95cacaa8 opcode=14 48 3b f1             cmp    %rsi %rcx
 +98   m4 @0x00007f7e95cafb38 opcode=57 48 b9 12 00 00 00 00 mov    $0x0000000000000012 -> %rcx
                               00 00 00
 +108  m4 @0x00007f7e95cabc28 opcode=165 0f 8d fa ff ff ff    jnl    @0x00007f7e95cad8f0[8byte]
 +114  m4 @0x00007f7e95cadf28 opcode=57 48 bf 4a 19 7f 8c 7e mov    $0x00007f7e8c7f194a -> %rdi
                               7f 00 00
 +124  m4 @0x00007f7e95cad808 opcode=57 48 be 4a 19 7f 8c 7e mov    $0x00007f7e8c7f194a -> %rsi
                               7f 00 00
 +134  m4 @0x00007f7e95cad8f0 opcode=3                      <label>
 +134  m4 @0x00007f7e95cae240 opcode=394 f3 a6                rep cmps %ds:(%rsi)[1byte] %es:(%rdi)[1byte] %rsi %rdi %rcx -> %rsi %rdi %rcx
 +136  m4 @0x00007f7e95cabe28 opcode=3                      <label>
 +136  m4 @0x00007f7e95cad608 opcode=55 65 48 8b 0c 25 10 00 mov    %gs:0x10[8byte] -> %rcx
                               00 00
 +145  m4 @0x00007f7e95cabb58 opcode=55 65 48 8b 34 25 08 00 mov    %gs:0x08[8byte] -> %rsi
                               00 00
 +154  m4 @0x00007f7e95cad588 opcode=55 65 48 8b 3c 25 18 00 mov    %gs:0x18[8byte] -> %rdi
                               00 00
 +163  L4 @0x00007f7e95cac4f0 opcode=157 0f 85 32 ea b2 f6    jnz    $0x00007f7e8c7f1938
 +169  L3 @0x00007f7e95cade10 opcode=4 48 83 c4 30          add    $0x0000000000000030 %rsp -> %rsp
 +173  L3 @0x00007f7e95cb0968 opcode=20 5d                   pop    %rsp (%rsp)[8byte] -> %rbp %rsp
 +174  L3 @0x00007f7e95cacd90 opcode=57 49 ba 00 e0 7f d3 80 mov    $0x00007f80d37fe000 -> %r10
                               7f 00 00
 +184  L3 @0x00007f7e95cac488 opcode=60 41 85 02             test   (%r10)[4byte] %eax
 +187  m4 @0x00007f7e95cabbc0 opcode=56 65 48 89 0c 25 10 00 mov    %rcx -> %gs:0x10[8byte]
                               00 00
 +196  m4 @0x00007f7e994ef590 opcode=20 59                   pop    %rsp (%rsp)[8byte] -> %rcx %rsp
 +197  L4 @0x00007f7e95cabd58 opcode=46 e9 fb e8 97 f9       jmp    $0x00007f808f641800 <shared_bb_ibl_ret>
END 0x00007f7e8c7f1938
kuhanov commented 3 years ago

What are the page protections on this app code?

tag 0x00007f7e8c7f1938

cat /proc/401468/maps
...
7f7e8c62e000-7f7e8c89e000 rwxp 00000000 00:00 0.
...
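For anyone repeating this check on another run, a small Python sketch of the same lookup: parse a /proc/<pid>/maps line and test whether the tag lies in a region that is both writable and executable. The helper name `parse_maps_line` is made up; the line and the tag are the ones quoted above.

```python
def parse_maps_line(line):
    """Split one /proc/<pid>/maps line into (start, end, perms)."""
    fields = line.split()
    start_s, end_s = fields[0].split("-")
    return int(start_s, 16), int(end_s, 16), fields[1]


MAPS_LINE = "7f7e8c62e000-7f7e8c89e000 rwxp 00000000 00:00 0"
TAG = 0x00007F7E8C7F1938

start, end, perms = parse_maps_line(MAPS_LINE)
tag_in_region = start <= TAG < end
writable_and_executable = "w" in perms and "x" in perms

print(tag_in_region, perms, writable_and_executable)
```

The tag falls inside the anonymous rwxp mapping, so the app code is on a page that is writable and executable at the same time.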
kuhanov commented 3 years ago

What happens w/ -sandbox2ro_threshold 0 -ro2sandbox_threshold 0?

No crashes with these options. Kirill

derekbruening commented 3 years ago

We need to figure out the timing here: was this fragment flushed (for an ro2sandbox transition) but not yet fully deleted and the translation request came in for the half-deleted fragment after the page was made writable?

Xref -safe_translate_flushed which IIRC was supposed to solve such issues but never enabled by default b/c of performance problems. Or is the sandboxing decision supposed to come from the fragment flags and not the vmareas? Not remembering the details.

kuhanov commented 3 years ago

Xref -safe_translate_flushed

In my case, this option doesn't work at all. I get a hang at the beginning of the benchmark run with 2 java threads.

(gdb) info threads
  Id   Target Id          Frame
* 1    LWP 466688 "java"  0x00007f67f4225b82 in ?? ()
  2    LWP 466689 "java"  0x00007f683854a7ea in ?? ()

Kirill

Hi, @derekbruening. Should we investigate anything else? Or could we use the -sandbox2ro_threshold 0 -ro2sandbox_threshold 0 options? Is that OK? Thanks, Kirill

derekbruening commented 3 years ago

Should we investigate anything else? Or could we use the -sandbox2ro_threshold 0 -ro2sandbox_threshold 0 options? Is that OK?

Disabling those parameters (by setting to 0) should work correctly but may have extra overhead.

It would be good to confirm that the problem with those parameters being enabled is indeed a half-deleted fragment: if logs are available, look for an entry for the fragment with the translation problem being unlinked or other steps toward deletion prior to the translation issue.

kuhanov commented 3 years ago

Disabling those parameters (by setting to 0) should work correctly but may have extra overhead.

It would be good to confirm that the problem with those parameters being enabled is indeed a half-deleted fragment: if logs are available, look for an entry for the fragment with the translation problem being unlinked or other steps toward deletion prior to the translation issue.

I could not enable full logging because reproducing in debug mode takes a huge amount of time. I tried adding logs in fragment_prepare_for_removal_from_table:

        dr_fprintf(STDERR,
            "fragment_prepare_for_removal_from_table: remove frag @@" PFX " (tag " PFX ")\n",
            f->start_pc, f->tag);

For the bad fragment

recreate_app : looking for 0x00007ff8245da446 in frag @ 0x00007ff8245da41d (tag 0x00007ff6202cf0e2)

there are no "fragment_prepare_for_removal_from_table: remove frag @@" log lines with 0x00007ff8245da41d.

Or am I wrong? Do I need to add anything else to catch the removal? Kirill

derekbruening commented 3 years ago

I think adding logging in fragment_unlink_for_deletion() would be better since these may not be indirect branch targets.

kuhanov commented 3 years ago

I think adding logging in fragment_unlink_for_deletion() would be better since these may not be indirect branch targets.

Still no removal. For example, for

recreate_app : looking for 0x00007fc205ed319b in frag @ 0x00007fc205ed3191 (tag 0x00007fbffd1cdfbc)

there are several fragment unlinkings for this tag, for other fragments, but none for 0x00007fc205ed3191:

fragment_unlink_for_deletion: remove frag @ 0x00007fc205e9ff48 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205f1bcc8 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ec2d14 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ebca1d (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205f474e4 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ee57b4 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ee1da8 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ea3314 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ee8bcc (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ecfbc8 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ed3180 (tag 0x00007fbffd1cdfbc)

Kirill
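The same search can be scripted so it is easy to repeat on each run. This is a minimal Python sketch, assuming the log line format produced by the custom dr_fprintf added for this experiment; `unlinked_starts` is a hypothetical helper name, and the log text is the excerpt above.

```python
import re

LOG = """\
fragment_unlink_for_deletion: remove frag @ 0x00007fc205e9ff48 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205f1bcc8 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ec2d14 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ebca1d (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205f474e4 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ee57b4 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ee1da8 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ea3314 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ee8bcc (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ecfbc8 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ed3180 (tag 0x00007fbffd1cdfbc)
"""

UNLINK_RE = re.compile(
    r"fragment_unlink_for_deletion: remove frag @ (0x[0-9a-f]+) \(tag (0x[0-9a-f]+)\)"
)


def unlinked_starts(log_text, tag):
    """Return the start pcs of all fragments unlinked for deletion with this tag."""
    return [
        int(m.group(1), 16)
        for m in UNLINK_RE.finditer(log_text)
        if int(m.group(2), 16) == tag
    ]


TAG = 0x00007FBFFD1CDFBC
BAD_FRAG = 0x00007FC205ED3191  # "frag @" from the failing recreate_app line

starts = unlinked_starts(LOG, TAG)
print(len(starts), BAD_FRAG in starts)
```

On this excerpt the tag has eleven unlinked fragments and the failing fragment's start pc is not among them.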

kuhanov commented 3 years ago

Just an observation:

This is the first fragment (among the fragments seen while handling signals) where bb->flags includes FRAG_SELFMOD_SANDBOXED.

master_signal_handler_C()
record_pending_signal()
translate_sigcontext()
translate_mcontext()
recreate_app_state()
recreate_app_state_internal()
recreate_fragment_ilist
recreate_bb_ilist
build_bb_ilist
check_new_page_start
check_thread_vm_area  -> set flags to FRAG_SELFMOD_SANDBOXED

Kirill

derekbruening commented 3 years ago

This is the first fragment (among the fragments seen while handling signals) where bb->flags includes FRAG_SELFMOD_SANDBOXED.

But that flag was not set when the fragment was first created -- we need to figure out how the vmarea had its flags changed without flushing all the fragments inside (since we already looked for this fragment being partially deleted from being flushed). If -logmask LOG_VMAREAS is too much output I guess targeted logs on vmareas being marked as sandboxed would be needed to try and figure out the timing.

kuhanov commented 3 years ago

This is the first fragment (among the fragments seen while handling signals) where bb->flags includes FRAG_SELFMOD_SANDBOXED.

But that flag was not set when the fragment was first created -- we need to figure out how the vmarea had its flags changed without flushing all the fragments inside (since we already looked for this fragment being partially deleted from being flushed). If -logmask LOG_VMAREAS is too much output I guess targeted logs on vmareas being marked as sandboxed would be needed to try and figure out the timing.

DRIO doesn't mark the vmarea as sandboxed; it sets this flag on the fragment directly in check_thread_vm_area():

if (ok && ro2s->written_count >= DYNAMO_OPTION(ro2sandbox_threshold)) {
    ...
    frag_flags |= SANDBOX_FLAG();
    ...

Kirill

derekbruening commented 3 years ago

DRIO doesn't mark the vmarea as sandboxed; it sets this flag on the fragment directly in check_thread_vm_area()

You mean, when the fragment is created that threshold has not been crossed, but when it recreates the fragment the threshold has been crossed (due to some concurrent execution in another thread or something)? But this code you've quoted is only entered when an area is not on the executable list: which means it was removed on a flush (or it's the very first execution for non-ELF-image regions). But you saw no flush? Maybe re-search for a flush: look for flush_fragments_in_region_start maybe.

kuhanov commented 3 years ago

DRIO doesn't mark the vmarea as sandboxed; it sets this flag on the fragment directly in check_thread_vm_area()

You mean, when the fragment is created that threshold has not been crossed, but when it recreates the fragment the threshold has been crossed (due to some concurrent execution in another thread or something)? But this code you've quoted is only entered when an area is not on the executable list: which means it was removed on a flush (or it's the very first execution for non-ELF-image regions). But you saw no flush? Maybe re-search for a flush: look for flush_fragments_in_region_start maybe.

I added a log at the top of flush_fragments_in_region_start. On my last run I had the issue with

recreate_app : looking for 0x00007f5beb76829d in frag @ 0x00007f5beb768299 (tag 0x00007f59e1093e1b)

The tag is inside a region that was flushed before that point:

FLUSH flush_fragments_in_region_start (thread 1468315 flushtime 3972): 0x00007f59e1000000-0x00007f59e1270000
new executable area 0x00007f59e1000000-0x00007f59e1270000 written >= 10X => switch to sandboxing

I searched for the same tag earlier in the log, where we had a similar fragment:

FLUSH flush_fragments_in_region_start (thread 1468315 flushtime 3961): 0x00007f59e1000000-0x00007f59e1270000
FLUSH flush_fragments_in_region_start (thread 1468339 flushtime 3962): 0x00007f59e1012000-0x00007f59e1013000
FLUSH flush_fragments_in_region_start (thread 1468347 flushtime 3962): 0x00007f59e1052000-0x00007f59e1053000
FLUSH flush_fragments_in_region_start (thread 1468342 flushtime 3962): 0x00007f59e109e000-0x00007f59e109f000
FLUSH flush_fragments_in_region_start (thread 1468346 flushtime 3962): 0x00007f59e1097000-0x00007f59e1098000
FLUSH flush_fragments_in_region_start (thread 1468344 flushtime 3962): 0x00007f59e1052000-0x00007f59e1053000

recreate_app : looking for 0x00007f5beb49e87a in frag @ 0x00007f5beb49e7cd (tag 0x00007f59e1093e1b)

but there it was OK and the address was matched:

TAG  0x00007f59e1093e1b
 +0    m4 @0x00007f59f0698978 opcode=56 65 48 89 0c 25 10 00 mov    %rcx -> %gs:0x10[8byte]
                               00 00
 +9    m4 @0x00007f59f069a660 opcode=57 48 b9 00 00 00 00 00 mov    $0x0000000000000000 -> %rcx
                               00 00 00
 +19   m4 @0x00007f59f069a290 opcode=16 ff 01                inc    (%rcx)[4byte] -> (%rcx)[4byte]
 +21   m4 @0x00007f59f0697498 opcode=14 83 39 14             cmp    (%rcx)[4byte] $0x00000014
 +24   m4 @0x00007f59f069a9c8 opcode=38 7c fe                jl     @0x00007f59f0695f20[8byte]
 +26   m4 @0x00007f59f0698020 opcode=55 65 48 8b 0c 25 10 00 mov    %gs:0x10[8byte] -> %rcx
                               00 00
 +35   L4 @0x00007f59f16eb058 opcode=46 e9 16 6f 9e f0       jmp    $0x00007f59e1093e1b
 +40   m4 @0x00007f59f0695f20 opcode=3                      <label>
 +40   m4 @0x00007f59f0698c50 opcode=56 65 48 89 34 25 08 00 mov    %rsi -> %gs:0x08[8byte]
                               00 00
 +49   m4 @0x00007f59f0696fe0 opcode=56 65 48 89 3c 25 18 00 mov    %rdi -> %gs:0x18[8byte]
                               00 00
 +58   m4 @0x00007f59f0697430 opcode=57 48 be 1b 3e 09 e1 59 mov    $0x00007f59e1093e1b -> %rsi
                               7f 00 00
 +68   m4 @0x00007f59f069af30 opcode=57 48 bf 1b 3e 09 e1 59 mov    $0x00007f59e1093e1b -> %rdi
                               7f 00 00
 +78   m4 @0x00007f59f0696718 opcode=393 a6                   cmps   %ds:(%rsi)[1byte] %es:(%rdi)[1byte] %rsi %rdi -> %rsi %rdi
 +79   m4 @0x00007f59f0696c90 opcode=157 0f 85 fa ff ff ff    jnz    @0x00007f59f069a6e0[8byte]
 +85   m4 @0x00007f59f0695f88 opcode=57 48 b9 1b 3e 09 e1 59 mov    $0x00007f59e1093e1b -> %rcx
                               7f 00 00
 +95   m4 @0x00007f59f06964c8 opcode=14 48 3b f1             cmp    %rsi %rcx
 +98   m4 @0x00007f59f06993e0 opcode=57 48 b9 0a 00 00 00 00 mov    $0x000000000000000a -> %rcx
                               00 00 00
 +108  m4 @0x00007f59f0695c28 opcode=165 0f 8d fa ff ff ff    jnl    @0x00007f59f0699178[8byte]
 +114  m4 @0x00007f59f06962e0 opcode=57 48 bf 25 3e 09 e1 59 mov    $0x00007f59e1093e25 -> %rdi
                               7f 00 00
 +124  m4 @0x00007f59f06983f0 opcode=57 48 be 25 3e 09 e1 59 mov    $0x00007f59e1093e25 -> %rsi
                               7f 00 00
 +134  m4 @0x00007f59f0699178 opcode=3                      <label>
 +134  m4 @0x00007f59f06963c8 opcode=394 f3 a6                rep cmps %ds:(%rsi)[1byte] %es:(%rdi)[1byte] %rsi %rdi %rcx -> %rsi %rdi %rcx
 +136  m4 @0x00007f59f069a6e0 opcode=3                      <label>
 +136  m4 @0x00007f59f0697780 opcode=55 65 48 8b 0c 25 10 00 mov    %gs:0x10[8byte] -> %rcx
                               00 00
 +145  m4 @0x00007f59f069a190 opcode=55 65 48 8b 34 25 08 00 mov    %gs:0x08[8byte] -> %rsi
                               00 00
 +154  m4 @0x00007f59f0699db0 opcode=55 65 48 8b 3c 25 18 00 mov    %gs:0x18[8byte] -> %rdi
                               00 00
 +163  L4 @0x00007f59f0699f28 opcode=157 0f 85 15 6f 9e f0    jnz    $0x00007f59e1093e1b
 +169  L3 @0x00007f59f0698208 opcode=55 8b 44 85 18          mov    0x18(%rbp,%rax,4)[4byte] -> %eax
 +173  L3 @0x00007f59f0697eb8 opcode=60 41 85 03             test   (%r11)[4byte] %eax
 +176  L3 @0x00007f59f0697b50 opcode=60 85 c0                test   %eax %eax
 +178  L4 @0x00007f59f0698df0 opcode=165 0f 8d b0 19 df fa    jnl    $0x00007f5beb49e8b6
 +184  L4 @0x00007f59f0698108 opcode=46 e9 d5 19 df fa       jmp    $0x00007f5beb49e8da
END 0x00007f59e1093e1b

cache pc 0x00007f5beb49e7cd vs 0x00007f5beb49e87a 9 0x0000000000000000
cache pc 0x00007f5beb49e7d6 vs 0x00007f5beb49e87a 10 0x0000000000000000
cache pc 0x00007f5beb49e7e0 vs 0x00007f5beb49e87a 2 0x0000000000000000
cache pc 0x00007f5beb49e7e2 vs 0x00007f5beb49e87a 3 0x0000000000000000
cache pc 0x00007f5beb49e7e5 vs 0x00007f5beb49e87a 2 0x0000000000000000
cache pc 0x00007f5beb49e7e7 vs 0x00007f5beb49e87a 9 0x0000000000000000
cache pc 0x00007f5beb49e7f0 vs 0x00007f5beb49e87a 5 0x0000000000000000
cache pc 0x00007f5beb49e7f5 vs 0x00007f5beb49e87a 9 0x0000000000000000
cache pc 0x00007f5beb49e7fe vs 0x00007f5beb49e87a 9 0x0000000000000000
cache pc 0x00007f5beb49e807 vs 0x00007f5beb49e87a 10 0x0000000000000000
cache pc 0x00007f5beb49e811 vs 0x00007f5beb49e87a 10 0x0000000000000000
cache pc 0x00007f5beb49e81b vs 0x00007f5beb49e87a 1 0x0000000000000000
cache pc 0x00007f5beb49e81c vs 0x00007f5beb49e87a 6 0x0000000000000000
cache pc 0x00007f5beb49e822 vs 0x00007f5beb49e87a 10 0x0000000000000000
cache pc 0x00007f5beb49e82c vs 0x00007f5beb49e87a 3 0x0000000000000000
cache pc 0x00007f5beb49e82f vs 0x00007f5beb49e87a 10 0x0000000000000000
cache pc 0x00007f5beb49e839 vs 0x00007f5beb49e87a 6 0x0000000000000000
cache pc 0x00007f5beb49e83f vs 0x00007f5beb49e87a 10 0x0000000000000000
cache pc 0x00007f5beb49e849 vs 0x00007f5beb49e87a 10 0x0000000000000000
cache pc 0x00007f5beb49e853 vs 0x00007f5beb49e87a 2 0x0000000000000000
cache pc 0x00007f5beb49e855 vs 0x00007f5beb49e87a 9 0x0000000000000000
cache pc 0x00007f5beb49e85e vs 0x00007f5beb49e87a 9 0x0000000000000000
cache pc 0x00007f5beb49e867 vs 0x00007f5beb49e87a 9 0x0000000000000000
cache pc 0x00007f5beb49e870 vs 0x00007f5beb49e87a 6 0x0000000000000000
cache pc 0x00007f5beb49e876 vs 0x00007f5beb49e87a 4 0x00007f5beb49e8cf
cache pc 0x00007f5beb49e87a vs 0x00007f5beb49e87a 3 0x00007f5beb49e8d3
2 recreate_app -- found valid state pc 0x00007f59e1093e1f
1 recreate_app -- found ok pc 0x00007f59e1093e1f

Kirill
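The "cache pc ... vs ..." lines from this successful translation can be replayed with a few lines of Python. This sketch assumes the third column is the instruction length, which is consistent with the differences between consecutive cache pcs. `walk_to_target` is a made-up name and not a DynamoRIO function.

```python
FRAG_START = 0x00007F5BEB49E7CD   # "frag @" in the recreate_app line
TARGET = 0x00007F5BEB49E87A       # "looking for" in the recreate_app line
TAG = 0x00007F59E1093E1B
SANDBOX_PREFIX_LEN = 169          # offset of the first L3 (app) instr in the ilist

# Third column of the "cache pc" lines, in order.
LENGTHS = [9, 10, 2, 3, 2, 9, 5, 9, 9, 10, 10, 1, 6,
           10, 3, 10, 6, 10, 10, 2, 9, 9, 9, 6, 4, 3]


def walk_to_target(start, lengths, target):
    """Advance by instruction lengths; return the offset from `start` of the
    instruction that begins exactly at `target`, or None if the walk misses it."""
    pc = start
    for length in lengths:
        if pc == target:
            return pc - start
        if pc > target:
            return None  # stepped past the target: not on an instruction boundary
        pc += length
    return None


offset = walk_to_target(FRAG_START, LENGTHS, TARGET)
app_pc = TAG + (offset - SANDBOX_PREFIX_LEN)

print(offset, hex(app_pc))
```

The walk lands on the target at offset 173, which is 4 bytes past the 169-byte sandboxing prefix, so the app pc is tag + 4 = 0x7f59e1093e1f, the same value reported by "found ok pc". Here the cached fragment really does contain the prefix, which is why this translation works and the failing ones do not.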

kuhanov commented 3 years ago

Hi, @derekbruening. We are now trying to use tools. We started with the default DynamoRIO tools, but most of them crash. Let's look at instrace_simple (with all fprintf calls in the tool disabled, because they produce SIGBUS), running a plain java invocation without any workload.

.bin64/drrun -disable_traces -c ./api/bin/libinstrace_simple.so -- java -XX:+ShowMessageBoxOnError

crash

Unexpected Error
------------------------------------------------------------------------------
SIGSEGV (0xb) at pc=0x00007f5d3355dadc, pid=2092212, tid=0x00007f5acd0ae700

Do you want to debug the problem?

To debug, run 'gdb /proc/2092212/exe 2092212'; then switch to thread 2092231 (0x00007f5acd0ae700)
Enter 'yes' to launch gdb automatically (PATH must include gdb)
Otherwise, press RETURN to abort...
==============================================================================
gdb /proc/2092212/exe 20922122092231
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f5d3355dadc, pid=2092212, tid=0x00007f5acd0ae700
#
# JRE version: OpenJDK Runtime Environment (8.0) (build 1.8.0-internal-debug-root_2021_07_19_10_14-b00)
# Java VM: OpenJDK 64-Bit Server VM (25.71-b00-debug mixed mode linux-amd64 compressed oops)
# Problematic frame:
# 0x00007f5d3355dadc V  [libjvm.so+0x7f2adc]  CodeHeap::add_to_freelist(HeapBlock*)+0x1c
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/huawei/builds/cronbuild-8.0.18887/projects/hs_err_pid2092212.log

stack

(gdb) bt
#0  0x00007f5cf33cef68 in ?? ()
#1  0x00007f5acd0ac910 in ?? ()
#2  0x00007f5d00000000 in ?? ()
#3  0x00007f5acd0ac980 in ?? ()
#4  0x0000000000000010 in ?? ()
#5  0x00007f5acd0ac9b0 in ?? ()
#6  0x00007f5d33916b76 in os::message_box (title=0x7f5d33ea790b "Unexpected Error",
    message=0x7f5d344be420 "SIGSEGV (0xb) at pc=0x00007f5d3355dadc, pid=2092212, tid=0x00007f5acd0ae700\n\nDo you want to debug the problem?\n\nTo debug, run 'gdb /proc/2092212/exe 2092212'; then switch to thread 2092231 (0x00007f5a"...) at /root/builds/kuhanov/openjdk8u/hotspot/src/os/linux/vm/os_linux.cpp:5516
#7  0x00007f5d33af1f65 in VMError::show_message_box (this=0x7f5acd0acbc0,
    buf=0x7f5d344be420 "SIGSEGV (0xb) at pc=0x00007f5d3355dadc, pid=2092212, tid=0x00007f5acd0ae700\n\nDo you want to debug the problem?\n\nTo debug, run 'gdb /proc/2092212/exe 2092212'; then switch to thread 2092231 (0x00007f5a"..., buflen=2000) at /root/builds/kuhanov/openjdk8u/hotspot/src/os/linux/vm/vmError_linux.cpp:53
#8  0x00007f5d33af104c in VMError::report_and_die (this=0x7f5acd0acbc0) at /root/builds/kuhanov/openjdk8u/hotspot/src/share/vm/utilities/vmError.cpp:955
#9  0x00007f5d3391bd59 in JVM_handle_linux_signal (sig=11, info=0x7f5acd0ace90, ucVoid=0x7f5acd0acd60, abort_if_unrecognized=1)
    at /root/builds/kuhanov/openjdk8u/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp:558
#10 0x00007f5d339146d9 in signalHandler (sig=11, info=0x7f5acd0ace90, uc=0x7f5acd0acd60) at /root/builds/kuhanov/openjdk8u/hotspot/src/os/linux/vm/os_linux.cpp:4588
#11 <signal handler called>
#12 0x00007f5d3355dadc in CodeHeap::add_to_freelist (this=0x71c5c56479c5fc64, a=0xc5d1df2941c4c7df)
    at /root/builds/kuhanov/openjdk8u/hotspot/src/share/vm/memory/heap.cpp:363
#13 0x00007f5d3355d4eb in CodeHeap::deallocate (this=0x71c5c56479c5fc64, p=0xf9c5160c6ff9c506) at /root/builds/kuhanov/openjdk8u/hotspot/src/share/vm/memory/heap.cpp:240

registers context for frame #12

(gdb) f 12
#12 0x00007f5d3355dadc in CodeHeap::add_to_freelist (this=0x71c5c56479c5fc64, a=0xc5d1df2941c4c7df)
    at /root/builds/kuhanov/openjdk8u/hotspot/src/share/vm/memory/heap.cpp:363
363     /root/builds/kuhanov/openjdk8u/hotspot/src/share/vm/memory/heap.cpp: No such file or directory.
(gdb) i r
rax            0x71c5c56479c5fc64  8198175732028275812
rbx            0x7f5accf8e040      140027962646592
rcx            0x0                 0
rdx            0xc5d1df2941c4c7df  -4192324409815152673
rsi            0xc5d1df2941c4c7df  -4192324409815152673
rdi            0x71c5c56479c5fc64  8198175732028275812
rbp            0x7f5acd0ad450      0x7f5acd0ad450
rsp            0x7f5acd0ad420      0x7f5acd0ad420
r8             0x7f5aec187000      140028484808704
r9             0x4                 4
r10            0x0                 0
r11            0x286               646
r12            0x1fecc7            2092231
r13            0x7f5d32d697cf      140038261610447
r14            0x7f5d32d698b0      140038261610672
r15            0x7f5acd0adfc0      140027963826112
rip            0x7f5d3355dadc      0x7f5d3355dadc <CodeHeap::add_to_freelist(HeapBlock*)+28>
eflags         0x10202             [ IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0

It looks like DRIO restores registers incorrectly when it adds instrumentation instructions to the bb:

CodeHeap::add_to_freelist (this=0x71c5c56479c5fc64, a=0xc5d1df2941c4c7df)
rax 0x71c5c56479c5fc64

As we understand it, there is a lazy algorithm for saving and restoring the registers that are used in instrumentation instructions. To check that, we made it always save and restore registers:

diff --git a/ext/drreg/drreg.c b/ext/drreg/drreg.c
index a711cbea..ff12a95f 100644
--- a/ext/drreg/drreg.c
+++ b/ext/drreg/drreg.c
@@ -951,7 +951,7 @@ drreg_reserve_reg_internal(void *drcontext, instrlist_t *ilist, instr_t *where,
     pt->reg[GPR_IDX(reg)].in_use = true;
     if (!already_spilled) {
         /* Even if dead now, we need to own a slot in case reserved past dead point */
-        if (ops.conservative ||
+        if (true || ops.conservative ||
             drvector_get_entry(&pt->reg[GPR_IDX(reg)].live, pt->live_idx) == REG_LIVE) {
             LOG(drcontext, DR_LOG_ALL, 3, "%s @%d." PFX ": spilling %s to slot %d\n",
                 __FUNCTION__, pt->live_idx, get_where_app_pc(where),
@@ -1236,7 +1236,7 @@ drreg_unreserve_register(void *drcontext, instrlist_t *ilist, instr_t *where,
         return DRREG_ERROR_INVALID_PARAMETER;
     LOG(drcontext, DR_LOG_ALL, 3, "%s @%d." PFX " %s\n", __FUNCTION__, pt->live_idx,
         get_where_app_pc(where), get_register_name(reg));
-    if (drmgr_current_bb_phase(drcontext) != DRMGR_PHASE_INSERTION) {
+    if (true || drmgr_current_bb_phase(drcontext) != DRMGR_PHASE_INSERTION) {
         /* We have no way to lazily restore.  We do not bother at this point
          * to try and eliminate back-to-back spill/restore pairs.
          */

With this change the crashes disappeared and we could collect tool statistics. Could you look at this issue on your side (the reproducer is very simple)? What could be missing in the register save/restore algorithm here? Thanks, Kirill

derekbruening commented 3 years ago

@sapostolakis has a tool that tries to systematically find register state errors such as from drreg that might be able to help here. Tracking these things down can be difficult. Here, one approach would be a binary search over blocks, turning the lazy restores on at N blocks and locating the problematic block that way, if the block sequence is deterministic.
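The binary search suggested above can be sketched as a small driver. This is only an illustration under stated assumptions: the callback type, `find_first_bad_block`, and `demo_crashes_after` are hypothetical names, not DynamoRIO or drreg APIs, and the search assumes that the block order is deterministic and that once a prefix of N lazily handled blocks crashes, every longer prefix crashes too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: run the workload with lazy restores enabled for the
 * first n blocks only (always-spill behavior for the rest) and report whether
 * the run crashed. A real harness would relaunch the app for each call. */
typedef bool (*crashes_with_lazy_prefix_t)(size_t n, void *user_data);

/* Returns the 0-based index of the first block whose lazy handling makes the
 * run crash, or total_blocks if even the fully lazy run does not crash. */
size_t
find_first_bad_block(size_t total_blocks, crashes_with_lazy_prefix_t crashes,
                     void *user_data)
{
    assert(crashes != NULL);
    if (!crashes(total_blocks, user_data))
        return total_blocks;
    size_t lo = 0;            /* a prefix of lo lazy blocks is known to be fine */
    size_t hi = total_blocks; /* a prefix of hi lazy blocks is known to crash */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (crashes(mid, user_data))
            hi = mid;
        else
            lo = mid;
    }
    /* Adding block hi-1 to the lazy prefix is what turns a good run bad. */
    return hi - 1;
}

/* Demo stand-in for a real run: pretends block *user_data is the broken one. */
bool
demo_crashes_after(size_t n, void *user_data)
{
    return n > *(const size_t *)user_data;
}
```

Each probe costs one full run of the workload, so locating one block among N takes roughly log2(N) runs. The lo = 0 starting point relies on the observation that the always-spill build does not crash.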

I would first suspect a bad interaction with something unique to Java vs normal apps (since we're using drreg on very large x86 apps and this code is fairly well exercised on regular apps): selfmod sandboxing. I wonder if there's some register usage by the sandboxing mangling that breaks drreg. Does disabling that mangling also solve the problem?

kuhanov commented 2 years ago

Does disabling that mangling also solve the problem?

No, disabling sandboxing didn't help here; we get the same crash.

.bin64/drrun -disable_traces -sandbox2ro_threshold 0 -ro2sandbox_threshold 0 -c ./api/bin/libinstrace_simple.so -- java -XX:+ShowMessageBoxOnError

Kirill

derekbruening commented 2 years ago

-sandbox2ro_threshold 0 -ro2sandbox_threshold 0 doesn't disable all sandboxing. -no_sandbox_writes partially does it; I think -no_hw_cache_consistency might be the only way to completely disable -- at the risk of incorrect execution if there is truly modified code.

kuhanov commented 2 years ago

-sandbox2ro_threshold 0 -ro2sandbox_threshold 0 doesn't disable all sandboxing. -no_sandbox_writes partially does it; I think -no_hw_cache_consistency might be the only way to completely disable -- at the risk of incorrect execution if there is truly modified code.

-no_sandbox_writes has the same issue; with -no_hw_cache_consistency there is no crash. Kirill

kuhanov commented 2 years ago

Hi @derekbruening, @AssadHashmi, @fhahn. We have now tried running Java workloads on AArch64 as we do on x86, but we get hangs on heavy runs (HelloWorld is OK). In the debugger we can see that all threads are waiting on a futex, plus one thread doing a futex wake (the count of woken threads is 0, and its futex address does not appear among the waiting threads, which is strange).

Thread dump, with x8 holding the syscall number SYS_futex (0x62) and x0-x3 holding the arguments uint32_t *uaddr, int futex_op, uint32_t val, const struct timespec *timeout, ...:

              pc              x8              x0              x1              x2              x3
  0xffff6da3d3b0            0x62  0xffffb0476700            0x80             0x2             0x0
  0xffff6c70f948            0x62  0xfffd682abd8c            0x80             0x0             0x0
  0xffff6c70f948            0x62  0xffffb0476700            0x80             0x2             0x0
  0xffff6d8b53b0            0x62  0xffffb0476700            0x80             0x2             0x0
  0xffff6d84d3b0            0x62  0xffffb0476700            0x80             0x2             0x0
  0xffff6d8253b0            0x62  0xffffb0476700            0x80             0x2             0x0
  0xffff6c70f948            0x62  0xffffb0476700            0x80             0x2             0x0
  0xffff6c70f948            0x62  0xffffb0476700            0x80             0x2             0x0
  0xffff6d7dd330            0x62  0xfffd6827ee88            0x80             0x0  0xfffc08e91088
  0xffff6d785330            0x62  0xfffd6827c288            0x80             0x0  0xfffc09092108
  0xffff6d68d3b0            0x62  0xfffd68279788            0x80             0x0  0xfffc09293188
  0xffff6d675330            0x62  0xfffd68276c88            0x80             0x0  0xfffc09494208
  0xffff6d65d328            0x62  0xffffb0476700            0x80             0x2             0x0
  0xffff6d645330            0x62  0xfffd68271588            0x80             0x0  0xfffc09896308
  0xffff6d62d330            0x62  0xfffd6826ea88            0x80             0x0  0xfffc09a97388
  0xffff6d6153b0            0x62  0xfffd6826bf88            0x80             0x0  0xfffc09c98008
  0xffff6d5fd330            0x62  0xfffd68261388            0x80             0x0  0xfffc09e99088
  0xffff6d5e5330            0x62  0xfffd6825e888            0x80             0x0  0xfffc0a09a108
  0xffff6d5cd330            0x62  0xfffd6825bd88            0x80             0x0  0xfffc0a29b188
  0xffff6d585330            0x62  0xfffd68258b88            0x80             0x0  0xfffc0a49c208
  0xffff6c70f948            0x62  0xffffb0452f80           0x189             0x0             0x0
  0xffff6c70f948            0x62  0xfffd6820298c            0x80             0x0             0x0
  0xffff6c70f948            0x62  0xfffd681fd18c            0x80             0x0             0x0
  0xffff6d385330            0x62  0xfffd681eee88            0x80             0x0  0xfffc0ac9cf78
  0xffff6c70f948            0x62  0xfffd680a6988            0x80             0x0             0x0
  0xffff6c70f948            0x62  0xfffd680a4988            0x80             0x0             0x0

Do you have any ideas about what could be wrong here? Where in the code could we investigate this? Is this in DR internals?

Sending SIGUSR2 to the process resumes the threads from the futex wait. Thanks, Kirill
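The futex_op column (x1) in the dump above packs a command and flag bits into one integer, so decoding it can help when reading such dumps. Below is a minimal sketch, assuming the constant values from Linux's linux/futex.h; they are redefined locally with an OP_ prefix so the snippet builds on any host, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Values mirror Linux <linux/futex.h>; local copies with an OP_ prefix. */
enum {
    OP_FUTEX_WAIT = 0,
    OP_FUTEX_WAKE = 1,
    OP_FUTEX_WAIT_BITSET = 9,
    OP_FUTEX_WAKE_BITSET = 10,
    OP_FUTEX_PRIVATE_FLAG = 128,
    OP_FUTEX_CLOCK_REALTIME = 256,
    OP_FUTEX_CMD_MASK = ~(OP_FUTEX_PRIVATE_FLAG | OP_FUTEX_CLOCK_REALTIME)
};

/* Hypothetical helper: name of the command part of futex_op.
 * Only the commands relevant to wait/wake analysis are named here. */
const char *
futex_cmd_name(int futex_op)
{
    switch (futex_op & OP_FUTEX_CMD_MASK) {
    case OP_FUTEX_WAIT: return "FUTEX_WAIT";
    case OP_FUTEX_WAKE: return "FUTEX_WAKE";
    case OP_FUTEX_WAIT_BITSET: return "FUTEX_WAIT_BITSET";
    case OP_FUTEX_WAKE_BITSET: return "FUTEX_WAKE_BITSET";
    default: return "FUTEX_OTHER";
    }
}

/* Hypothetical helper: writes "CMD|FLAG|..." into buf and returns the
 * length snprintf reports. */
int
futex_op_describe(int futex_op, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s%s%s", futex_cmd_name(futex_op),
                    (futex_op & OP_FUTEX_PRIVATE_FLAG) ? "|FUTEX_PRIVATE_FLAG" : "",
                    (futex_op & OP_FUTEX_CLOCK_REALTIME) ? "|FUTEX_CLOCK_REALTIME" : "");
}
```

Under these definitions, 0x80 decodes to FUTEX_WAIT|FUTEX_PRIVATE_FLAG and 0x189 to FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG|FUTEX_CLOCK_REALTIME. A wake would carry command value 1, for example 0x81 for a private FUTEX_WAKE.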