Open rkgithubs opened 5 years ago
translate_mcontext() fails, we try recreate_app_pc(), and it returns 0x0 here. The issue looks similar to #307.
Looks like #307 is specific to trace building: but I thought you were running with -disable_traces? #307 should be impossible with -disable_traces.
-disable_traces
yes, we use '-disable_traces' in all our runs under DynamoRIO.
To summarize: is it possible to skip the FAKE_TAG setting or not? Is the workflow still correct after that, or do we just spend more time on searching?
I believe removing that FAKE_TAG needs a corresponding change to add tag clearing once all threads have exited the cache in the later stages of shared fragment deletion. Without that it will keep missing in the table, causing potentially severe performance problems.
OK. So we will disable it in our experiments for now to avoid the crashes. It's unfortunate that performance suffers. Kirill
OK so the translation failure would be different from #307. Would it be possible to get further information on the failure? Is it inside selfmod mangling?
I wonder if you could disable this line in fragment_prepare_for_removal_from_table():
ftable->table[hindex].start_pc_fragment = pending_delete_pc;
and keep the line that disables the lookup:
ftable->table[hindex].tag_fragment = FAKE_TAG;
Then lookups are disabled and the only issue is a thread that just did the lookup and is about to jump to the target fragment in the cache. But that thread could already be inside the target anyway. I don't recall a scheme where a synch is done and a thread left alone at that point in the IBL while the fragment is truly deleted from the cache: the thread would be translated somewhere fresh.
Does that solve the crashes without affecting performance?
I filed the target_delete bug separately as #5061 to better track its fix
No crashes and no performance degradation with:
// ftable->table[hindex].start_pc_fragment = pending_delete_pc;
ftable->table[hindex].tag_fragment = FAKE_TAG;
Thanks a lot
Further details: the call chain is
master_signal_handler_C() -> record_pending_signal() -> translate_sigcontext() -> translate_mcontext() -> recreate_app_state() -> recreate_app_state_internal() -> recreate_app_state_from_ilist()
In the bad situation we walk through the instruction list of the fragment and get a miss:
recreate_app : looking for 0x00007f6f9d6fc327 in frag @ 0x00007f6f9d6fc31d (tag 0x00007f6d951c757c)
cache pc 0x00007f6f9d6fc31d vs 0x00007f6f9d6fc327
cache pc 0x00007f6f9d6fc326 vs 0x00007f6f9d6fc327
cache pc 0x00007f6f9d6fc330 vs 0x00007f6f9d6fc327
recreate_app -- WARNING: cache pc 0x00007f6f9d6fc330 != 0x00007f6f9d6fc327, probably prefix instruction
recreate_app -- found valid state pc 0x00007f6d951c757c
recreate_app -- found ok pc 0x00007f6d951c757c
BUT if we print the instructions from gdb for both frag @ 0x00007f6f9d6fc31d and tag 0x00007f6d951c757c, we can see that the step between instructions is not the same as what we have inside DynamoRIO (probably an incorrect instruction-length calculation: cache pc 0x00007f6f9d6fc31d is 'movabs $0x7f6fe08a1000,%r10' and DRIO calculates its size as 9 bytes BUT in reality it takes 10 bytes). And we catch 0x00007f6f9d6fc327 in the debugger:
(gdb) x /5i 0x00007f6f9d6fc31d
0x7f6f9d6fc31d: movabs $0x7f6fe08a1000,%r10
0x7f6f9d6fc327: test %eax,(%r10)
0x7f6f9d6fc32a: cmp %ebx,%r8d
0x7f6f9d6fc32d: jge 0x7f6f9d9a04cc
0x7f6f9d6fc333: jmpq 0x7f6f9d9a04cc
(gdb) x /5i 0x00007f6d951c757c
0x7f6d951c757c: movabs $0x7f6fe08a1000,%r10
0x7f6d951c7586: test %eax,(%r10)
0x7f6d951c7589: cmp %ebx,%r8d
0x7f6d951c758c: jge 0x7f6d951c75a9
0x7f6d951c758e: mov 0x10(%rdi),%r9
Kirill
If you could print the raw bytes for both of the movabs instructions (something I wish gdb would do by default... disas /r does it for a function) -- is there a non-standard prefix on the app version? And if possible, print the instr_t for the movabs instr in the recreated instrlist. Need to find what is causing the different encodings.
Is this fragment using stored translations, or did it recreate the list? (More log output should show this.) If it's stored translations this may be a bug in those and have nothing to do with the decoder/encoder.
is there a non-standard prefix on the app version?
(gdb) x /10i 0x00007fbf3678fbf1
0x7fbf3678fbf1: mov %rbp,0x10(%rbx)
0x7fbf3678fbf5: mov %rbx,%r10
0x7fbf3678fbf8: shr $0x9,%r10
0x7fbf3678fbfc: movabs $0x7f7f985af000,%r11
0x7fbf3678fc06: movb $0x0,(%r11,%r10,1)
0x7fbf3678fc0b: mov %rbx,%rax
0x7fbf3678fc0e: add $0x40,%rsp
0x7fbf3678fc12: pop %rbp
0x7fbf3678fc13: movabs $0x7fbf794af000,%r10
0x7fbf3678fc1d: test %eax,(%r10)
(gdb) x /44b 0x00007fbf3678fbf1
0x7fbf3678fbf1: 0x48 0x89 0x6b 0x10 0x4c 0x8b 0xd3 0x49
0x7fbf3678fbf9: 0xc1 0xea 0x09 0x49 0xbb 0x00 0xf0 0x5a
0x7fbf3678fc01: 0x98 0x7f 0x7f 0x00 0x00 0x43 0xc6 0x04
0x7fbf3678fc09: 0x13 0x00 0x48 0x8b 0xc3 0x48 0x83 0xc4
0x7fbf3678fc11: 0x40 0x5d 0x49 0xba 0x00 0xf0 0x4a 0x79
0x7fbf3678fc19: 0xbf 0x7f 0x00 0x00
(gdb) x /10i 0x00007fbd2d1c8e04
0x7fbd2d1c8e04: mov %rbp,0x10(%rbx)
0x7fbd2d1c8e08: mov %rbx,%r10
0x7fbd2d1c8e0b: shr $0x9,%r10
0x7fbd2d1c8e0f: movabs $0x7f7f985af000,%r11
0x7fbd2d1c8e19: movb $0x0,(%r11,%r10,1)
0x7fbd2d1c8e1e: mov %rbx,%rax
0x7fbd2d1c8e21: add $0x40,%rsp
0x7fbd2d1c8e25: pop %rbp
0x7fbd2d1c8e26: movabs $0x7fbf794af000,%r10
0x7fbd2d1c8e30: test %eax,(%r10)
(gdb) x /44b 0x7fbd2d1c8e04
0x7fbd2d1c8e04: 0x48 0x89 0x6b 0x10 0x4c 0x8b 0xd3 0x49
0x7fbd2d1c8e0c: 0xc1 0xea 0x09 0x49 0xbb 0x00 0xf0 0x5a
0x7fbd2d1c8e14: 0x98 0x7f 0x7f 0x00 0x00 0x43 0xc6 0x04
0x7fbd2d1c8e1c: 0x13 0x00 0x48 0x8b 0xc3 0x48 0x83 0xc4
0x7fbd2d1c8e24: 0x40 0x5d 0x49 0xba 0x00 0xf0 0x4a 0x79
0x7fbd2d1c8e2c: 0xbf 0x7f 0x00 0x00
instr in the recreated instrlist.
inst->bytes is a NULL pointer in the bad case: no bytes.
There are also cases where we calculate the size of the movabs correctly (and inst->bytes is NULL there too):
cache pc 0x00007fbb58cf79dd vs 0x00007fbb58cf79fd 2 0x00007fbb9b2788e7
cache pc 0x00007fbb58cf79df vs 0x00007fbb58cf79fd 3 0x00007fbb9b2788e9
cache pc 0x00007fbb58cf79e2 vs 0x00007fbb58cf79fd 4 0x00007fbb9b2788ec
**_cache pc 0x00007fbb58cf79e6 vs 0x00007fbb58cf79fd size 10 inst->bytes 0x0000000000000000_**
cache pc 0x00007fbb58cf79f0 vs 0x00007fbb58cf79fd 3 0x00007fbb9b2788f7
cache pc 0x00007fbb58cf79f3 vs 0x00007fbb58cf79fd 3 0x00007fbb9b2788fa
cache pc 0x00007fbb58cf79f6 vs 0x00007fbb58cf79fd 4 0x00007fbb9b2788fd
cache pc 0x00007fbb58cf79fa vs 0x00007fbb58cf79fd 3 0x00007fbb9b278901
cache pc 0x00007fbb58cf79fd vs 0x00007fbb58cf79fd 6 0x00007fbb9b278904
2 recreate_app -- found valid state pc 0x00007fbb9b278904
1 recreate_app -- found ok pc 0x00007fbb9b278904
(gdb) x /10i 0x00007fbb58cf79dd
0x7fbb58cf79dd: mov %eax,%eax
0x7fbb58cf79df: and %rbx,%rax
0x7fbb58cf79e2: mov %rax,-0x18(%rbp)
0x7fbb58cf79e6: movabs $0x7fbb9c59bf90,%rax
0x7fbb58cf79f0: mov (%rax),%rax
0x7fbb58cf79f3: mov %rax,%rdx
0x7fbb58cf79f6: mov -0x18(%rbp),%rax
0x7fbb58cf79fa: add %rdx,%rax
0x7fbb58cf79fd: movl $0x1,(%rax)
0x7fbb58cf7a03: add $0x28,%rsp
The problem happens when all instrs have a NULL pointer for bytes:
recreate_app : looking for 0x00007fbb598e42d8 in frag @ 0x00007fbb598e42b9 (tag 0x00007fb9511ce270)
(columns: size, inst->bytes)
cache pc 0x00007fbb598e42b9 vs 0x00007fbb598e42d8 9 0x0000000000000000
cache pc 0x00007fbb598e42c2 vs 0x00007fbb598e42d8 10 0x0000000000000000
cache pc 0x00007fbb598e42cc vs 0x00007fbb598e42d8 2 0x0000000000000000
cache pc 0x00007fbb598e42ce vs 0x00007fbb598e42d8 3 0x0000000000000000
cache pc 0x00007fbb598e42d1 vs 0x00007fbb598e42d8 2 0x0000000000000000
cache pc 0x00007fbb598e42d3 vs 0x00007fbb598e42d8 9 0x0000000000000000
cache pc 0x00007fbb598e42dc vs 0x00007fbb598e42d8 5 0x0000000000000000
recreate_app -- WARNING: cache pc 0x00007fbb598e42dc != 0x00007fbb598e42d8, probably prefix instruction
recreate_app -- invalid state: unsup=1 in-mangle=1 xl8=0x00007fb9511ce270 walk=0x00007fb9511ce270
recreate_app -- not able to fully recreate context, pc is in added instruction from mangling
1 recreate_app -- found ok pc 0x00007fb9511ce270
Is this fragment using stored translations, or did it recreate the list? (More log output should show this.) If it's stored translations this may be a bug in those and have nothing to do with the decoder/encoder.
Answering my own question: since you have an instrlist, it must not be using stored info, and you even listed recreate_app_state_from_ilist above.
Re: instr_t.bytes being NULL: I don't think that means much: for re-created-ilist app instrs that is probably what would be expected. It's instr_t.translation that would point to the original app encoding. For synthetic instrs there are cases that do not cache the encoding, so again bytes being NULL is not necessarily an indication that something is wrong.
I was asking to dump all the fields of the instr_t for the one that has the length of 9.
3 examples
(gdb) print *inst
$1 = {flags = 2149646336, encoding_hints = 0, length = 0, {bytes = 0x0, label_cb = 0x0}, translation = 0x7ffdb0df34cc "\305\370wALJ\b\003", opcode = 56, rip_rel_pos = 0 '\000', num_dsts = 1 '\001', num_srcs = 1 '\001', {{src0 = {
kind = 5 '\005', size = 0 '\000', aux = {far_pc_seg_selector = 0, segment = 0, disp = 0, shift = 0, flags = 0}, value = {immed_int = 2, immed_int_multi_part = {low = 2, high = 0}, immed_float = 2.80259693e-45,
immed_double = 9.8813129168249309e-324, pc = 0x2 <error: Cannot access memory at address 0x2>, instr = 0x2, reg = 2, base_disp = {disp = 2, base_reg = 0, index_reg = 0, scale = 0 '\000', encode_zero_disp = 0 '\000',
force_full_disp = 0 '\000', disp_short_addr = 0 '\000', index_reg_is_zmm = 0 '\000'}, addr = 0x2}}, srcs = 0x0, dsts = 0x7ffdb943da28}, label_data = {data = {5, 2, 0, 140727711685160}}}, prefixes = 0, eflags = 0,
note = 0x0, prev = 0x0, next = 0x7ffdb943ca70}
(gdb) x /i 0x7ffdb0df34cc
0x7ffdb0df34cc: vzeroupper
(gdb) x /i 0x00007fffb51fc781
0x7fffb51fc781: vzeroupper
(gdb) print len
$2 = 9
(gdb) x /3i 0x7fffb51fc781
0x7fffb51fc781: vzeroupper
0x7fffb51fc784: movl $0x5,0x308(%r15)
0x7fffb51fc78f: mov %r15d,%ecx
(gdb) x /4i 0x00007fffb4cc0ac1
0x7fffb4cc0ac1: mov %eax,-0x16000(%rsp)
0x7fffb4cc0ac8: push %rbp
0x7fffb4cc0ac9: mov %rsp,%rbp
0x7fffb4cc0acc: sub $0x10,%rsp
(gdb) x /4i 0x7ffdb0dd82b0
0x7ffdb0dd82b0: mov %eax,-0x16000(%rsp)
0x7ffdb0dd82b7: push %rbp
0x7ffdb0dd82b8: mov %rsp,%rbp
0x7ffdb0dd82bb: sub $0x10,%rsp
(gdb) print *inst
$3 = {flags = 2149646336, encoding_hints = 0, length = 0, {bytes = 0x0, label_cb = 0x0}, translation = 0x7ffdb0dd82b0 "\211\204$", opcode = 56, rip_rel_pos = 0 '\000',
num_dsts = 1 '\001', num_srcs = 1 '\001', {{src0 = {kind = 5 '\005', size = 0 '\000', aux = {far_pc_seg_selector = 0, segment = 0, disp = 0, shift = 0, flags = 0},
value = {immed_int = 2, immed_int_multi_part = {low = 2, high = 0}, immed_float = 2.80259693e-45, immed_double = 9.8813129168249309e-324,
pc = 0x2 <error: Cannot access memory at address 0x2>, instr = 0x2, reg = 2, base_disp = {disp = 2, base_reg = 0, index_reg = 0, scale = 0 '\000',
encode_zero_disp = 0 '\000', force_full_disp = 0 '\000', disp_short_addr = 0 '\000', index_reg_is_zmm = 0 '\000'}, addr = 0x2}}, srcs = 0x0,
dsts = 0x7ffdb87183c0}, label_data = {data = {5, 2, 0, 140727697900480}}}, prefixes = 0, eflags = 0, note = 0x0, prev = 0x0, next = 0x7ffdba712a68}
(gdb) print len
$4 = 9
(gdb) print *inst
$1 = {flags = 2149646336, encoding_hints = 0, length = 0, {bytes = 0x0, label_cb = 0x0}, translation = 0x7ffff6d78bbb "H\215\005\246ƺ", opcode = 57,
rip_rel_pos = 0 '\000', num_dsts = 1 '\001', num_srcs = 1 '\001', {{src0 = {kind = 1 '\001', size = 6 '\006', aux = {far_pc_seg_selector = 0, segment = 0, disp = 0,
shift = 0, flags = 0}, value = {immed_int = 140737346949736, immed_int_multi_part = {low = -141405592, high = 32767}, immed_float = -5.9355214e+33,
immed_double = 6.9533488214704906e-310, pc = 0x7ffff7925268 "", instr = 0x7ffff7925268, reg = 21096, base_disp = {disp = -141405592, base_reg = 255,
index_reg = 127, scale = 0 '\000', encode_zero_disp = 0 '\000', force_full_disp = 0 '\000', disp_short_addr = 0 '\000', index_reg_is_zmm = 0 '\000'},
addr = 0x7ffff7925268}}, srcs = 0x0, dsts = 0x7ffdb45e2bf8}, label_data = {data = {1537, 140737346949736, 0, 140727629523960}}}, prefixes = 0, eflags = 0,
note = 0x0, prev = 0x0, next = 0x7ffdb45e21e0}
(gdb) x /3i 0x7ffff6d78bbb
0x7ffff6d78bbb: lea 0xbac6a6(%rip),%rax # 0x7ffff7925268
0x7ffff6d78bc2: mov (%rax),%rax
0x7ffff6d78bc5: test %rax,%rax
(gdb) print len
$3 = 10
In addition, it looks like a bug in the third case: the opcode of the instruction = 57 (OP_mov_imm), but why not 61 (OP_lea)? And what is the difference between OP_mov_ld and OP_mov_st? In our cases the opcode is frequently OP_mov_st or OP_mov_ld when we see the DR crash.
I don't understand the output in the prior comment: the cases above where the size is wrong (9 instead of 10 bytes) involve movabs instructions like movabs $0x7fbb9c59bf90,%rax. But the 3 cases at https://github.com/DynamoRIO/dynamorio/issues/3733#issuecomment-909013554 are a vzeroupper, a store mov %eax,-0x16000(%rsp), and a lea: none of which seem related to the problem we're trying to debug?
These are the same issue. We have the same crash, the same incorrect size calculation, the same NULL bytes for instructions. Kirill
For example, in the lea instruction case gdb shows that it takes 7 bytes but the DynamoRIO calculation reports len=10. Kirill
What is this len variable -- what is the callstack? What are the raw instruction bytes for these cases?
len is computed inside recreate_app_state_from_ilist:
for (inst = instrlist_first(ilist); inst; inst = instr_get_next(inst)) {
int len = instr_length(tdcontext, inst);
What are the raw instruction bytes for these cases?
Or maybe we missed something there. Let us rerun the benchmark and prepare another sample. Kirill
DynamoRIO output:
cache pc 0x00007fffb4f44385 vs 0x00007fffb4f44394 INST_LEN = 9 ORIGINAL = 0x0000000000000000 THREAD = 0x00000000000052c0
cache pc 0x00007fffb4f4438e vs 0x00007fffb4f44394 INST_LEN = 10 ORIGINAL = 0x0000000000000000 THREAD = 0x00000000000052c0
cache pc 0x00007fffb4f44398 vs 0x00007fffb4f44394 INST_LEN = 2 ORIGINAL = 0x0000000000000000 THREAD = 0x00000000000052c0
Gdb output
0x7fffb4f44385: add $0x40,%rsp
0x7fffb4f44389: pop %rbp
0x7fffb4f4438a: movabs $0x7ffff7da8000,%r10
0x7fffb4f44394: test %eax,(%r10)
Raw bytes
0x7fffb4f44385: 0x48 0x83 0xc4 0x40 0x5d 0x49 0xba 0x00
0x7fffb4f4438d: 0x80 0xda 0xf7 0xff 0x7f 0x00 0x00 0x41
0x7fffb4f44395: 0x85 0x02
First suspicious instr:
$1 = {flags = 2149646336, encoding_hints = 0, length = 0, {bytes = 0x0, label_cb = 0x0}, translation = 0x7ffdb0d8cf69 "H\203\304@]I\272", opcode = 56, rip_rel_pos = 0 '\000', num_dsts = 1 '\001', num_srcs = 1 '\001', {{src0 = {
kind = 5 '\005', size = 0 '\000', aux = {far_pc_seg_selector = 0, segment = 0, disp = 0, shift = 0, flags = 0}, value = {immed_int = 2, immed_int_multi_part = {low = 2, high = 0}, immed_float = 2.80259693e-45,
immed_double = 9.8813129168249309e-324, pc = 0x2 <error: Cannot access memory at address 0x2>, instr = 0x2, reg = 2, base_disp = {disp = 2, base_reg = 0, index_reg = 0, scale = 0 '\000', encode_zero_disp = 0 '\000',
force_full_disp = 0 '\000', disp_short_addr = 0 '\000', index_reg_is_zmm = 0 '\000'}, addr = 0x2}}, srcs = 0x0, dsts = 0x7ffdb967a498}, label_data = {data = {5, 2, 0, 140727714030744}}}, prefixes = 0, eflags = 0,
note = 0x0, prev = 0x0, next = 0x7ffdb9679ac0}
operation code of instruction = 57 (OP_mov_imm) , but why not 61(OP_lea)?
A rip-rel lea is mangled into mov_imm when the rip-rel doesn't reach from the code cache. The issue may involve inconsistent mangling.
The other cases are all DR-inserted mangling (hence no bytes value):
(gdb) p /x 2149646336
$1 = 0x80210000
=> INSTR_OUR_MANGLING
56 == OP_mov_st; 2 == DR_REG_RCX
So a spill. Dest is out-of-line but presumably TLS.
So the recreated instrlist has extra/different instructions, and that's why the sizes do not match: it's looking at different instruction sequences? So it's not anything with decoder/encoder sizes; it's different instructions. Is it inconsistent rip-rel mangling, or is it more than that -- is this jitted app code where something changed but wasn't detected properly?
Dumping the full recreated instrlist would be helpful: instrlist_disassemble(), and comparing it to the full block's original app code and the full fragment in the cache.
One strange thing is that instrlist_disassemble doesn't stop at the jmp: +35 L4 @0x00007f531de07958 e9 1f e2 3b ef jmp $0x00007f530d1da134
recreate_app : looking for 0x00007f5517e73d58 in frag @ 0x00007f5517e73d49 (tag 0x00007f530d1da134)
TAG 0x00007f530d1da134
+0 m4 @0x00007f531de09cd0 65 48 89 0c 25 10 00 mov %rcx -> %gs:0x10[8byte]
00 00
+9 m4 @0x00007f531de08d30 48 b9 00 00 00 00 00 mov $0x0000000000000000 -> %rcx
00 00 00
+19 m4 @0x00007f531de08ac8 ff 01 inc (%rcx)[4byte] -> (%rcx)[4byte]
+21 m4 @0x00007f531de05378 83 39 14 cmp (%rcx)[4byte] $0x00000014
+24 m4 @0x00007f531de07048 7c fe jl @0x00007f531de075f0[8byte]
+26 m4 @0x00007f531de08cb0 65 48 8b 0c 25 10 00 mov %gs:0x10[8byte] -> %rcx
00 00
+35 L4 @0x00007f531de07958 e9 1f e2 3b ef jmp $0x00007f530d1da134
+40 m4 @0x00007f531de075f0 <label>
+40 m4 @0x00007f531de07d28 65 48 89 34 25 08 00 mov %rsi -> %gs:0x08[8byte]
00 00
+49 m4 @0x00007f5320906040 65 48 89 3c 25 18 00 mov %rdi -> %gs:0x18[8byte]
00 00
+58 m4 @0x00007f531de04b58 48 be 34 a1 1d 0d 53 mov $0x00007f530d1da134 -> %rsi
7f 00 00
+68 m4 @0x00007f53209060a8 48 bf 34 a1 1d 0d 53 mov $0x00007f530d1da134 -> %rdi
7f 00 00
+78 m4 @0x00007f531de08010 a6 cmps %ds:(%rsi)[1byte] %es:(%rdi)[1byte] %rsi %rdi -> %rsi %rdi
+79 m4 @0x00007f531de04e28 0f 85 fa ff ff ff jnz @0x00007f531de05ce8[8byte]
+85 m4 @0x00007f531de08578 48 b9 34 a1 1d 0d 53 mov $0x00007f530d1da134 -> %rcx
7f 00 00
+95 m4 @0x00007f531de08960 48 3b f1 cmp %rsi %rcx
+98 m4 @0x00007f531de061a0 48 b9 12 00 00 00 00 mov $0x0000000000000012 -> %rcx
00 00 00
+108 m4 @0x00007f531de05ed0 0f 8d fa ff ff ff jnl @0x00007f531de09ad0[8byte]
+114 m4 @0x00007f531de04d58 48 bf 46 a1 1d 0d 53 mov $0x00007f530d1da146 -> %rdi
7f 00 00
+124 m4 @0x00007f531de098e8 48 be 46 a1 1d 0d 53 mov $0x00007f530d1da146 -> %rsi
7f 00 00
+134 m4 @0x00007f531de09ad0 <label>
+134 m4 @0x00007f531de05b00 f3 a6 rep cmps %ds:(%rsi)[1byte] %es:(%rdi)[1byte] %rsi %rdi %rcx -> %rsi %rdi %rcx
+136 m4 @0x00007f531de05ce8 <label>
+136 m4 @0x00007f531de06fc8 65 48 8b 0c 25 10 00 mov %gs:0x10[8byte] -> %rcx
00 00
+145 m4 @0x00007f531de084f8 65 48 8b 34 25 08 00 mov %gs:0x08[8byte] -> %rsi
00 00
+154 m4 @0x00007f531de086e0 65 48 8b 3c 25 18 00 mov %gs:0x18[8byte] -> %rdi
00 00
+163 L4 @0x00007f531de07288 0f 85 1e e2 3b ef jnz $0x00007f530d1da134
+169 L3 @0x00007f531de09e38 48 83 c4 30 add $0x0000000000000030 %rsp -> %rsp
+173 L3 @0x00007f531de05560 5d pop %rsp (%rsp)[8byte] -> %rbp %rsp
+174 L3 @0x00007f531de054f8 49 ba 00 90 cb 5a 55 mov $0x00007f555acb9000 -> %r10
7f 00 00
+184 L3 @0x00007f53209062a8 41 85 02 test (%r10)[4byte] %eax
+187 m4 @0x00007f531de07188 65 48 89 0c 25 10 00 mov %rcx -> %gs:0x10[8byte]
00 00
+196 m4 @0x00007f531de07ad8 59 pop %rsp (%rsp)[8byte] -> %rcx %rsp
+197 L4 @0x00007f531de088e0 e9 eb 08 ce f8 jmp $0x00007f5516afc800 <shared_bb_ibl_ret>
END 0x00007f530d1da134
cache pc 0x00007f5517e73d49 vs 0x00007f5517e73d58 9 0x0000000000000000
cache pc 0x00007f5517e73d52 vs 0x00007f5517e73d58 10 0x0000000000000000
cache pc 0x00007f5517e73d5c vs 0x00007f5517e73d58 2 0x0000000000000000
recreate_app -- WARNING: cache pc 0x00007f5517e73d5c != 0x00007f5517e73d58, probably prefix instruction
2 recreate_app -- found valid state pc 0x00007f530d1da134
1 recreate_app -- found ok pc 0x00007f530d1da134
(gdb) x /34i 0x00007f5517e73d49
0x7f5517e73d49: add $0x30,%rsp
0x7f5517e73d4d: pop %rbp
0x7f5517e73d4e: movabs $0x7f555acb9000,%r10
0x7f5517e73d58: test %eax,(%r10)
0x7f5517e73d5b: mov %rcx,%gs:0x10
0x7f5517e73d64: pop %rcx
0x7f5517e73d65: jmpq 0x7f5516afc8ab
0x7f5517e73d6a: and %bh,%ah
0x7f5517e73d6c: rex.W stos %al,%es:(%rdi)
0x7f5517e73d6e: xchg %ah,(%rax)
0x7f5517e73d70: push %rbx
0x7f5517e73d71: jg 0x7f5517e73d73
0x7f5517e73d73: add %ah,0x65(%rdi)
0x7f5517e73d76: movabs 0xc8b486500000000,%rax
0x7f5517e73d80: and $0x10,%eax
0x7f5517e73d85: pop %r11
0x7f5517e73d87: pop %rsp
0x7f5517e73d88: pop %r10
0x7f5517e73d8a: pop %r9
0x7f5517e73d8c: pop %r8
0x7f5517e73d8e: pop %rcx
0x7f5517e73d8f: pop %rdx
0x7f5517e73d90: pop %rsi
0x7f5517e73d91: pop %rdi
0x7f5517e73d92: cmp %rax,%r15
0x7f5517e73d95: je 0x7f5517ac64c1
0x7f5517e73d9b: jmpq 0x7f5517ac64c1
0x7f5517e73da0: shrb $0x0,0x7f532086(%rdx)
0x7f5517e73da7: add %ah,0x65(%rdi)
0x7f5517e73daa: movabs 0xc8b486500000000,%rax
0x7f5517e73db4: and $0x10,%eax
0x7f5517e73db9: pop %rax
0x7f5517e73dba: movabs $0x0,%r10
0x7f5517e73dc4: mov %r10,0x270(%r15)
(gdb) x /200b 0x00007f5517e73d49
0x7f5517e73d49: 0x48 0x83 0xc4 0x30 0x5d 0x49 0xba 0x00
0x7f5517e73d51: 0x90 0xcb 0x5a 0x55 0x7f 0x00 0x00 0x41
0x7f5517e73d59: 0x85 0x02 0x65 0x48 0x89 0x0c 0x25 0x10
0x7f5517e73d61: 0x00 0x00 0x00 0x59 0xe9 0x41 0x8b 0xc8
0x7f5517e73d69: 0xfe 0x20 0xfc 0x48 0xaa 0x86 0x20 0x53
0x7f5517e73d71: 0x7f 0x00 0x00 0x67 0x65 0x48 0xa1 0x00
0x7f5517e73d79: 0x00 0x00 0x00 0x65 0x48 0x8b 0x0c 0x25
0x7f5517e73d81: 0x10 0x00 0x00 0x00 0x41 0x5b 0x5c 0x41
0x7f5517e73d89: 0x5a 0x41 0x59 0x41 0x58 0x59 0x5a 0x5e
0x7f5517e73d91: 0x5f 0x4c 0x3b 0xf8 0x0f 0x84 0x26 0x27
0x7f5517e73d99: 0xc5 0xff 0xe9 0x21 0x27 0xc5 0xff 0xc0
0x7f5517e73da1: 0xaa 0x86 0x20 0x53 0x7f 0x00 0x00 0x67
0x7f5517e73da9: 0x65 0x48 0xa1 0x00 0x00 0x00 0x00 0x65
0x7f5517e73db1: 0x48 0x8b 0x0c 0x25 0x10 0x00 0x00 0x00
0x7f5517e73db9: 0x58 0x49 0xba 0x00 0x00 0x00 0x00 0x00
0x7f5517e73dc1: 0x00 0x00 0x00 0x4d 0x89 0x97 0x70 0x02
0x7f5517e73dc9: 0x00 0x00 0x49 0xba 0x00 0x00 0x00 0x00
0x7f5517e73dd1: 0x00 0x00 0x00 0x00 0x4d 0x89 0x97 0x80
0x7f5517e73dd9: 0x02 0x00 0x00 0x49 0xba 0x00 0x00 0x00
0x7f5517e73de1: 0x00 0x00 0x00 0x00 0x00 0x4d 0x89 0x97
0x7f5517e73de9: 0x78 0x02 0x00 0x00 0x49 0x81 0x7f 0x08
0x7f5517e73df1: 0x00 0x00 0x00 0x00 0x0f 0x84 0xf4 0x26
0x7f5517e73df9: 0xc5 0xff 0xe9 0xef 0x26 0xc5 0xff 0x28
0x7f5517e73e01: 0x0e 0x6f 0x1a 0x53 0x7f 0x00 0x00 0x04
0x7f5517e73e09: 0x7f 0x9e 0x67 0x65 0x48 0xa1 0x00 0x00
(gdb) x /34i 0x00007f530d1da134
0x7f530d1da134: add $0x30,%rsp
0x7f530d1da138: pop %rbp
0x7f530d1da139: movabs $0x7f555acb9000,%r10
0x7f530d1da143: test %eax,(%r10)
0x7f530d1da146: retq
0x7f530d1da147: test %rdx,%rdx
0x7f530d1da14a: je 0x7f530d1da230
0x7f530d1da150: mov (%rdx),%r10
0x7f530d1da153: mov %r10,%r11
0x7f530d1da156: and $0x7,%r11
0x7f530d1da15a: cmp $0x1,%r11
0x7f530d1da15e: jne 0x7f530d1da237
0x7f530d1da164: shr $0x8,%r10
0x7f530d1da168: mov %r10d,%eax
0x7f530d1da16b: and $0x7fffffff,%eax
0x7f530d1da171: test %eax,%eax
0x7f530d1da173: je 0x7f530d1da237
0x7f530d1da179: mov 0x8(%rsp),%r10
0x7f530d1da17e: mov 0x20(%r10),%r10
0x7f530d1da182: mov 0x10(%r10),%r11d
0x7f530d1da186: and $0x7fffffff,%eax
0x7f530d1da18c: test %r11d,%r11d
0x7f530d1da18f: je 0x7f530d1da2a1
0x7f530d1da195: cmp $0x80000000,%eax
0x7f530d1da19a: jne 0x7f530d1da1a4
0x7f530d1da19c: xor %edx,%edx
0x7f530d1da19e: cmp $0xffffffff,%r11d
0x7f530d1da1a2: je 0x7f530d1da1a8
0x7f530d1da1a4: cltd
0x7f530d1da1a5: idiv %r11d
0x7f530d1da1a8: mov 0x18(%r10,%rdx,4),%eax
0x7f530d1da1ad: test %eax,%eax
0x7f530d1da1af: jl 0x7f530d1da226
0x7f530d1da1b5: mov 0x8(%rsp),%r10
(gdb) x /200b 0x00007f530d1da134
0x7f530d1da134: 0x48 0x83 0xc4 0x30 0x5d 0x49 0xba 0x00
0x7f530d1da13c: 0x90 0xcb 0x5a 0x55 0x7f 0x00 0x00 0x41
0x7f530d1da144: 0x85 0x02 0xc3 0x48 0x85 0xd2 0x0f 0x84
0x7f530d1da14c: 0xe0 0x00 0x00 0x00 0x4c 0x8b 0x12 0x4d
0x7f530d1da154: 0x8b 0xda 0x49 0x83 0xe3 0x07 0x49 0x83
0x7f530d1da15c: 0xfb 0x01 0x0f 0x85 0xd3 0x00 0x00 0x00
0x7f530d1da164: 0x49 0xc1 0xea 0x08 0x41 0x8b 0xc2 0x81
0x7f530d1da16c: 0xe0 0xff 0xff 0xff 0x7f 0x85 0xc0 0x0f
0x7f530d1da174: 0x84 0xbe 0x00 0x00 0x00 0x4c 0x8b 0x54
0x7f530d1da17c: 0x24 0x08 0x4d 0x8b 0x52 0x20 0x45 0x8b
0x7f530d1da184: 0x5a 0x10 0x81 0xe0 0xff 0xff 0xff 0x7f
0x7f530d1da18c: 0x45 0x85 0xdb 0x0f 0x84 0x0c 0x01 0x00
0x7f530d1da194: 0x00 0x3d 0x00 0x00 0x00 0x80 0x75 0x08
0x7f530d1da19c: 0x33 0xd2 0x41 0x83 0xfb 0xff 0x74 0x04
0x7f530d1da1a4: 0x99 0x41 0xf7 0xfb 0x41 0x8b 0x44 0x92
0x7f530d1da1ac: 0x18 0x85 0xc0 0x0f 0x8c 0x71 0x00 0x00
0x7f530d1da1b4: 0x00 0x4c 0x8b 0x54 0x24 0x08 0x4d 0x8b
0x7f530d1da1bc: 0x4a 0x30 0x41 0x8b 0x49 0x10 0x3b 0xc1
0x7f530d1da1c4: 0x0f 0x83 0x83 0x00 0x00 0x00 0x4d 0x8b
0x7f530d1da1cc: 0x54 0xc1 0x18 0x4c 0x3b 0x54 0x24 0x10
0x7f530d1da1d4: 0x0f 0x84 0x5a 0xff 0xff 0xff 0x4c 0x8b
0x7f530d1da1dc: 0x54 0x24 0x08 0x49 0x8b 0x6a 0x28 0x44
0x7f530d1da1e4: 0x8b 0x55 0x10 0x41 0x3b 0xc2 0x0f 0x83
0x7f530d1da1ec: 0x96 0x00 0x00 0x00 0x8b 0x44 0x85 0x18
0x7f530d1da1f4: 0x85 0xc0 0x7c 0x2e 0x49 0xbb 0x00 0x90
It looks like the ilist has a set of incorrect prepended instructions, because instrlist_disassemble() matches the real cache instructions only starting from:
+169 L3 @0x00007f531de09e38 48 83 c4 30 add $0x0000000000000030 %rsp -> %rsp
+173 L3 @0x00007f531de05560 5d pop %rsp (%rsp)[8byte] -> %rbp %rsp
+174 L3 @0x00007f531de054f8 49 ba 00 90 cb 5a 55 mov $0x00007f555acb9000 -> %r10
7f 00 00
+184 L3 @0x00007f53209062a8 41 85 02 test (%r10)[4byte] %eax
+187 m4 @0x00007f531de07188 65 48 89 0c 25 10 00 mov %rcx -> %gs:0x10[8byte]
00 00
+196 m4 @0x00007f531de07ad8 59 pop %rsp (%rsp)[8byte] -> %rcx %rsp
+197 L4 @0x00007f531de088e0 e9 eb 08 ce f8 jmp $0x00007f5516afc800 <shared_bb_ibl_ret>
Looks like mangle_bb_ilist added an extra 169 bytes of instructions at the front. So when we walk the ilist, we start from the OLD address frag @ 0x00007f5517e73d49 but use the mangled instruction lengths. Kirill
This:
recreate_app : looking for 0x00007f5517e73d58 in frag @ 0x00007f5517e73d49 (tag 0x00007f530d1da134)
TAG 0x00007f530d1da134
+0 m4 @0x00007f531de09cd0 65 48 89 0c 25 10 00 mov %rcx -> %gs:0x10[8byte]
00 00
+9 m4 @0x00007f531de08d30 48 b9 00 00 00 00 00 mov $0x0000000000000000 -> %rcx
00 00 00
+19 m4 @0x00007f531de08ac8 ff 01 inc (%rcx)[4byte] -> (%rcx)[4byte]
+21 m4 @0x00007f531de05378 83 39 14 cmp (%rcx)[4byte] $0x00000014
+24 m4 @0x00007f531de07048 7c fe jl @0x00007f531de075f0[8byte]
+26 m4 @0x00007f531de08cb0 65 48 8b 0c 25 10 00 mov %gs:0x10[8byte] -> %rcx
00 00
+35 L4 @0x00007f531de07958 e9 1f e2 3b ef jmp $0x00007f530d1da134
+40 m4 @0x00007f531de075f0 <label>
+40 m4 @0x00007f531de07d28 65 48 89 34 25 08 00 mov %rsi -> %gs:0x08[8byte]
00 00
...
Looks like the mangling added for selfmod sandboxing. -sandbox2ro_threshold is 20==0x14. (Note all the "m4": that's a meta level 4 instr so not coming from the app but from DR's added mangling.)
So the problem is the recreate thinks there should be sandboxing mangling while the actual fragment in the code cache does not have such mangling? Look at the logs around this sandbox2ro threshold and the swapping between sandboxing code for writable app pages and marking app pages read-only. What are the page protections on this app code? What happens w/ -sandbox2ro_threshold 0 -ro2sandbox_threshold 0?
One strange thing is that instrlist_disassemble doesn't stop at the jmp: +35 L4 @0x00007f531de07958 e9 1f e2 3b ef jmp $0x00007f530d1da134
I believe that's marked as an app instr to create an exit from the fragment if the threshold is reached: so it's really a synthetic jump that's not part of the original app code.
So the problem is the recreate thinks there should be sandboxing mangling while the actual fragment in the code cache does not have such mangling?
Yes. For example, for this one we have the sandboxing instructions in the recreated ilist, but if I dump the bytes before code cache address 0x00007f809098a5ed they are not the same.
(gdb) x /300b (0x00007f809098a5ed-176)
0x7f809098a53d: 0x00 0x65 0x48 0x89 0x0c 0x25 0x10 0x00
0x7f809098a545: 0x00 0x00 0x49 0x8b 0xca 0x68 0xbd 0x71
0x7f809098a54d: 0x67 0x8c 0xc7 0x44 0x24 0x04 0x7e 0x7f
0x7f809098a555: 0x00 0x00 0xe9 0x83 0x74 0xcb 0xfe 0x30
0x7f809098a55d: 0x67 0x1c 0x93 0x7e 0x7f 0x00 0x00 0x67
0x7f809098a565: 0x65 0x48 0xa1 0x00 0x00 0x00 0x00 0x65
0x7f809098a56d: 0x48 0x8b 0x0c 0x25 0x10 0x00 0x00 0x00
0x7f809098a575: 0x41 0x5b 0x5c 0x41 0x5a 0x41 0x59 0x41
0x7f809098a57d: 0x58 0x59 0x5a 0x5e 0x5f 0x4c 0x3b 0xf8
0x7f809098a585: 0x0f 0x84 0x5a 0xda 0xf6 0xff 0xe9 0x55
0x7f809098a58d: 0xda 0xf6 0xff 0x70 0xc1 0x22 0x95 0x7e
0x7f809098a595: 0x7f 0x00 0x00 0x67 0x65 0x48 0xa1 0x00
0x7f809098a59d: 0x00 0x00 0x00 0x65 0x48 0x8b 0x0c 0x25
0x7f809098a5a5: 0x10 0x00 0x00 0x00 0x66 0x8b 0x47 0x08
0x7f809098a5ad: 0x66 0x89 0x46 0x08 0x49 0xba 0x98 0x5d
0x7f809098a5b5: 0x37 0xd3 0x80 0x7f 0x00 0x00 0x41 0xff
0x7f809098a5bd: 0x02 0x48 0x33 0xc0 0xc9 0x65 0x48 0x89
0x7f809098a5c5: 0x0c 0x25 0x10 0x00 0x00 0x00 0x59 0xe9
0x7f809098a5cd: 0xda 0x72 0xcb 0xfe 0x30 0x00 0x00 0xb0
0x7f809098a5d5: 0xc1 0x22 0x95 0x7e 0x7f 0x00 0x00 0x67
0x7f809098a5dd: 0x65 0x48 0xa1 0x00 0x00 0x00 0x00 0x65
0x7f809098a5e5: 0x48 0x8b 0x0c 0x25 0x10 0x00 0x00 0x00
0x7f809098a5ed: 0x48 0x83 0xc4 0x30 0x5d 0x49 0xba 0x00
0x7f809098a5f5: 0xe0 0x7f 0xd3 0x80 0x7f 0x00 0x00 0x41
0x7f809098a5fd: 0x85 0x02 0x65 0x48 0x89 0x0c 0x25 0x10
0x7f809098a605: 0x00 0x00 0x00 0x59 0xe9 0x9d 0x72 0xcb
0x7f809098a60d: 0xfe 0x6d 0x90 0x90 0x2d 0x47 0x99 0x7e
recreate_app : looking for 0x00007f809098a5fc in frag @ 0x00007f809098a5ed (tag 0x00007f7e8c7f1938)
recreate_app : pc is in F(0x00007f7e8c7f1938)
TAG 0x00007f7e8c7f1938
+0 m4 @0x00007f7e95cb0518 opcode=56 65 48 89 0c 25 10 00 mov %rcx -> %gs:0x10[8byte]
00 00
+9 m4 @0x00007f7e95cac158 opcode=57 48 b9 00 00 00 00 00 mov $0x0000000000000000 -> %rcx
00 00 00
+19 m4 @0x00007f7e95cafd38 opcode=16 ff 01 inc (%rcx)[4byte] -> (%rcx)[4byte]
+21 m4 @0x00007f7e994ef0c0 opcode=14 83 39 14 cmp (%rcx)[4byte] $0x00000014
+24 m4 @0x00007f7e95cae2a8 opcode=38 7c fe jl @0x00007f7e95cad870[8byte]
+26 m4 @0x00007f7e95cad708 opcode=55 65 48 8b 0c 25 10 00 mov %gs:0x10[8byte] -> %rcx
00 00
+35 L4 @0x00007f7e95cabff0 opcode=46 e9 33 ea b2 f6 jmp $0x00007f7e8c7f1938
+40 m4 @0x00007f7e95cad870 opcode=3 <label>
+40 m4 @0x00007f7e95cabdc0 opcode=56 65 48 89 34 25 08 00 mov %rsi -> %gs:0x08[8byte]
00 00
+49 m4 @0x00007f7e95cac1c0 opcode=56 65 48 89 3c 25 18 00 mov %rdi -> %gs:0x18[8byte]
00 00
+58 m4 @0x00007f7e95cb0dd0 opcode=57 48 be 38 19 7f 8c 7e mov $0x00007f7e8c7f1938 -> %rsi
7f 00 00
+68 m4 @0x00007f7e95cac0d8 opcode=57 48 bf 38 19 7f 8c 7e mov $0x00007f7e8c7f1938 -> %rdi
7f 00 00
+78 m4 @0x00007f7e95cac7c0 opcode=393 a6 cmps %ds:(%rsi)[1byte] %es:(%rdi)[1byte] %rsi %rdi -> %rsi %rdi
+79 m4 @0x00007f7e95cad2e8 opcode=157 0f 85 fa ff ff ff jnz @0x00007f7e95cabe28[8byte]
+85 m4 @0x00007f7e95cae1d8 opcode=57 48 b9 38 19 7f 8c 7e mov $0x00007f7e8c7f1938 -> %rcx
7f 00 00
+95 m4 @0x00007f7e95cacaa8 opcode=14 48 3b f1 cmp %rsi %rcx
+98 m4 @0x00007f7e95cafb38 opcode=57 48 b9 12 00 00 00 00 mov $0x0000000000000012 -> %rcx
00 00 00
+108 m4 @0x00007f7e95cabc28 opcode=165 0f 8d fa ff ff ff jnl @0x00007f7e95cad8f0[8byte]
+114 m4 @0x00007f7e95cadf28 opcode=57 48 bf 4a 19 7f 8c 7e mov $0x00007f7e8c7f194a -> %rdi
7f 00 00
+124 m4 @0x00007f7e95cad808 opcode=57 48 be 4a 19 7f 8c 7e mov $0x00007f7e8c7f194a -> %rsi
7f 00 00
+134 m4 @0x00007f7e95cad8f0 opcode=3 <label>
+134 m4 @0x00007f7e95cae240 opcode=394 f3 a6 rep cmps %ds:(%rsi)[1byte] %es:(%rdi)[1byte] %rsi %rdi %rcx -> %rsi %rdi %rcx
+136 m4 @0x00007f7e95cabe28 opcode=3 <label>
+136 m4 @0x00007f7e95cad608 opcode=55 65 48 8b 0c 25 10 00 mov %gs:0x10[8byte] -> %rcx
00 00
+145 m4 @0x00007f7e95cabb58 opcode=55 65 48 8b 34 25 08 00 mov %gs:0x08[8byte] -> %rsi
00 00
+154 m4 @0x00007f7e95cad588 opcode=55 65 48 8b 3c 25 18 00 mov %gs:0x18[8byte] -> %rdi
00 00
+163 L4 @0x00007f7e95cac4f0 opcode=157 0f 85 32 ea b2 f6 jnz $0x00007f7e8c7f1938
+169 L3 @0x00007f7e95cade10 opcode=4 48 83 c4 30 add $0x0000000000000030 %rsp -> %rsp
+173 L3 @0x00007f7e95cb0968 opcode=20 5d pop %rsp (%rsp)[8byte] -> %rbp %rsp
+174 L3 @0x00007f7e95cacd90 opcode=57 49 ba 00 e0 7f d3 80 mov $0x00007f80d37fe000 -> %r10
7f 00 00
+184 L3 @0x00007f7e95cac488 opcode=60 41 85 02 test (%r10)[4byte] %eax
+187 m4 @0x00007f7e95cabbc0 opcode=56 65 48 89 0c 25 10 00 mov %rcx -> %gs:0x10[8byte]
00 00
+196 m4 @0x00007f7e994ef590 opcode=20 59 pop %rsp (%rsp)[8byte] -> %rcx %rsp
+197 L4 @0x00007f7e95cabd58 opcode=46 e9 fb e8 97 f9 jmp $0x00007f808f641800 <shared_bb_ibl_ret>
END 0x00007f7e8c7f1938
What are the page protections on this app code?
tag 0x00007f7e8c7f1938
cat /proc/401468/maps
...
7f7e8c62e000-7f7e8c89e000 rwxp 00000000 00:00 0
...
What happens w/ -sandbox2ro_threshold 0 -ro2sandbox_threshold 0 ?
No crashes with these options. Kirill
We need to figure out the timing here: was this fragment flushed (for an ro2sandbox transition) but not yet fully deleted and the translation request came in for the half-deleted fragment after the page was made writable?
Xref -safe_translate_flushed
which IIRC was supposed to solve such issues but never enabled by default b/c of performance problems. Or is the sandboxing decision supposed to come from the fragment flags and not the vmareas? Not remembering the details.
Xref
-safe_translate_flushed
In my case, this option doesn't work at all. I get a hang at the beginning of the benchmark run with 2 java threads.
(gdb) info threads
  Id   Target Id            Frame
* 1    LWP 466688 "java"   0x00007f67f4225b82 in ?? ()
  2    LWP 466689 "java"   0x00007f683854a7ea in ?? ()
Kirill
Hi, @derekbruening
Should we investigate anything else? Or could we use the -sandbox2ro_threshold 0 -ro2sandbox_threshold 0 options? Is that ok?
Thanks, Kirill
Disabling those parameters (by setting to 0) should work correctly but may have extra overhead.
It would be good to confirm that the problem with those parameters being enabled is indeed a half-deleted fragment: if logs are available, look for an entry for the fragment with the translation problem being unlinked or other steps toward deletion prior to the translation issue.
Could not enable full logging because reproducing in debug mode takes a huge amount of time. Instead I tried adding a log in fragment_prepare_for_removal_from_table:
dr_fprintf(STDERR,
"fragment_prepare_for_removal_from_table: remove frag @@" PFX " (tag " PFX ")\n",
f->start_pc, f->tag);
bad fragment
recreate_app : looking for 0x00007ff8245da446 in frag @ 0x00007ff8245da41d (tag 0x00007ff6202cf0e2)
there are no "fragment_prepare_for_removal_from_table: remove frag @@" logs with 0x00007ff8245da41d.
Or am I wrong? Do I need to add anything else to catch the removal? Kirill
I think adding logging in fragment_unlink_for_deletion() would be better since these may not be indirect branch targets.
Still no removal logged.
For example,
recreate_app : looking for 0x00007fc205ed319b in frag @ 0x00007fc205ed3191
(tag 0x00007fbffd1cdfbc)
there are a few fragment_unlink_for_deletion entries for this tag for other fragments, but not for 0x00007fc205ed3191:
fragment_unlink_for_deletion: remove frag @ 0x00007fc205e9ff48 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205f1bcc8 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ec2d14 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ebca1d (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205f474e4 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ee57b4 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ee1da8 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ea3314 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ee8bcc (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ecfbc8 (tag 0x00007fbffd1cdfbc)
fragment_unlink_for_deletion: remove frag @ 0x00007fc205ed3180 (tag 0x00007fbffd1cdfbc)
Kirill
Just an observation: this is the first fragment (among the fragments rebuilt when we handle signals) whose bb->flags include FRAG_SELFMOD_SANDBOXED
master_signal_handler_C()
record_pending_signal()
translate_sigcontext()
translate_mcontext()
recreate_app_state()
recreate_app_state_internal()
recreate_fragment_ilist()
recreate_bb_ilist()
build_bb_ilist()
check_new_page_start()
check_thread_vm_area() -> sets flags to FRAG_SELFMOD_SANDBOXED
Kirill
This is 1st fragment (for fragments when we handle signals) where bb->flags include FRAG_SELFMOD_SANDBOXED
But that flag was not set when the fragment was first created -- we need to figure out how the vmarea had its flags changed without flushing all the fragments inside (since we already looked for this fragment being partially deleted from being flushed). If -logmask LOG_VMAREAS
is too much output I guess targeted logs on vmareas being marked as sandboxed would be needed to try and figure out the timing.
DRIO doesn't mark the vmarea as sandboxed; it sets this flag on the fragment directly in check_thread_vm_area():
if (ok && ro2s->written_count >= DYNAMO_OPTION(ro2sandbox_threshold)) {
...
frag_flags |= SANDBOX_FLAG();
...
Kirill
DRIO doesn't mark vmarea as sandboxed, DRIO sets this flag for fragment in check_thread_vm_area() directly
You mean, when the fragment is created that threshold has not been crossed, but when it recreates the fragment the threshold has been crossed (due to some concurrent execution in another thread or something)? But this code you've quoted is only entered when an area is not on the executable list: which means it was removed on a flush (or it's the very first execution for non-ELF-image regions). But you saw no flush? Maybe re-search for a flush: look for flush_fragments_in_region_start.
Added log at the top of flush_fragments_in_region_start.
On my last run I hit the issue with:
recreate_app : looking for 0x00007f5beb76829d in frag @ 0x00007f5beb768299 (tag 0x00007f59e1093e1b)
The tag is inside the region that was flushed before that point:
FLUSH flush_fragments_in_region_start (thread 1468315 flushtime 3972): 0x00007f59e1000000-
0x00007f59e1270000
new executable area 0x00007f59e1000000-0x00007f59e1270000 written >= 10X => switch to sandboxing
I tried searching for the same tag earlier; before this point there was a similar fragment:
FLUSH flush_fragments_in_region_start (thread 1468315 flushtime 3961): 0x00007f59e1000000-0x00007f59e1270000
FLUSH flush_fragments_in_region_start (thread 1468339 flushtime 3962): 0x00007f59e1012000-0x00007f59e1013000
FLUSH flush_fragments_in_region_start (thread 1468347 flushtime 3962): 0x00007f59e1052000-0x00007f59e1053000
FLUSH flush_fragments_in_region_start (thread 1468342 flushtime 3962): 0x00007f59e109e000-0x00007f59e109f000
FLUSH flush_fragments_in_region_start (thread 1468346 flushtime 3962): 0x00007f59e1097000-0x00007f59e1098000
FLUSH flush_fragments_in_region_start (thread 1468344 flushtime 3962): 0x00007f59e1052000-0x00007f59e1053000
recreate_app : looking for 0x00007f5beb49e87a in frag @ 0x00007f5beb49e7cd (tag 0x00007f59e1093e1b)
but that one was ok and the address matched:
TAG 0x00007f59e1093e1b
+0 m4 @0x00007f59f0698978 opcode=56 65 48 89 0c 25 10 00 mov %rcx -> %gs:0x10[8byte]
00 00
+9 m4 @0x00007f59f069a660 opcode=57 48 b9 00 00 00 00 00 mov $0x0000000000000000 -> %rcx
00 00 00
+19 m4 @0x00007f59f069a290 opcode=16 ff 01 inc (%rcx)[4byte] -> (%rcx)[4byte]
+21 m4 @0x00007f59f0697498 opcode=14 83 39 14 cmp (%rcx)[4byte] $0x00000014
+24 m4 @0x00007f59f069a9c8 opcode=38 7c fe jl @0x00007f59f0695f20[8byte]
+26 m4 @0x00007f59f0698020 opcode=55 65 48 8b 0c 25 10 00 mov %gs:0x10[8byte] -> %rcx
00 00
+35 L4 @0x00007f59f16eb058 opcode=46 e9 16 6f 9e f0 jmp $0x00007f59e1093e1b
+40 m4 @0x00007f59f0695f20 opcode=3 <label>
+40 m4 @0x00007f59f0698c50 opcode=56 65 48 89 34 25 08 00 mov %rsi -> %gs:0x08[8byte]
00 00
+49 m4 @0x00007f59f0696fe0 opcode=56 65 48 89 3c 25 18 00 mov %rdi -> %gs:0x18[8byte]
00 00
+58 m4 @0x00007f59f0697430 opcode=57 48 be 1b 3e 09 e1 59 mov $0x00007f59e1093e1b -> %rsi
7f 00 00
+68 m4 @0x00007f59f069af30 opcode=57 48 bf 1b 3e 09 e1 59 mov $0x00007f59e1093e1b -> %rdi
7f 00 00
+78 m4 @0x00007f59f0696718 opcode=393 a6 cmps %ds:(%rsi)[1byte] %es:(%rdi)[1byte] %rsi %rdi -> %rsi %rdi
+79 m4 @0x00007f59f0696c90 opcode=157 0f 85 fa ff ff ff jnz @0x00007f59f069a6e0[8byte]
+85 m4 @0x00007f59f0695f88 opcode=57 48 b9 1b 3e 09 e1 59 mov $0x00007f59e1093e1b -> %rcx
7f 00 00
+95 m4 @0x00007f59f06964c8 opcode=14 48 3b f1 cmp %rsi %rcx
+98 m4 @0x00007f59f06993e0 opcode=57 48 b9 0a 00 00 00 00 mov $0x000000000000000a -> %rcx
00 00 00
+108 m4 @0x00007f59f0695c28 opcode=165 0f 8d fa ff ff ff jnl @0x00007f59f0699178[8byte]
+114 m4 @0x00007f59f06962e0 opcode=57 48 bf 25 3e 09 e1 59 mov $0x00007f59e1093e25 -> %rdi
7f 00 00
+124 m4 @0x00007f59f06983f0 opcode=57 48 be 25 3e 09 e1 59 mov $0x00007f59e1093e25 -> %rsi
7f 00 00
+134 m4 @0x00007f59f0699178 opcode=3 <label>
+134 m4 @0x00007f59f06963c8 opcode=394 f3 a6 rep cmps %ds:(%rsi)[1byte] %es:(%rdi)[1byte] %rsi %rdi %rcx -> %rsi %rdi %rcx
+136 m4 @0x00007f59f069a6e0 opcode=3 <label>
+136 m4 @0x00007f59f0697780 opcode=55 65 48 8b 0c 25 10 00 mov %gs:0x10[8byte] -> %rcx
00 00
+145 m4 @0x00007f59f069a190 opcode=55 65 48 8b 34 25 08 00 mov %gs:0x08[8byte] -> %rsi
00 00
+154 m4 @0x00007f59f0699db0 opcode=55 65 48 8b 3c 25 18 00 mov %gs:0x18[8byte] -> %rdi
00 00
+163 L4 @0x00007f59f0699f28 opcode=157 0f 85 15 6f 9e f0 jnz $0x00007f59e1093e1b
+169 L3 @0x00007f59f0698208 opcode=55 8b 44 85 18 mov 0x18(%rbp,%rax,4)[4byte] -> %eax
+173 L3 @0x00007f59f0697eb8 opcode=60 41 85 03 test (%r11)[4byte] %eax
+176 L3 @0x00007f59f0697b50 opcode=60 85 c0 test %eax %eax
+178 L4 @0x00007f59f0698df0 opcode=165 0f 8d b0 19 df fa jnl $0x00007f5beb49e8b6
+184 L4 @0x00007f59f0698108 opcode=46 e9 d5 19 df fa jmp $0x00007f5beb49e8da
END 0x00007f59e1093e1b
cache pc 0x00007f5beb49e7cd vs 0x00007f5beb49e87a 9 0x0000000000000000
cache pc 0x00007f5beb49e7d6 vs 0x00007f5beb49e87a 10 0x0000000000000000
cache pc 0x00007f5beb49e7e0 vs 0x00007f5beb49e87a 2 0x0000000000000000
cache pc 0x00007f5beb49e7e2 vs 0x00007f5beb49e87a 3 0x0000000000000000
cache pc 0x00007f5beb49e7e5 vs 0x00007f5beb49e87a 2 0x0000000000000000
cache pc 0x00007f5beb49e7e7 vs 0x00007f5beb49e87a 9 0x0000000000000000
cache pc 0x00007f5beb49e7f0 vs 0x00007f5beb49e87a 5 0x0000000000000000
cache pc 0x00007f5beb49e7f5 vs 0x00007f5beb49e87a 9 0x0000000000000000
cache pc 0x00007f5beb49e7fe vs 0x00007f5beb49e87a 9 0x0000000000000000
cache pc 0x00007f5beb49e807 vs 0x00007f5beb49e87a 10 0x0000000000000000
cache pc 0x00007f5beb49e811 vs 0x00007f5beb49e87a 10 0x0000000000000000
cache pc 0x00007f5beb49e81b vs 0x00007f5beb49e87a 1 0x0000000000000000
cache pc 0x00007f5beb49e81c vs 0x00007f5beb49e87a 6 0x0000000000000000
cache pc 0x00007f5beb49e822 vs 0x00007f5beb49e87a 10 0x0000000000000000
cache pc 0x00007f5beb49e82c vs 0x00007f5beb49e87a 3 0x0000000000000000
cache pc 0x00007f5beb49e82f vs 0x00007f5beb49e87a 10 0x0000000000000000
cache pc 0x00007f5beb49e839 vs 0x00007f5beb49e87a 6 0x0000000000000000
cache pc 0x00007f5beb49e83f vs 0x00007f5beb49e87a 10 0x0000000000000000
cache pc 0x00007f5beb49e849 vs 0x00007f5beb49e87a 10 0x0000000000000000
cache pc 0x00007f5beb49e853 vs 0x00007f5beb49e87a 2 0x0000000000000000
cache pc 0x00007f5beb49e855 vs 0x00007f5beb49e87a 9 0x0000000000000000
cache pc 0x00007f5beb49e85e vs 0x00007f5beb49e87a 9 0x0000000000000000
cache pc 0x00007f5beb49e867 vs 0x00007f5beb49e87a 9 0x0000000000000000
cache pc 0x00007f5beb49e870 vs 0x00007f5beb49e87a 6 0x0000000000000000
cache pc 0x00007f5beb49e876 vs 0x00007f5beb49e87a 4 0x00007f5beb49e8cf
cache pc 0x00007f5beb49e87a vs 0x00007f5beb49e87a 3 0x00007f5beb49e8d3
2 recreate_app -- found valid state pc 0x00007f59e1093e1f
1 recreate_app -- found ok pc 0x00007f59e1093e1f
Kirill
Hi, @derekbruening. Currently we are trying to use tools. First we added the default DynamoRIO tools, but most of them crashed. Let's look at instrace_simple (with all fprintf calls in the tool disabled because they produce SIGBUS) on just a clean java invocation without any workload.
./bin64/drrun -disable_traces -c ./api/bin/libinstrace_simple.so -- java -XX:+ShowMessageBoxOnError
crash
Unexpected Error
------------------------------------------------------------------------------
SIGSEGV (0xb) at pc=0x00007f5d3355dadc, pid=2092212, tid=0x00007f5acd0ae700
Do you want to debug the problem?
To debug, run 'gdb /proc/2092212/exe 2092212'; then switch to thread 2092231 (0x00007f5acd0ae700)
Enter 'yes' to launch gdb automatically (PATH must include gdb)
Otherwise, press RETURN to abort...
==============================================================================
gdb /proc/2092212/exe 2092212 2092231
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f5d3355dadc, pid=2092212, tid=0x00007f5acd0ae700
#
# JRE version: OpenJDK Runtime Environment (8.0) (build 1.8.0-internal-debug-root_2021_07_19_10_14-b00)
# Java VM: OpenJDK 64-Bit Server VM (25.71-b00-debug mixed mode linux-amd64 compressed oops)
# Problematic frame:
# 0x00007f5d3355dadc V [libjvm.so+0x7f2adc] CodeHeap::add_to_freelist(HeapBlock*)+0x1c
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/huawei/builds/cronbuild-8.0.18887/projects/hs_err_pid2092212.log
stack
(gdb) bt
#0 0x00007f5cf33cef68 in ?? ()
#1 0x00007f5acd0ac910 in ?? ()
#2 0x00007f5d00000000 in ?? ()
#3 0x00007f5acd0ac980 in ?? ()
#4 0x0000000000000010 in ?? ()
#5 0x00007f5acd0ac9b0 in ?? ()
#6 0x00007f5d33916b76 in os::message_box (title=0x7f5d33ea790b "Unexpected Error",
message=0x7f5d344be420 "SIGSEGV (0xb) at pc=0x00007f5d3355dadc, pid=2092212, tid=0x00007f5acd0ae700\n\nDo you want to debug the problem?\n\nTo debug, run 'gdb /proc/2092212/exe 2092212'; then switch to thread 2092231 (0x00007f5a"...) at /root/builds/kuhanov/openjdk8u/hotspot/src/os/linux/vm/os_linux.cpp:5516
#7 0x00007f5d33af1f65 in VMError::show_message_box (this=0x7f5acd0acbc0,
buf=0x7f5d344be420 "SIGSEGV (0xb) at pc=0x00007f5d3355dadc, pid=2092212, tid=0x00007f5acd0ae700\n\nDo you want to debug the problem?\n\nTo debug, run 'gdb /proc/2092212/exe 2092212'; then switch to thread 2092231 (0x00007f5a"..., buflen=2000) at /root/builds/kuhanov/openjdk8u/hotspot/src/os/linux/vm/vmError_linux.cpp:53
#8 0x00007f5d33af104c in VMError::report_and_die (this=0x7f5acd0acbc0) at /root/builds/kuhanov/openjdk8u/hotspot/src/share/vm/utilities/vmError.cpp:955
#9 0x00007f5d3391bd59 in JVM_handle_linux_signal (sig=11, info=0x7f5acd0ace90, ucVoid=0x7f5acd0acd60, abort_if_unrecognized=1)
at /root/builds/kuhanov/openjdk8u/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp:558
#10 0x00007f5d339146d9 in signalHandler (sig=11, info=0x7f5acd0ace90, uc=0x7f5acd0acd60) at /root/builds/kuhanov/openjdk8u/hotspot/src/os/linux/vm/os_linux.cpp:4588
#11 <signal handler called>
#12 0x00007f5d3355dadc in CodeHeap::add_to_freelist (this=0x71c5c56479c5fc64, a=0xc5d1df2941c4c7df)
at /root/builds/kuhanov/openjdk8u/hotspot/src/share/vm/memory/heap.cpp:363
#13 0x00007f5d3355d4eb in CodeHeap::deallocate (this=0x71c5c56479c5fc64, p=0xf9c5160c6ff9c506) at /root/builds/kuhanov/openjdk8u/hotspot/src/share/vm/memory/heap.cpp:240
Register context for frame #12:
(gdb) f 12
#12 0x00007f5d3355dadc in CodeHeap::add_to_freelist (this=0x71c5c56479c5fc64, a=0xc5d1df2941c4c7df)
at /root/builds/kuhanov/openjdk8u/hotspot/src/share/vm/memory/heap.cpp:363
363 /root/builds/kuhanov/openjdk8u/hotspot/src/share/vm/memory/heap.cpp: No such file or directory.
(gdb) i r
rax 0x71c5c56479c5fc64 8198175732028275812
rbx 0x7f5accf8e040 140027962646592
rcx 0x0 0
rdx 0xc5d1df2941c4c7df -4192324409815152673
rsi 0xc5d1df2941c4c7df -4192324409815152673
rdi 0x71c5c56479c5fc64 8198175732028275812
rbp 0x7f5acd0ad450 0x7f5acd0ad450
rsp 0x7f5acd0ad420 0x7f5acd0ad420
r8 0x7f5aec187000 140028484808704
r9 0x4 4
r10 0x0 0
r11 0x286 646
r12 0x1fecc7 2092231
r13 0x7f5d32d697cf 140038261610447
r14 0x7f5d32d698b0 140038261610672
r15 0x7f5acd0adfc0 140027963826112
rip 0x7f5d3355dadc 0x7f5d3355dadc <CodeHeap::add_to_freelist(HeapBlock*)+28>
eflags 0x10202 [ IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
Looks like DRIO restores registers incorrectly when it adds instrumentation instructions to a bb: CodeHeap::add_to_freelist is entered with this=0x71c5c56479c5fc64 and a=0xc5d1df2941c4c7df, and rax holds 0x71c5c56479c5fc64.
As we understand it, there is a lazy algorithm for saving and restoring the registers used in instrumentation instructions. To check this, we made drreg always save and restore registers:
diff --git a/ext/drreg/drreg.c b/ext/drreg/drreg.c
index a711cbea..ff12a95f 100644
--- a/ext/drreg/drreg.c
+++ b/ext/drreg/drreg.c
@@ -951,7 +951,7 @@ drreg_reserve_reg_internal(void *drcontext, instrlist_t *ilist, instr_t *where,
pt->reg[GPR_IDX(reg)].in_use = true;
if (!already_spilled) {
/* Even if dead now, we need to own a slot in case reserved past dead point */
- if (ops.conservative ||
+ if (true || ops.conservative ||
drvector_get_entry(&pt->reg[GPR_IDX(reg)].live, pt->live_idx) == REG_LIVE) {
LOG(drcontext, DR_LOG_ALL, 3, "%s @%d." PFX ": spilling %s to slot %d\n",
__FUNCTION__, pt->live_idx, get_where_app_pc(where),
@@ -1236,7 +1236,7 @@ drreg_unreserve_register(void *drcontext, instrlist_t *ilist, instr_t *where,
return DRREG_ERROR_INVALID_PARAMETER;
LOG(drcontext, DR_LOG_ALL, 3, "%s @%d." PFX " %s\n", __FUNCTION__, pt->live_idx,
get_where_app_pc(where), get_register_name(reg));
- if (drmgr_current_bb_phase(drcontext) != DRMGR_PHASE_INSERTION) {
+ if (true || drmgr_current_bb_phase(drcontext) != DRMGR_PHASE_INSERTION) {
/* We have no way to lazily restore. We do not bother at this point
* to try and eliminate back-to-back spill/restore pairs.
*/
With this change the crashes disappeared and we could collect tool statistics. Could you look at this issue on your side (the reproducer is very simple)? What could be going wrong in the save/restore registers algorithm here? Thanks, Kirill
@sapostolakis has a tool that tries to systematically find register state errors such as from drreg that might be able to help here. Tracking these things down can be difficult. Here, one approach would be a binary search over blocks, turning the lazy restores on at N blocks and locating the problematic block that way, if the block sequence is deterministic.
I would first suspect a bad interaction with something unique to Java vs normal apps (since we're using drreg on very large x86 apps and this code is fairly well exercised on regular apps): selfmod sandboxing. I wonder if there's some register usage by the sandboxing mangling that breaks drreg. Does disabling that mangling also solve the problem?
Does disabling that mangling also solve the problem?
no, disabling sandboxing didn't help here - the same crash.
./bin64/drrun -disable_traces -sandbox2ro_threshold 0 -ro2sandbox_threshold 0 -c ./api/bin/libinstrace_simple.so -- java -XX:+ShowMessageBoxOnError
Kirill
-sandbox2ro_threshold 0 -ro2sandbox_threshold 0 doesn't disable all sandboxing. -no_sandbox_writes partially does it; I think -no_hw_cache_consistency might be the only way to completely disable it -- at the risk of incorrect execution if there is truly modified code.
-no_sandbox_writes has the same issue.
-no_hw_cache_consistency runs without the crash.
Kirill
Hi, @derekbruening, @AssadHashmi, @fhahn. We are now trying to run java workloads on AArch64 like we do on x86. But we get hangs on heavy runs (HelloWorld is ok). What we can see in the debugger is that all threads wait on a futex while one thread issues a futex wake (the count of woken threads is 0, and the wake's futex address is not present among the waiting threads, which is strange).
Dump for the threads: SYS_futex (0x62), with args uint32_t uaddr, int futex_op, uint32_t val, const struct timespec *timeout, ...
pc x8 x0 x1 x2 x3
0xffff6da3d3b0 0x62 0xffffb0476700 0x80 0x2 0x0
0xffff6c70f948 0x62 0xfffd682abd8c 0x80 0x0 0x0
0xffff6c70f948 0x62 0xffffb0476700 0x80 0x2 0x0
0xffff6d8b53b0 0x62 0xffffb0476700 0x80 0x2 0x0
0xffff6d84d3b0 0x62 0xffffb0476700 0x80 0x2 0x0
0xffff6d8253b0 0x62 0xffffb0476700 0x80 0x2 0x0
0xffff6c70f948 0x62 0xffffb0476700 0x80 0x2 0x0
0xffff6c70f948 0x62 0xffffb0476700 0x80 0x2 0x0
0xffff6d7dd330 0x62 0xfffd6827ee88 0x80 0x0 0xfffc08e91088
0xffff6d785330 0x62 0xfffd6827c288 0x80 0x0 0xfffc09092108
0xffff6d68d3b0 0x62 0xfffd68279788 0x80 0x0 0xfffc09293188
0xffff6d675330 0x62 0xfffd68276c88 0x80 0x0 0xfffc09494208
0xffff6d65d328 0x62 0xffffb0476700 0x80 0x2 0x0
0xffff6d645330 0x62 0xfffd68271588 0x80 0x0 0xfffc09896308
0xffff6d62d330 0x62 0xfffd6826ea88 0x80 0x0 0xfffc09a97388
0xffff6d6153b0 0x62 0xfffd6826bf88 0x80 0x0 0xfffc09c98008
0xffff6d5fd330 0x62 0xfffd68261388 0x80 0x0 0xfffc09e99088
0xffff6d5e5330 0x62 0xfffd6825e888 0x80 0x0 0xfffc0a09a108
0xffff6d5cd330 0x62 0xfffd6825bd88 0x80 0x0 0xfffc0a29b188
0xffff6d585330 0x62 0xfffd68258b88 0x80 0x0 0xfffc0a49c208
0xffff6c70f948 0x62 0xffffb0452f80 0x189 0x0 0x0
0xffff6c70f948 0x62 0xfffd6820298c 0x80 0x0 0x0
0xffff6c70f948 0x62 0xfffd681fd18c 0x80 0x0 0x0
0xffff6d385330 0x62 0xfffd681eee88 0x80 0x0 0xfffc0ac9cf78
0xffff6c70f948 0x62 0xfffd680a6988 0x80 0x0 0x0
0xffff6c70f948 0x62 0xfffd680a4988 0x80 0x0 0x0
Do you have any ideas about what could be wrong here? Where in the code could we investigate? Is this DR internals?
A SIGUSR2 signal to the process resumes the threads from the futex. Thanks, Kirill
we are seeing that SPECjvm2008 runs won't even start the warm-up phase when launched with drrun. Typically SPECjvm runs may look like this:
With drrun we never get to this first message. I do see two threads running for a short period, but I'm not convinced the run is successful since it never gets to the warm-up and execution phases of the test. Also, memory utilization is roughly 11GB, which is quite high for sparse.small.
Attached: debug level 3 log for the java pid: java.log.zip
java.0.59824.zip