DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform

[jdk8] SPECjvm 2008 tests won't run #3733

Open rkgithubs opened 5 years ago

rkgithubs commented 5 years ago

We are seeing that SPECjvm 2008 runs won't even start the warm-up phase when launched with drrun. A typical SPECjvm run looks like this:

/home/rahul/jdk1.8.0_201/bin/java -jar SPECjvm2008.jar -ikv -wt 15 -it 30 -bt 2 scimark.sparse.small

SPECjvm2008 Peak
  Properties file:   none
  Benchmarks:        scimark.sparse.small

With drrun we never get to this first message. I do see two threads running for a short period, but I am not convinced the run is successful since it never reaches the warm-up and execution phases of the test. Memory utilization is roughly 11GB, which is quite high for sparse.small.

/root/rahul/DynamoRIO-x86_64-Linux-7.90.18019-0/bin64/drrun -s 60 -debug -loglevel 3 -vm_size 1G -no_enable_reset -disable_traces -- ~/rahul/jdk1.8.0_201/bin/java -jar SPECjvm2008.jar -ikv -wt 15 -it 30 -bt 2 scimark.sparse.small

<log dir=/root/rahul/DynamoRIO-x86_64-Linux-7.90.18019-0/bin64/../logs/java.59563.00000000>

<Starting application /root/rahul/jdk1.8.0_201/bin/java (59563)>
<Initial options = -no_dynamic_options -loglevel 3 -code_api -stack_size 56K -signal_stack_size 32K -disable_traces -no_enable_traces -max_elide_jmp 0 -max_elide_call 0 -no_shared_traces -bb_ibl_targets -no_shared_trace_ibl_routine -no_enable_reset -no_reset_at_switch_to_os_at_vmm_limit -reset_at_vmm_percent_free_limit 0 -no_reset_at_vmm_full -reset_at_commit_free_limit 0K -reset_every_nth_pending 0 -vm_size 1048576K -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<Paste into GDB to debug DynamoRIO clients:
set confirm off
add-symbol-file '/root/rahul/DynamoRIO-x86_64-Linux-7.90.18019-0/lib64/debug/libdynamorio.so' 0x00007f2e11bd7580
>
<curiosity: rex.w on OPSZ_6_irex10_short4!>
<(1+x) Handling our fault in a TRY at 0x00007f2e11e20d7c>
<spurious rep/repne prefix @0x00007f2e11994f96 (f2 41 ff e3): >
<writing to executable region.>
<get_memory_info mismatch! (can happen if os combines entries in /proc/pid/maps)
        os says: 0x00000000491dc000-0x0000000089042000 prot=0x00000000
        cache says: 0x000000004904e000-0x0000000089042000 prot=0x00000000
>

Attached: loglevel 3 debug log for the java pid: java.log.zip

java.0.59824.zip

derekbruening commented 2 years ago

Please load DR symbols and get callstacks and see whether the waits are in DR code or not, and if in DR which lock(s) are involved.

kuhanov commented 2 years ago

Please load DR symbols and get callstacks

We did, but the stacks are not resolved.

libdynamorio.so base is 0x71000000
Running add-symbol-file /home/test/drio/lib64/release/libdynamorio.so 0x71000000+0x00014510
add symbol table from file "/home/test/drio/lib64/release/libdynamorio.so" at
        .text_addr = 0x71014510
libjvm.so base is 0xffffaefd6000
Running add-symbol-file /home/openjdk8u/jvm/openjdk-1.8.0-internal-debug/jre/lib/aarch64/server/libjvm.so 0xffffaefd6000+0x0021f340
add symbol table from file "/home/openjdk8u/jvm/openjdk-1.8.0-internal-debug/jre/lib/aarch64/server/libjvm.so" at
        .text_addr = 0xffffaf1f5340
(gdb) x /i $pc 
=> 0xffff6c70f948:      svc     #0x0
(gdb) thread apply all bt 

Thread 89 (LWP 94519):
#0  0x0000ffff6da3d3b0 in ?? ()
#1  0x0000ffffaecfd6c8 in ?? ()
#2  0xf1f1f1f100000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 88 (LWP 94517):
#0  0x0000ffff6c70f948 in ?? ()
#1  0x0000000000017135 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 87 (LWP 94445):
#0  0x0000ffff6c70f948 in ?? ()
#1  0x0000000000000048 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

derekbruening commented 2 years ago

What is printing "libdynamorio.so base is 0x71000000"? Please check the procfs maps file and see where libdynamorio.so is really loaded: modern kernels don't normally load it at 0x71. If libdynamorio.so is not encompassing 0xffff6c70f948, print the maps entry for that address: what is it? Is it in DR's allocated memory (from `heapmgt`)?

kuhanov commented 2 years ago

What is printing "libdynamorio.so base is 0x71000000"?

This is just a gdb Python script that parses /proc/pid/maps:

    # Note: this is the invoke() method of a gdb.Command subclass; it relies
    # on module-level imports: gdb, re, subprocess, and Path (from pathlib).
    def invoke(self, unused_arg, unused_from_tty):
        pid = int(gdb.selected_inferior().pid)
        exefile = "/proc/%d/exe" % pid
        drfile = str(Path(exefile).resolve())
        base = 0
        map_name = "/proc/%d/maps" % pid
        for MODULE in ['libdynamorio.so', 'libjvm.so']:
            with open(map_name,'r') as map:
                for line in map:
                    line = line.rstrip()
                    if not line: continue
                    match = re.match(r'^([^-]*)-.*' + MODULE, line)
                    if match:
                        base = int(match.group(1), 16)
                        print("%s base is 0x%x" % (MODULE, base))
                        drfile = line.split()[-1]
                        break
            p = subprocess.Popen(["objdump", "-h", drfile], stdout=subprocess.PIPE)
            stdout, _ = p.communicate()
            for line in iter(stdout.splitlines()):
                match = re.match(r'.*\.text *\w+ *\w+ *\w+ *(\w+)', str(line))
                if match:
                    offs = match.group(1)
                    cmd = "add-symbol-file %s 0x%x+0x%s" % (drfile, base, offs)
                    # First clear the symbols (sometimes gdb has DR syms at the wrong
                    # load address).
                    gdb.execute("symbol-file")
                    print("Running %s" % cmd)
                    gdb.execute(cmd)

Looks like something is going wrong in the script.

Kirill

kuhanov commented 2 years ago

@derekbruening: looks like the script is correct. We dumped /proc/pid/maps for our java process:

71000000-711a8000 r-xp 00000000 fd:00 5638685                            /home/test/lib64/release/libdynamorio.so
711a8000-711bb000 ---p 00000000 00:00 0
711bb000-711d0000 rw-p 001ab000 fd:00 5638685                            /home/test/lib64/release/libdynamorio.so

$pc and the address from the stack are outside any module:

Thread 4 (LWP 17851):
=> 0xffff5c606948:      svc     #0x0
ffff5c606000-ffff5c607000 r-xp 00000000 00:00 0
(gdb) bt
#0  0x0000ffff5c606948 in ?? ()
#1  0x0000ffff9ebf4338 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) memquery 0x0000ffff9ebf4338
ffff9e9f9000-ffff9ebf6000 rw-p 00000000 00:00 0

Kirill

kuhanov commented 2 years ago

Hi, @derekbruening. Some of our DynamoRIO tools hit a memory issue on heavy HiBench workloads:

Out of memory.  Program aborted.  Source C, type 0x0000000000000001, code 0x000000000000000c.>

I've tried this set of options: -vm_size 2G -vmheap_size 16G -opt_memory. But it looks like the VM reservation is still not enough. Is there any other way to increase it? Thx, Kirill.

derekbruening commented 2 years ago

@derekbruening . Looks like the script is correct. We dumped /proc/pid/maps for our java process

OK, different kernel making different mmap decisions. But the main question still remains:

If libdynamorio.so is not encompassing 0xffff6c70f948, print the maps entry for that address: what is it? Is it in DR's allocated memory (from `heapmgt`)?

derekbruening commented 2 years ago

Out of memory. Program aborted. Source C, type 0x0000000000000001, code 0x000000000000000c.>

C==OOM_COMMIT, 1==VMM_HEAP, 0xc==ENOMEM.

So it is unrelated to the address space, so increasing the reservation sizes will not make a difference. Your kernel is refusing to give out another physical page. I would check your system's overcommit settings (/proc/sys/vm/overcommit_memory) and overall memory situation. I would use -rstats_to_stderr to see how much DR and your clients are using but it may not really be an issue with them at all: it might be the app allocating a ton of physical pages and not using them and your kernel being set to not allow overcommit.
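
For illustration, a minimal standalone sketch (my example, not DR code): unless overcommit is forced on (vm.overcommit_memory=1), asking for far more committed memory than RAM+swap can provide fails at mmap time with ENOMEM, even though no page is ever touched.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* Ask for 64GB of committed (readable+writable) anonymous memory. */
    size_t size = 64ULL * 1024 * 1024 * 1024;
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) /* ENOMEM if the kernel refuses to overcommit */
        printf("mmap failed: %s\n", strerror(errno));
    else
        printf("mmap succeeded without touching a single page\n");
    return 0;
}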

kuhanov commented 2 years ago

If libdynamorio.so is not encompassing 0xffff6c70f948, print the maps entry for that address: what is it? Is it in DR's allocated memory (from `heapmgt`)?

How could we check that?

We ran another collection with the same crash. $pc and the address from the stack are outside any module in /proc/pid/maps:

Thread 4 (LWP 17851):
=> 0xffff5c606948:      svc     #0x0
ffff5c606000-ffff5c607000 r-xp 00000000 00:00 0
(gdb) bt
#0  0x0000ffff5c606948 in ?? ()
#1  0x0000ffff9ebf4338 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Kirill

kuhanov commented 2 years ago

it might be the app allocating a ton of physical pages and not using them

Java is OK without DynamoRIO, and it is OK under DynamoRIO with no tool or with other tools. So it would be strange for the issue to be in the application itself. Kirill

kuhanov commented 2 years ago

-rstats_to_stderr

DynamoRIO statistics:
              Peak threads under DynamoRIO control :                20
                              Threads ever created :                20
                           Total signals delivered :                 1
                                 System calls, pre :               766
                                System calls, post :               663
                                 Application mmaps :               205
                               Application munmaps :                33
        Synch attempt failure b/c not at safe spot :                65
                   Basic block fragments generated :            145043
             Peak fcache combined capacity (bytes) :          85139456
                    Peak fcache units on live list :              1495
                Peak special heap capacity (bytes) :           1716224
                      Peak heap units on live list :              1164
                         Peak client raw mmap size :        1315389440
                       Peak stack capacity (bytes) :           1859584
                        Peak heap capacity (bytes) :         395087872
                 Peak total memory from OS (bytes) :         513536000
              Peak vmm blocks for unreachable heap :             98698
                         Peak vmm blocks for stack :               536
      Peak vmm blocks for unreachable special heap :               120
      Peak vmm blocks for unreachable special mmap :               140
                Peak vmm blocks for reachable heap :               242
                         Peak vmm blocks for cache :             24027
        Peak vmm blocks for reachable special heap :               474
        Peak vmm blocks for reachable special mmap :               227
            Peak vmm virtual memory in use (bytes) :         509190144

derekbruening commented 2 years ago

If libdynamorio.so is not encompassing 0xffff6c70f948, print the maps entry for that address: what is it? Is it in DR's allocated memory (from `heapmgt`)?

How could we check that?

I pointed at `heapmgt`: print that in gdb and it will show the vmcode and vmheap memory ranges. But given that there is no libdynamorio code at these wait points, it does not seem to involve DR's futexes.

derekbruening commented 2 years ago

Peak client raw mmap size : 1315389440

You can see your client is using more memory than DR. DR itself is less than half a GB. Is your system really not enabling overcommit? It seems strange to let committed and unused pages artificially use up all your memory -- why not enable overcommit? I thought that was the default nowadays on all Linux distros.

kuhanov commented 2 years ago

why not enable overcommit?

My system is Ubuntu 20.04.2. I suppose overcommit is enabled by default.

cat /proc/sys/vm/overcommit_memory
0
cat /proc/sys/vm/overcommit_kbytes
0
cat /proc/sys/vm/overcommit_ratio
50

Kirill

kuhanov commented 2 years ago

Looks like the tool occupied all RAM :( The OOM error happened between 688 and 10741 free MB of memory:

              total        used        free      shared  buff/cache   available
Mem:          15847       13174         846          58        1825        2282
Swap:          2047          37        2010
              total        used        free      shared  buff/cache   available
Mem:          15847       13250         770          58        1825        2205
Swap:          2047          37        2010
              total        used        free      shared  buff/cache   available
Mem:          15847       13333         688          58        1825        2123
Swap:          2047          37        2010
              total        used        free      shared  buff/cache   available
Mem:          15847        3280       10741          58        1825       12176
Swap:          2047          37        2010

Kirill

RabbitPowerr commented 2 years ago

Peak client raw mmap size : 1315389440

You can see your client is using more memory than DR. DR itself is less than half a GB. Is your system really not enabling overcommit? It seems strange to let committed and unused pages artificially use up all your memory -- why not enable overcommit? I thought that was the default nowadays on all Linux distros.

Hi, @derekbruening. Dump of heapmgt:

(gdb) print *heapmgt
$1 = {vmheap = {start_addr = 0xfffd6dd91000 "", end_addr = 0xffff6dd91000 "", alloc_start = 0xfffd6dd91000 "", alloc_size = 8589938688, num_blocks = 2097152, lock = {lock_requests = -1, contended_event = -1, 
      name = 0x7116ddb0 "vmh_lock(mutex)@/home/burdukov/dynamorio_fake_tag/core/heap.c:951", rank = 76, owner = 0, owning_dcontext = 0x0, prev_owned_lock = 0x0, count_times_acquired = 1175, count_times_contended = 0, 
      count_times_spin_pause = 0, max_contended_requests = 0, count_times_spin_only = 0, prev_process_lock = 0x711c9cb0 <request_region_be_heap_reachable_lock>, next_process_lock = 0x711c9d28 <heap_unit_lock>, callstack = {0x0, 
        0x0, 0x0, 0x0}, app_lock = false, deleted = false}, num_free_blocks = 2074474, name = 0x7116dee8 "vmheap", blocks = 0xfffd6dd91000}, vmcode = {start_addr = 0xffff6dd92000 "", end_addr = 0xffffadd92000 "\177ELF\002\001\001", 
    alloc_start = 0xffff6dd92000 "", alloc_size = 1073741824, num_blocks = 262144, lock = {lock_requests = -1, contended_event = -1, name = 0x7116ddb0 "vmh_lock(mutex)@/home/burdukov/dynamorio_fake_tag/core/heap.c:951", rank = 76, 
      owner = 0, owning_dcontext = 0x0, prev_owned_lock = 0x0, count_times_acquired = 534, count_times_contended = 0, count_times_spin_pause = 0, max_contended_requests = 0, count_times_spin_only = 0, 
      prev_process_lock = 0x711c9380 <options_lock>, next_process_lock = 0x711c9cb0 <request_region_be_heap_reachable_lock>, callstack = {0x0, 0x0, 0x0, 0x0}, app_lock = false, deleted = false}, num_free_blocks = 254385, 
    name = 0x7116ddf8 "vmcode", blocks = 0xffff6dd92000}, dual_map_file = 0, vmcode_writable_base = 0x0, vmcode_writable_alloc = 0x0, heap = {units = 0xfffd732d6000, dead = 0x0, num_dead = 0}, global_units = {
    top_unit = 0xfffd6ddd2000, cur_unit = 0xfffd6ffa4000, free_list = {0xfffd6de1d448 "@\324\341m\375\377", 0xfffd6ffdf0a8 "", 0xfffd6ffd9650 "\020\226\375o\375\377", 0xfffd6ddd38b8 "\340 \337m\375\377", 0x0, 0x0, 0x0, 0x0, 0x0, 
      0xfffd6ef45dc0 "\030\201\364n\375\377", 0xfffd6ffd00c8 "`]\375o\375\377"}, dcontext = 0xffffffffffffffff, which = VMM_HEAP, writable = true}, global_nonpersistent_units = {top_unit = 0xfffd6dde2000, cur_unit = 0xfffd7335a000, 
    free_list = {0x0, 0xfffd6fbb22b8 "", 0xfffd73344cc0 "", 0xfffd7336f5e8 "P\366\066s\375\377", 0xfffd72ee7648 "\250\255\356r\375\377", 0x0, 0x0, 0x0, 0x0, 0x0, 0xfffd6e2f4048 "\320t\360m\375\377"}, dcontext = 0xffffffffffffffff, 
    which = VMM_HEAP, writable = true}, global_heap_writable = true, global_unprotected_units = {top_unit = 0xfffd6ddda000, cur_unit = 0xfffd6ddda000, free_list = {0x0, 0xfffd6ddda040 "", 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 
      0x0}, dcontext = 0xffffffffffffffff, which = VMM_HEAP, writable = true}, global_reachable_units = {top_unit = 0xffff6dd9b000, cur_unit = 0xffff6dd9b000, free_list = {0x0 <repeats 11 times>}, dcontext = 0xffffffffffffffff, 
    which = (VMM_HEAP | VMM_REACHABLE), writable = true}}

Where should we look more closely? What info would be useful for us?

We also see trouble with mangling. Can it cause our deadlock?

unsupported mangle instr  movk   %x30 $0xffff lsl $0x20 -> %x30
recreate_app -- invalid state: unsup=1 in-mangle=1 xl8=0x0000ffffb7dc02e0 walk=0x0000ffffb7dc02e0

derekbruening commented 2 years ago

Looks like tool occupied all RAM :( OOM error between 688 and 10741 free mb memory

The rstat you showed had only 1.3GB of tool raw mmap. Not sure what other memory your tool is using.

Dump of heapmgt: Where we should look more closely? What info can be useful for us?

This was regarding the hang callstacks, just to double-check whether the PC values were all inside vmcode (code cache/reachable heap) or somewhere else. For better callstacks: gdb sometimes has trouble with an initial syscall frame under DR, but that can be worked around; xref https://dynamorio.org/page_debugging.html#autotoc_md146.

On AArch64 it is possible that there are weak memory ordering bugs remaining somewhere, or code update issues on some hardware: xref #2502, #1911.

derekbruening commented 2 years ago

As well we see trouble with mangling. Can it cause our deadlock? unsupported mangle instr movk %x30 $0xffff lsl $0x20 -> %x30 recreate_app -- invalid state: unsup=1 in-mangle=1 xl8=0x0000ffffb7dc02e0 walk=0x0000ffffb7dc02e0

This could cause a translation problem and subsequent corruption of app state, so it could technically cause any kind of app misbehavior. Best to fix it, yes.

RabbitPowerr commented 2 years ago

Hi, @derekbruening! We are investigating deadlocks on AArch64 + Java and see this:

#0  0x0000000071197824 in dynamorio_syscall ()
#1  0x0000000071171b24 in os_thread_yield () at /home/burdukov/dynamorio_fake_tag/core/unix/os.c:3448
#2  0x000000007117a2f0 in rwlock_wait_contended_writer (rwlock=0xfffd528fa1d4) at /home/burdukov/dynamorio_fake_tag/core/unix/os.c:9765
#3  0x000000007104dc20 in d_r_write_lock (rw=0xfffd528fa1d4) at /home/burdukov/dynamorio_fake_tag/core/utils.c:1249
#4  0x000000007106d988 in executable_areas_lock () at /home/burdukov/dynamorio_fake_tag/core/vmareas.c:3853
#5  0x000000007102cc78 in flush_fragments_in_region_start (dcontext=0xfffd5527f080, base=0x0, size=0, own_initexit_lock=false, free_futures=false, exec_invalid=false, force_synchall=true)
    at /home/burdukov/dynamorio_fake_tag/core/fragment.c:6816
#6  0x000000007102d110 in flush_fragments_from_region (dcontext=0xfffd5527f080, base=0x0, size=0, force_synchall=true, flush_completion_callback=0x0, user_data=0x0) at /home/burdukov/dynamorio_fake_tag/core/fragment.c:6868
#7  0x0000000071047b4c in dispatch_enter_dynamorio (dcontext=0xfffd5527f080) at /home/burdukov/dynamorio_fake_tag/core/dispatch.c:938
#8  0x000000007104539c in d_r_dispatch (dcontext=0xfffd5527f080) at /home/burdukov/dynamorio_fake_tag/core/dispatch.c:161
#9  0x0000ffff967693fc in ?? ()

The first strange thing is that base=0x0, size=0. Is that okay? Why do we do flush_fragments_from_region only on AArch64? And where do these addresses come from? We can't understand this: app_pc begin = (app_pc)dcontext->local_state->spill_space.r2;

Second, it looks like executable_area_overlap_bounds should call d_r_read_unlock if it can't find the first overlapping region:

bool
executable_area_overlap_bounds(app_pc start, app_pc end, app_pc *overlap_start /*OUT*/,
                               app_pc *overlap_end /*OUT*/, uint frag_flags, bool contig)
{
    int start_index, end_index; /* must be signed */
    int i;                      /* must be signed */
    ASSERT(overlap_start != NULL && overlap_end != NULL);
    d_r_read_lock(&executable_areas->lock);

    /* Find first overlapping region */
    if (!binary_search(executable_areas, start, end, NULL, &start_index, true /*first*/))
        return false;
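
For reference, a sketch of the fix this comment is suggesting (my reading of it, not an actual committed patch): release the read lock before the early return, so that a later d_r_write_lock on executable_areas cannot wait forever.

    /* Sketch of the suggested fix: don't return while still holding the
     * read lock acquired above. */
    if (!binary_search(executable_areas, start, end, NULL, &start_index,
                       true /*first*/)) {
        d_r_read_unlock(&executable_areas->lock);
        return false;
    }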

kuhanov commented 2 years ago

Hi @derekbruening. Could you share the https://dynamorio.org/dynamorio_docs/using.html#sec_64bit_reach page? It looks like it is not available now. We have the same issue where the vmcode heap is fully occupied and an OOM error happens. Why is there a 2G limitation? Is this a system limitation? I run 64-bit processes on a 64-bit machine. Thanks, Kirill

derekbruening commented 2 years ago

Remove the dynamorio_docs: https://dynamorio.org/using.html#sec_64bit_reach Where did you find the old URL? vmcode has a 32-bit-offset-reachable model. But vmheap has no limit. Are you sure it's out of vmcode address space? It is unlikely and we have never seen it before: that would be an awful lot of application code. We have seen machines run out of physical pages.

kuhanov commented 2 years ago

Remove the dynamorio_docs: https://dynamorio.org/using.html#sec_64bit_reach Where did you find the old URL? vmcode has a 32-bit-offset-reachable model. But vmheap has no limit. Are you sure it's out of vmcode address space? It is unlikely and we have never seen it before: that would be an awful lot of application code. We have seen machines run out of physical pages.

Where did you find the old URL?

In Google Groups: https://groups.google.com/g/dynamorio-users/c/xLudSjVFEcQ/m/2nZHrMBvBQAJ

kuhanov commented 2 years ago

Are you sure it's out of vmcode address space?

Yes, I added a log to the vmm_heap_commit function in heap.c:

res = os_heap_commit(p, size, prot, error_code);
if (!res) {
    dr_fprintf(STDERR, "MY_OOM %d %s\n", res, vmh->name);
}

vmh->name was 'vmcode'.

Kirill

derekbruening commented 2 years ago

Are you sure it's out of vmcode address space?

Yes, I added a log to the vmm_heap_commit function in heap.c:

res = os_heap_commit(p, size, prot, error_code);
if (!res) {
    dr_fprintf(STDERR, "MY_OOM %d %s\n", res, vmh->name);
}

No, that is not address space: os_heap_commit fails when there are no physical pages (it is used after space has already been reserved earlier). Sounds like this is not related to the 2GB space limit but is the same type of thing that is discussed above about overcommit: https://github.com/DynamoRIO/dynamorio/issues/3733#issuecomment-937917653

kuhanov commented 2 years ago

I added info about the free blocks to the OOM message:

dr_fprintf(STDERR, "ERROR vmm_heap_commit %s: size=%d p=" PFX " prot=%x blocks=%d free_blocks=%d\n",
           vmh->name, size, p, prot, vmh->num_blocks, vmh->num_free_blocks);

ERROR vmm_heap_commit vmcode: size=4096 p=0x0000000059d5a000 prot=7 blocks=524288 free_blocks=451433
mprotect_syscall return -12
Out of memory in extend - still alive after emergency free.
<Application /home/spark/usr/local/jdk/bin/java.or (217650). Out of memory. Program aborted. Source C, type 0x0000000000000022, code 0x000000000000000c.>

"free_blocks=451433" is about 1.7 GB. Looks like all vmcode_heap is free (just 300 Mb was occupied) BUT we have OOM error Kirill

kuhanov commented 2 years ago

Sounds like this is not related to the 2GB space limit but is the same type of thing that is discussed above about overcommit: #3733 (comment)

Do you mean we reserved 2 GB at heap init and use 300 MB, but then when we call mprotect_syscall the kernel does not allow us to commit the reserved block? Something like that? Kirill

derekbruening commented 2 years ago

Sounds like this is not related to the 2GB space limit but is the same type of thing that is discussed above about overcommit: #3733 (comment)

Do you mean we reserved 2 GB at heap init and use 300 MB, but then when we call mprotect_syscall the kernel does not allow us to commit the reserved block? Something like that? Kirill

Yes, we "reserve" 2GB of space as no-access contiguous pages, and "commit" by mprotect-ing individual pages incrementally to be accessible and writable. See the OOM comment above where mprotect returns ENOMEM. This seems like an issue with the kernel and the system rather than DR. In the past this was seen on systems where overcommit was disabled.

kuhanov commented 2 years ago

This seems like an issue with the kernel and the system rather than DR.

My value:

cat /proc/sys/vm/overcommit_memory
0

Do you suggest setting it to 1? Thx, Kirill

derekbruening commented 2 years ago

Could try 1. Should check memory usage on the machine. Maybe it doesn't have enough swap; maybe w/o DR it's using nearly everything.

kuhanov commented 2 years ago

Could try 1. Should check memory usage on the machine. Maybe it doesn't have enough swap; maybe w/o DR it's using nearly everything.

Hi, @derekbruening. I've caught the system issue. The address was in the heap range. Swap was not occupied at all.

My Java process exhausted the limit on the number of memory map areas a process may have.

https://www.kernel.org/doc/Documentation/sysctl/vm.txt

max_map_count:

This file contains the maximum number of memory map areas a process
may have. Memory map areas are used as a side-effect of calling
malloc, directly by mmap, mprotect, and madvise, and also when loading
shared libraries.

While most applications need less than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation.

The default value is 65536.

I've increased this value 10x and the OOM issue disappeared.

sysctl -w vm.max_map_count=655300
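
For illustration, this failure mode can be reproduced without DR at all (a hypothetical sketch): each mprotect with a different protection in the middle of a mapping splits one VMA into two or three, so incrementally committing scattered pages inflates the process's map count until mprotect itself fails with ENOMEM while memory and address space are still plentiful.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    const size_t page = 4096, npages = 1 << 20; /* 4GB of reserved space */
    char *base = mmap(NULL, npages * page, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return 1;
    /* Committing every second page splits a VMA on each call, marching the
     * process toward the vm.max_map_count limit. */
    for (size_t i = 0; i < npages; i += 2) {
        if (mprotect(base + i * page, page, PROT_READ | PROT_WRITE) != 0) {
            printf("failed after ~%zu maps: %s\n", i / 2, strerror(errno));
            break; /* typically ENOMEM near the limit, not a real OOM */
        }
    }
    return 0;
}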

It would probably be good to add such a case to the DynamoRIO documentation or to the OOM message. :)

Thanks, Kirill


derekbruening commented 2 years ago

I've increased this value 10x and the OOM issue disappeared.

sysctl -w vm.max_map_count=655300

It would probably be good to add such a case to the DynamoRIO documentation or to the OOM message. :)

Interesting: glad you figured it out. If you have ideas on where to add such information, it would be great if you submitted a pull request.

kuhanov commented 2 years ago

Interesting: glad you figured it out. If you have ideas on where to add such information, it would be great if you submitted a pull request.

In general, I suggest something simple, like a more informative error message with tips:

diff --git a/core/win32/events.mc b/core/win32/events.mc
index 5b5a9e33..a1d8ddd5 100644
--- a/core/win32/events.mc
+++ b/core/win32/events.mc
@@ -256,7 +256,7 @@ Severity = Error
 Facility = DRCore
 SymbolicName = MSG_OUT_OF_MEMORY
 Language=English
-Application %1!s! (%2!s!).  Out of memory.  Program aborted.  Source %3!s!, type %3!s!, code %4!s!.
+Application %1!s! (%2!s!).  Out of memory.  Program aborted.  Source %3!s!, type %3!s!, code %4!s!. Consider raising -vm_size, -vmheap_size or kernel parameter vm.max_map_count - the number of mmap() ranges.
 .

 MessageId =

kuhanov commented 2 years ago

Hi @derekbruening. I am trying the same runs on AArch64 now, and I see strange behaviour that we didn't see on x86. Under some conditions all Java threads (service and workload) hang on a futex syscall. We checked the stacks of every thread: in the hang case all threads are in native JVM code (not dynamically generated) and are in some synchronisation call (pthread_join, pthread_mutex_lock, pthread_cond_wait). setbt64 dumps the stacks, and each of these threads could legitimately be in the state it is in, BUT they are all in it at the same time :( One example of the strange behaviour is pthread_mutex_lock failing to lock a free mutex.

Register dump, where x8 = 0x62 is the futex syscall number:

 tid   pc                         x8             x0               x1               x2            x3
--------------------------------------------------------------------------------------------------------
19677  0xffff132e1948            0x62  0xfff90c1ef188            0x80             0x0  0xffff93b884c8
19676  0xffff132e1948            0x62  0xfff90c1c938c            0x80             0x0             0x0
19637  0xffff132e1948            0x62  0xfff90c1bdc8c            0x80             0x0  0xffff93f89208
19634  0xffff132e1948            0x62  0xfff90c1bb188            0x80             0x0  0xffff9418a288
19633  0xffff132e1948            0x62  0xffff97019f80           0x189             0x0             0x0
19501  0xffff132e1948            0x62  0xfff90c16408c            0x80             0x0             0x0
19495  0xffff132e1948            0x62  0xfff90c15e88c            0x80             0x0             0x0
19431  0xffff132e1948            0x62  0xfff90c150688            0x80             0x0  0xffff9498aff8
19161  0xffff132e1948            0x62  0xfff90c00ce88            0x80             0x0             0x0
19102  0xffff132e1948            0x62  0xffff958c52b0             0x0          0x4ad9             0x0

$pc for all threads

(gdb) thread apply all x/i $pc

Thread 10 (LWP 19677):
=> 0xffff132e1948:      svc     #0x0

Thread 9 (LWP 19676):
=> 0xffff132e1948:      svc     #0x0

Thread 8 (LWP 19637):
=> 0xffff132e1948:      svc     #0x0

Thread 7 (LWP 19634):
=> 0xffff132e1948:      svc     #0x0

Thread 6 (LWP 19633):
=> 0xffff132e1948:      svc     #0x0

Thread 5 (LWP 19501):
=> 0xffff132e1948:      svc     #0x0

Thread 4 (LWP 19495):
=> 0xffff132e1948:      svc     #0x0

Thread 3 (LWP 19431):
=> 0xffff132e1948:      svc     #0x0

Thread 2 (LWP 19161):
=> 0xffff132e1948:      svc     #0x0

Thread 1 (LWP 19102):
=> 0xffff132e1948:      svc     #0x0

0xffff132e1948 is do_syscall in some emitted code.

Maybe you have some ideas here? What limited logs could we collect? (We need to keep the size down because of the big overhead on Java workloads.) Is it possible to see the full stack with DynamoRIO frames to get the full picture? Currently setbt64 shows just the JVM frames.

It would be great to have some ideas, suggestions, or help here. We have spent a lot of time on this without catching the reason.

Also, if we disable JIT-ing or try to reproduce on static binaries, everything works fine and there are no hangs.

For tiny Java workloads we see about 1 failure per 100 runs, but if we run something like SPECjvm we get a hang on every run.

Thanks, Kirill

derekbruening commented 2 years ago

0xffff132e1948 is do_syscall in some emitted code.

DR doesn't consider SYS_futex to need intercepting so we would expect it to be inlined inside blocks in the code cache, rather than handled in do_syscall, so that is one odd thing. Debug logs would say whether it failed to find the number or not; is the number moved into x8 in a completely separate prior block or something and DR doesn't see it?

Is it possible to see the full stack with DynamoRIO frames to get the full picture? Currently setbt64 shows just the JVM frames.

If the thread is running app code there will not be any DR frames: DR has an empty callstack while in the code cache.

So no thread is in DR code and this is purely app code? So this is DR messing up the app's locks somehow?

Exclusive monitors come to mind: the -ldstex2cas feature, or what it's trying to fix, is one possible cause. See https://dynamorio.org/page_ldstex.html
Xref #5245, #5247: try -no_ldstex2cas?

Xref #4279

kuhanov commented 2 years ago

DR doesn't consider SYS_futex to need intercepting so we would expect it to be inlined inside blocks in the code cache, rather than handled in do_syscall, so that is one odd thing. Debug logs would say whether it failed to find the number or not; is the number moved into x8 in a completely separate prior block or something and DR doesn't see it?

Probably that was an experiment with the '-no_ignore_syscalls' option enabled.

kuhanov commented 2 years ago

Xref #5245, #5247: try -no_ldstex2cas?

Looks like that helps us. We checked a set of tiny workloads. We are going to run a heavy benchmark and share the result here. Thanks, Kirill

derekbruening commented 2 years ago

Xref #5245, #5247: try -no_ldstex2cas?

Looks like that helps us. We checked a set of tiny workloads. We are going to run a heavy benchmark and share the result here.

If that makes a big difference, we need to understand why. Is it A) emulating with compare-and-swap fundamentally breaks the synchronization used in the JVM or is it B) the conversion is fine in theory but there is a bug in handling some particular sequence (such as #5245). Naturally A is much harder to deal with.
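
If it is A, the classic suspect is ABA (a textbook illustration, not JVM code): LL/SC fails if the monitored word was written at all between the load and the store, while CAS only compares values, so an A->B->A update is invisible to it.

#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
    _Atomic long word = 0;          /* value A */
    long seen = atomic_load(&word); /* this thread reads A */
    atomic_store(&word, 1);         /* simulating another thread: A -> B */
    atomic_store(&word, 0);         /* ...and back:               B -> A */
    /* An stxr would fail here (the exclusive monitor was cleared by the
     * intervening writes); CAS sees the expected value A and succeeds. */
    long expected = seen;
    if (atomic_compare_exchange_strong(&word, &expected, 42))
        printf("CAS succeeded despite the A->B->A update\n");
    return 0;
}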

I would suggest a binary search over which blocks use -ldstex2cas to zero in on the app code where it matters (this is how #5245 was identified). IIRC I modified DR so a runtime option selected the block count at which -ldstex2cas was no longer applied, for easier binary searching from the command line.

kuhanov commented 2 years ago

Hi, @derekbruening. Continuing our experiments on AArch64: we saw a strange thing on I/O workloads. They take an awful amount of time. I compared the same run on x86 and it is fine.

ReadWrite.java is the simplest single-threaded Java application (MyFile.txt is a 350 MB text file):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.FileReader;
import java.io.IOException;

public class ReadWrite {
    public static void main(String[] args) {
        try {
            FileReader reader = new FileReader("MyFile.txt");
            FileWriter writer = new FileWriter("MyFile2.txt", true);

            BufferedReader bufferedReader = new BufferedReader(reader);
            BufferedWriter bufferedWriter = new BufferedWriter(writer);

            String line;
            while ((line = bufferedReader.readLine()) != null) {
                bufferedWriter.write(line);
                bufferedWriter.newLine();
            }
            bufferedReader.close();
            bufferedWriter.close(); // closing the buffered writer also flushes it

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

The original run without DynamoRIO took ~2 secs; x86 with DynamoRIO took about 3 secs, BUT AArch64 with DynamoRIO took 14 minutes (420x overhead) :((

perf hotspots (screenshot attached)

Opcode statistics:

Top 15 opcode execution counts in 64-bit AArch64 mode:
   90371625 : movz
  109692827 : cbz
  119677065 : ldrb
  141852747 : ldp
  149865875 : stp
  150764846 : sub
  186207835 : ubfm
  257289620 : str
  311057147 : orr
  372462682 : ldrsb
  373357880 : strh
  379415317 : strb
  736110803 : ldrh
  760824735 : ldr
  2145521407 : subs
  2182566199 : bcond
  2473296771 : add

Also, I've tried a native C++ version on AArch64 and it looks fine:

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main () {
  string line;
  ifstream myfile ("MyFile.txt");
  ofstream myfile2 ("MyFile2.txt");
  if (myfile.is_open() && myfile2.is_open())
  {
    while ( getline (myfile,line) )
    {
      myfile2 <<  line << '\n';
    }
    myfile.close();
    myfile2.close();
  }

  else cout << "Unable to open file";

  return 0;
}

Opcode statistics:

Top 15 opcode execution counts in 64-bit AArch64 mode:
   38881409 : sub
   41657277 : bl
   43253273 : cbz
   44942292 : cbnz
   48018122 : br
   54446025 : ret
   63293933 : ldur
   64014752 : movz
   64051241 : adrp
   99217398 : stp
  124038866 : bcond
  127033323 : subs
  127577743 : ldp
  148219600 : str
  211563739 : add
  222393285 : orr
  353538698 : ldr

Maybe you have ideas about what could be wrong here? Why do we see such a difference between x86 and AArch64 on these Java workloads? Thanks, Kirill

derekbruening commented 2 years ago

What was the verdict on -no_ldstex2cas?

derekbruening commented 2 years ago

I would suggest filing a separate issue on the AArch64 slowness on the small Java program with the info in https://github.com/DynamoRIO/dynamorio/issues/3733#issuecomment-1024102586. I think in general not much tuning has been done on that platform. Xref #4279.

kuhanov commented 2 years ago

-no_ldstex2cas

It was OK on small workloads, but we have crashes on heavy ones. Currently we are running the collection with no tools.

To get more stable results we added one more option to skip splitting such regions: -unsafe_build_ldstex. Using -unsafe_build_ldstex -no_ldstex2cas allowed us to have successful runs. I've checked the 2 simplest HiBench benchmarks; they finished successfully (just a big overhead on I/O operations: they took 2 hours against 2.5 minutes on x86).

We implemented a tool that counts how many instructions we skip in this case. It is less than 1% of the total instruction count. Kirill

kuhanov commented 2 years ago

I would suggest filing a separate issue on the AArch64 slowness on the small Java program with the info in #3733 (comment). I think in general not much tuning has been done on that platform. Xref #4279.

Moved to a separate issue: #5309

Thx, Kirill

derekbruening commented 2 years ago

-no_ldstex2cas

It was OK on small workloads, but we have crashes on heavy ones. Currently we are running the collection with no tools.

If possible it would be great to figure out what the problems are with -ldstex2cas, since that is the only good solution we have right now to this problem. As mentioned in https://github.com/DynamoRIO/dynamorio/issues/3733#issuecomment-1016995607, if there is an app where -no_ldstex2cas is better than -ldstex2cas, would it be possible to drill down on the exact application code sequence causing the problem? Is it a bug in the -ldstex2cas transformation, or something more fundamental like CAS not being sufficient?

To get more stable results we added one more option to skip splitting such regions: -unsafe_build_ldstex. Using -unsafe_build_ldstex -no_ldstex2cas allowed us to have successful runs. I've checked the 2 simplest HiBench benchmarks; they finished successfully (just a big overhead on I/O operations: they took 2 hours against 2.5 minutes on x86).

This does not make sense to me: if there's no tool and it's plain DR, -unsafe_build_ldstex should not provide any significant benefit. The only loads/stores in between a ldex and stex in separate blocks would be on the very first execution while it builds the second block, so on the 2nd iteration of the app's loop it should succeed. Unless there's an indirect branch in between them and it loads from DR's indirect branch hashtable which seems quite unlikely. It would be good to understand this. @AssadHashmi any ideas here?

derekbruening commented 2 years ago

To get more stable results we added one more option to skip splitting such regions: -unsafe_build_ldstex. Using -unsafe_build_ldstex -no_ldstex2cas allowed us to have successful runs. I've checked the 2 simplest HiBench benchmarks; they finished successfully (just a big overhead on I/O operations: they took 2 hours against 2.5 minutes on x86).

This does not make sense to me: if there's no tool and it's plain DR, -unsafe_build_ldstex should not provide any significant benefit. The only loads/stores in between a ldex and stex in separate blocks would be on the very first execution while it builds the second block, so on the 2nd iteration of the app's loop it should succeed. Unless there's an indirect branch in between them and it loads from DR's indirect branch hashtable which seems quite unlikely. It would be good to understand this. @AssadHashmi any ideas here?

@abhinav92003 points out that to link far-away blocks beyond the reach of AArch64 direct branches, DR is using a stp and ldr to link through the stub: https://dynamorio.org/page_aarch64_far.html This is likely the explanation, that for apps with enough code that far links are needed there is actually memory traffic between a ldex and stex in separate blocks.
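
To make the mechanism concrete, here is an illustrative C + inline-assembly sketch (my own, AArch64-only, not DR's code; whether a store to a different address clears the local exclusive monitor is implementation-defined, which would also fit this biting only on some hardware):

#include <stdint.h>
#include <stdio.h>

/* Emulates the effect of a far-link stub's stp/ldr landing between two
 * blocks that split a ldex..stex region: a memory access after the
 * load-exclusive may clear the monitor, so the store-exclusive fails. */
static int excl_incr_with_spill(uint64_t *addr, uint64_t *spill)
{
    uint64_t old, tmp;
    uint32_t status;
    __asm__ volatile(
        "ldaxr %[old], [%[a]]\n\t"         /* marks [a] exclusive */
        "str   %[old], [%[s]]\n\t"         /* intervening store: may clear it */
        "add   %[tmp], %[old], #1\n\t"
        "stlxr %w[st], %[tmp], [%[a]]\n\t" /* 0 = stored, 1 = monitor lost */
        : [old] "=&r"(old), [tmp] "=&r"(tmp), [st] "=&r"(status)
        : [a] "r"(addr), [s] "r"(spill)
        : "memory");
    return (int)status;
}

int main(void)
{
    uint64_t x = 0, scratch;
    printf("stlxr status: %d\n", excl_incr_with_spill(&x, &scratch));
    return 0;
}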

kuhanov commented 2 years ago

If possible it would be great to figure out what the problems are with -ldstex2cas, since that is the only good solution we have right now to this problem. As mentioned in #3733 (comment), if there is an app where -no_ldstex2cas is better than -ldstex2cas, would it be possible to drill down on the exact application code sequence causing the problem? Is it a bug in the -ldstex2cas transformation, or something more fundamental like CAS not being sufficient?

Hi. What would be the way to do that? Just logs? Is it possible to enable logs for a release build? Also, we could share a small reproducer like we have for the I/O Java workload overhead. Thx, Kirill

derekbruening commented 2 years ago

Already in https://github.com/DynamoRIO/dynamorio/issues/3733#issuecomment-1016995607:

I would suggest a binary search over which blocks use -ldstex2cas to zero in on the app code where it matters (this is how #5245 was identified). IIRC I modified DR so a runtime option selected the block count at which -ldstex2cas was no longer applied, for easier binary searching from the command line.

E.g., I had -os_aslr_version pick the count at which a global counter, added and incremented in mangle_exclusive_monitor_op, made the code just return from that function up front, and I iterated on that value.

Or since this issue might be relevant to our work we might be able to make time to try a reproducer if you have one.

kuhanov commented 2 years ago

Already in #3733 (comment):

I would suggest a binary search over which blocks use -ldstex2cas to zero in on the app code where it matters (this is how #5245 was identified). IIRC I modified DR so a runtime option selected the block count at which -ldstex2cas was no longer applied, for easier binary searching from the command line.

E.g., I had -os_aslr_version pick the count at which a global counter, added and incremented in mangle_exclusive_monitor_op, made the code just return from that function up front, and I iterated on that value.

Or since this issue might be relevant to our work we might be able to make time to try a reproducer if you have one.

Hi, @derekbruening. Sorry, I'm a little confused here :( Is there an option to limit the size of the exclusive region for mangling? Something like drrun -ldstex2cas -ldstex2cas_size=5 -- /bin/ls? What is the option name? In that case I would set the region mangling limit to 5 and leave the others (>5) as-is. Is that correct? I could not see limit checks inside mangle_exclusive_monitor_op. Thanks, Kirill

derekbruening commented 2 years ago

Already in #3733 (comment):

I would suggest a binary search over which blocks use -ldstex2cas to zero in on the app code where it matters (this is how #5245 was identified). IIRC I modified DR so a runtime option selected the block count at which -ldstex2cas was no longer applied, for easier binary searching from the command line.

E.g., I had -os_aslr_version pick the count at which a global counter, added and incremented in mangle_exclusive_monitor_op, made the code just return from that function up front, and I iterated on that value. Or since this issue might be relevant to our work we might be able to make time to try a reproducer if you have one.

Hi, @derekbruening. Sorry, I'm a little confused here :( Is there an option to limit the size of the exclusive region for mangling? Something like drrun -ldstex2cas -ldstex2cas_size=5 -- /bin/ls? What is the option name? In that case I would set the region mangling limit to 5 and leave the others (>5) as-is. Is that correct? I could not see limit checks inside mangle_exclusive_monitor_op.

I hacked it: I modified the DR source code, co-opting an existing option that isn't used for anything on Linux: -os_aslr_version. I added code at the top of mangle_exclusive_monitor_op, something like:

static int count;
if (count++ > DYNAMO_OPTION(os_aslr_version)) return NULL;

I think I also dumped the ilist (e.g., instrlist_disassemble(dcontext, NULL, ilist, STDERR)) for the block at the threshold of the bug (or every block that reaches there: there probably aren't that many) to easily view the culprit ldex/stex code.
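
Putting the two pieces together, the whole hack presumably looked something like this at the top of mangle_exclusive_monitor_op (my reconstruction from the description above, not the actual patch; the "+ 2" accounts for the post-increment):

/* Reconstruction: stop applying -ldstex2cas after N blocks, with N supplied
 * via the co-opted -os_aslr_version option, and disassemble the first
 * skipped block so the culprit ldex/stex sequence is visible. */
static int count;
if (count++ > DYNAMO_OPTION(os_aslr_version)) {
    if (count == DYNAMO_OPTION(os_aslr_version) + 2) /* first skipped block */
        instrlist_disassemble(dcontext, NULL, ilist, STDERR);
    return NULL; /* leave this exclusive pair unmangled */
}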

kuhanov commented 2 years ago

Hi. If I add such a condition to mangle_exclusive_monitor_op, I get hangs on the simple Java run: all threads are on the futex again.

    static int count = 0;
    if (count > DYNAMO_OPTION(os_aslr_version)) {
        instrlist_disassemble(dcontext, instr, ilist, STDERR);
        return NULL;
    }
    if (instr_is_exclusive_store(instr))
        count++;

Looks like I have a lot of exclusive load/store regions with branches inside:

TAG  0x0000fff90ebe6d18
 +0    L3 @0x0000fff90ebe6b58  f9472a80   ldr    +0x0e50(%x20)[8byte] -> %x0
 +4    L3 @0x0000fff90ebf0d80  52800021   movz   $0x0001 lsl $0x00 -> %w1
 +8    L3 @0x0000fff90ebe6d18  885ffc02   ldaxr  (%x0)[4byte] -> %w2
 +12   L3 @0x0000fff90ebe6858  7100005f   subs   %w2 $0x0000 lsl $0x00 -> %wzr
 +16   L3 @0x0000fff90ebe89d8  54000061   b.ne   $0x0000ffff92783b10
 +20   L4 @0x0000fff90ebe7e98  14000000   b      $0x0000ffff92783b08
END 0x0000fff90ebe6d18

So I mangle the store in one bb and don't mangle the load in another. Is there a way to disable splitting monitor regions? Kirill

AssadHashmi commented 2 years ago

Unless there's an indirect branch in between them and it loads from DR's indirect branch hashtable which seems quite unlikely. It would be good to understand this. @AssadHashmi any ideas here?

@abhinav92003 points out that to link far-away blocks beyond the reach of AArch64 direct branches, DR is using a stp and ldr to link through the stub: https://dynamorio.org/page_aarch64_far.html This is likely the explanation, that for apps with enough code that far links are needed there is actually memory traffic between a ldex and stex in separate blocks.

Sorry, this got filtered out! My (late) response is to confirm @abhinav92003's explanation: it's probably due to the code generated for far-link memory accesses between load/store exclusives.