Open rkgithubs opened 5 years ago
Please load DR symbols and get callstacks and see whether the waits are in DR code or not, and if in DR which lock(s) are involved.
We did it but the stack is not resolved.
libdynamorio.so base is 0x71000000
Running add-symbol-file /home/test/drio/lib64/release/libdynamorio.so 0x71000000+0x00014510
add symbol table from file "/home/test/drio/lib64/release/libdynamorio.so" at
.text_addr = 0x71014510
libjvm.so base is 0xffffaefd6000
Running add-symbol-file /home/openjdk8u/jvm/openjdk-1.8.0-internal-debug/jre/lib/aarch64/server/libjvm.so 0xffffaefd6000+0x0021f340
add symbol table from file "/home/openjdk8u/jvm/openjdk-1.8.0-internal-debug/jre/lib/aarch64/server/libjvm.so" at
.text_addr = 0xffffaf1f5340
(gdb) x /i $pc
=> 0xffff6c70f948: svc #0x0
(gdb) thread apply all bt
Thread 89 (LWP 94519):
#0 0x0000ffff6da3d3b0 in ?? ()
#1 0x0000ffffaecfd6c8 in ?? ()
#2 0xf1f1f1f100000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 88 (LWP 94517):
#0 0x0000ffff6c70f948 in ?? ()
#1 0x0000000000017135 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 87 (LWP 94445):
#0 0x0000ffff6c70f948 in ?? ()
#1 0x0000000000000048 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
What is printing "libdynamorio.so base is 0x71000000"? Please check the procfs maps file and see where libdynamorio.so is really loaded: modern kernels don't normally load it at 0x71. If libdynamorio.so is not encompassing 0xffff6c70f948, print the maps entry for that address: what is it? Is it in DR's allocated memory (from `heapmgt`)?
What is printing "libdynamorio.so base is 0x71000000"?
this is just a gdb Python script that parses /proc/pid/maps (it relies on re, subprocess, pathlib.Path, and the gdb module being imported at the top of the script):
def invoke(self, unused_arg, unused_from_tty):
    pid = int(gdb.selected_inferior().pid)
    exefile = "/proc/%d/exe" % pid
    drfile = str(Path(exefile).resolve())
    base = 0
    map_name = "/proc/%d/maps" % pid
    for MODULE in ['libdynamorio.so', 'libjvm.so']:
        with open(map_name, 'r') as map:
            for line in map:
                line = line.rstrip()
                if not line:
                    continue
                match = re.match(r'^([^-]*)-.*' + MODULE, line)
                if match:
                    base = int(match.group(1), 16)
                    print("%s base is 0x%x" % (MODULE, base))
                    drfile = line.split()[-1]
                    break
        p = subprocess.Popen(["objdump", "-h", drfile], stdout=subprocess.PIPE)
        stdout, _ = p.communicate()
        for line in stdout.splitlines():
            match = re.match(r'.*\.text *\w+ *\w+ *\w+ *(\w+)', str(line))
            if match:
                offs = match.group(1)
                cmd = "add-symbol-file %s 0x%x+0x%s" % (drfile, base, offs)
                # First clear the symbols (sometimes gdb has DR syms at the
                # wrong load address).
                gdb.execute("symbol-file")
                print("Running %s" % cmd)
                gdb.execute(cmd)
Looks like something goes wrong in the script. Kirill
@derekbruening . Looks like the script is correct. We dumped /proc/pid/maps for our java process
71000000-711a8000 r-xp 00000000 fd:00 5638685 /home/test/lib64/release/libdynamorio.so
711a8000-711bb000 ---p 00000000 00:00 0
711bb000-711d0000 rw-p 001ab000 fd:00 5638685 /home/test/lib64/release/libdynamorio.so
$pc and address from stack are outside any module
Thread 4 (LWP 17851):
=> 0xffff5c606948: svc #0x0
ffff5c606000-ffff5c607000 r-xp 00000000 00:00 0
(gdb) bt
#0 0x0000ffff5c606948 in ?? ()
#1 0x0000ffff9ebf4338 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) memquery 0x0000ffff9ebf4338
ffff9e9f9000-ffff9ebf6000 rw-p 00000000 00:00 0
Kirill
Hi, @derekbruening. One of our DynamoRIO tools hit a memory issue on heavy HiBench workloads:
Out of memory. Program aborted. Source C, type 0x0000000000000001, code 0x000000000000000c.>
I've tried to use this set of options: -vm_size 2G -vmheap_size 16G -opt_memory. But it looks like the VM is still not enough. Is there any other way to increase the VM reservation? Thx, Kirill.
@derekbruening . Looks like the script is correct. We dumped /proc/pid/maps for our java process
OK, different kernel making different mmap decisions. But the main question still remains:
If libdynamorio.so is not encompassing 0xffff6c70f948 print the maps entry for that address: what is it? Is it in DR's allocated memory (from *heapmgt).
Out of memory. Program aborted. Source C, type 0x0000000000000001, code 0x000000000000000c.>
C==OOM_COMMIT, 1==VMM_HEAP, 0xc==ENOMEM.
So it is unrelated to the address space, so increasing the reservation sizes will not make a difference. Your kernel is refusing to give out another physical page. I would check your system's overcommit settings (/proc/sys/vm/overcommit_memory) and overall memory situation. I would use -rstats_to_stderr to see how much DR and your clients are using but it may not really be an issue with them at all: it might be the app allocating a ton of physical pages and not using them and your kernel being set to not allow overcommit.
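A quick way to snapshot those settings alongside the overall memory situation is a few lines of Python reading the standard Linux /proc files (a minimal sketch, not a DR utility):

```python
# Snapshot the kernel overcommit policy and memory accounting.
# Mode 0 = heuristic overcommit, 1 = always allow, 2 = strict accounting.
def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def overcommit_report():
    mode = read_int("/proc/sys/vm/overcommit_memory")
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            meminfo[key] = int(rest.split()[0])  # values are in kB
    return {
        "overcommit_memory": mode,
        "MemAvailable_kB": meminfo.get("MemAvailable", 0),
        "Committed_AS_kB": meminfo.get("Committed_AS", 0),
        "CommitLimit_kB": meminfo.get("CommitLimit", 0),
    }

print(overcommit_report())
```

With strict accounting (mode 2), commits start failing once Committed_AS approaches CommitLimit even when plenty of pages are untouched.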
If libdynamorio.so is not encompassing 0xffff6c70f948 print the maps entry for that address: what is it? Is it in DR's allocated memory (from *heapmgt). How could we check that?
We run another collection with the same crash. $pc and address from stack are outside any module in /proc/pid/maps
Thread 4 (LWP 17851):
=> 0xffff5c606948: svc #0x0
ffff5c606000-ffff5c607000 r-xp 00000000 00:00 0
(gdb) bt
#0 0x0000ffff5c606948 in ?? ()
#1 0x0000ffff9ebf4338 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Kirill
it might be the app allocating a ton of physical pages and not using them
java is fine without DynamoRIO, and it is fine under DynamoRIO without a tool or with other tools. So it would be strange for the issue to be with the application itself. Kirill
-rstats_to_stderr
DynamoRIO statistics:
Peak threads under DynamoRIO control : 20
Threads ever created : 20
Total signals delivered : 1
System calls, pre : 766
System calls, post : 663
Application mmaps : 205
Application munmaps : 33
Synch attempt failure b/c not at safe spot : 65
Basic block fragments generated : 145043
Peak fcache combined capacity (bytes) : 85139456
Peak fcache units on live list : 1495
Peak special heap capacity (bytes) : 1716224
Peak heap units on live list : 1164
Peak client raw mmap size : 1315389440
Peak stack capacity (bytes) : 1859584
Peak heap capacity (bytes) : 395087872
Peak total memory from OS (bytes) : 513536000
Peak vmm blocks for unreachable heap : 98698
Peak vmm blocks for stack : 536
Peak vmm blocks for unreachable special heap : 120
Peak vmm blocks for unreachable special mmap : 140
Peak vmm blocks for reachable heap : 242
Peak vmm blocks for cache : 24027
Peak vmm blocks for reachable special heap : 474
Peak vmm blocks for reachable special mmap : 227
Peak vmm virtual memory in use (bytes) : 509190144
If libdynamorio.so is not encompassing 0xffff6c70f948 print the maps entry for that address: what is it? Is it in DR's allocated memory (from *heapmgt).
How could we check that?
I pointed at *heapmgt: print that in gdb and it will show the vmcode and vmheap memory ranges. But given that there is no libdynamorio code at these wait points, it does not seem to involve DR's futexes.
Peak client raw mmap size : 1315389440
You can see your client is using more memory than DR. DR itself is less than half a GB. Is your system really not enabling overcommit? It seems strange to let committed and unused pages artificially use up all your memory -- why not enable overcommit? I thought that was the default nowadays on all Linux distros.
why not enable overcommit?
My system is ubuntu 20.04.2. I suppose overcommit is enabled by default.
cat /proc/sys/vm/overcommit_memory
0
cat /proc/sys/vm/overcommit_kbytes
0
cat /proc/sys/vm/overcommit_ratio
50
Kirill
Looks like the tool occupied all RAM :( The OOM error hit between the 688 MB and 10741 MB free-memory samples:
              total        used        free      shared  buff/cache   available
Mem:          15847       13174         846          58        1825        2282
Swap:          2047          37        2010
              total        used        free      shared  buff/cache   available
Mem:          15847       13250         770          58        1825        2205
Swap:          2047          37        2010
              total        used        free      shared  buff/cache   available
Mem:          15847       13333         688          58        1825        2123
Swap:          2047          37        2010
              total        used        free      shared  buff/cache   available
Mem:          15847        3280       10741          58        1825       12176
Swap:          2047          37        2010
Kirill
Peak client raw mmap size : 1315389440
You can see your client is using more memory than DR. DR itself is less than half a GB. Is your system really not enabling overcommit? It seems strange to let committed and unused pages artificially use up all your memory -- why not enable overcommit? I thought that was the default nowadays on all Linux distros.
Hi, @derekbruening. Dump of *heapmgt:
(gdb) print *heapmgt
$1 = {vmheap = {start_addr = 0xfffd6dd91000 "", end_addr = 0xffff6dd91000 "", alloc_start = 0xfffd6dd91000 "", alloc_size = 8589938688, num_blocks = 2097152, lock = {lock_requests = -1, contended_event = -1,
name = 0x7116ddb0 "vmh_lock(mutex)@/home/burdukov/dynamorio_fake_tag/core/heap.c:951", rank = 76, owner = 0, owning_dcontext = 0x0, prev_owned_lock = 0x0, count_times_acquired = 1175, count_times_contended = 0,
count_times_spin_pause = 0, max_contended_requests = 0, count_times_spin_only = 0, prev_process_lock = 0x711c9cb0 <request_region_be_heap_reachable_lock>, next_process_lock = 0x711c9d28 <heap_unit_lock>, callstack = {0x0,
0x0, 0x0, 0x0}, app_lock = false, deleted = false}, num_free_blocks = 2074474, name = 0x7116dee8 "vmheap", blocks = 0xfffd6dd91000}, vmcode = {start_addr = 0xffff6dd92000 "", end_addr = 0xffffadd92000 "\177ELF\002\001\001",
alloc_start = 0xffff6dd92000 "", alloc_size = 1073741824, num_blocks = 262144, lock = {lock_requests = -1, contended_event = -1, name = 0x7116ddb0 "vmh_lock(mutex)@/home/burdukov/dynamorio_fake_tag/core/heap.c:951", rank = 76,
owner = 0, owning_dcontext = 0x0, prev_owned_lock = 0x0, count_times_acquired = 534, count_times_contended = 0, count_times_spin_pause = 0, max_contended_requests = 0, count_times_spin_only = 0,
prev_process_lock = 0x711c9380 <options_lock>, next_process_lock = 0x711c9cb0 <request_region_be_heap_reachable_lock>, callstack = {0x0, 0x0, 0x0, 0x0}, app_lock = false, deleted = false}, num_free_blocks = 254385,
name = 0x7116ddf8 "vmcode", blocks = 0xffff6dd92000}, dual_map_file = 0, vmcode_writable_base = 0x0, vmcode_writable_alloc = 0x0, heap = {units = 0xfffd732d6000, dead = 0x0, num_dead = 0}, global_units = {
top_unit = 0xfffd6ddd2000, cur_unit = 0xfffd6ffa4000, free_list = {0xfffd6de1d448 "@\324\341m\375\377", 0xfffd6ffdf0a8 "", 0xfffd6ffd9650 "\020\226\375o\375\377", 0xfffd6ddd38b8 "\340 \337m\375\377", 0x0, 0x0, 0x0, 0x0, 0x0,
0xfffd6ef45dc0 "\030\201\364n\375\377", 0xfffd6ffd00c8 "`]\375o\375\377"}, dcontext = 0xffffffffffffffff, which = VMM_HEAP, writable = true}, global_nonpersistent_units = {top_unit = 0xfffd6dde2000, cur_unit = 0xfffd7335a000,
free_list = {0x0, 0xfffd6fbb22b8 "", 0xfffd73344cc0 "", 0xfffd7336f5e8 "P\366\066s\375\377", 0xfffd72ee7648 "\250\255\356r\375\377", 0x0, 0x0, 0x0, 0x0, 0x0, 0xfffd6e2f4048 "\320t\360m\375\377"}, dcontext = 0xffffffffffffffff,
which = VMM_HEAP, writable = true}, global_heap_writable = true, global_unprotected_units = {top_unit = 0xfffd6ddda000, cur_unit = 0xfffd6ddda000, free_list = {0x0, 0xfffd6ddda040 "", 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0}, dcontext = 0xffffffffffffffff, which = VMM_HEAP, writable = true}, global_reachable_units = {top_unit = 0xffff6dd9b000, cur_unit = 0xffff6dd9b000, free_list = {0x0 <repeats 11 times>}, dcontext = 0xffffffffffffffff,
which = (VMM_HEAP | VMM_REACHABLE), writable = true}}
Where should we look more closely? What info could be useful for us?
We also see trouble with mangling. Can it cause our deadlock?
unsupported mangle instr movk %x30 $0xffff lsl $0x20 -> %x30
recreate_app -- invalid state: unsup=1 in-mangle=1 xl8=0x0000ffffb7dc02e0 walk=0x0000ffffb7dc02e0
Looks like tool occupied all RAM :( OOM error between 688 and 10741 free mb memory
The rstat you showed had only 1.3GB of tool raw mmap. Not sure what other memory your tool is using.
Dump of heapmgt: Where we should look more closely? What info can be useful for us?
This was regarding the hang callstacks, just to double-check whether the PC values where all inside vmcode (code cache/reachable heap) or somewhere else. For better callstacks: gdb has trouble sometimes with an initial syscall frame under DR but that can be worked around; xref https://dynamorio.org/page_debugging.html#autotoc_md146.
On AArch64 it is possible that there are weak memory ordering bugs remaining somewhere, or code update issues on some hardware: xref #2502, #1911.
We also see trouble with mangling. Can it cause our deadlock? unsupported mangle instr movk %x30 $0xffff lsl $0x20 -> %x30 recreate_app -- invalid state: unsup=1 in-mangle=1 xl8=0x0000ffffb7dc02e0 walk=0x0000ffffb7dc02e0
This could cause a translation problem and subsequent corruption of app state, so it could technically cause any kind of app misbehavior. Best to fix it, yes.
Hi, @derekbruening! We are investigating deadlocks on AArch64 + Java and see this:
#0 0x0000000071197824 in dynamorio_syscall ()
#1 0x0000000071171b24 in os_thread_yield () at /home/burdukov/dynamorio_fake_tag/core/unix/os.c:3448
#2 0x000000007117a2f0 in rwlock_wait_contended_writer (rwlock=0xfffd528fa1d4) at /home/burdukov/dynamorio_fake_tag/core/unix/os.c:9765
#3 0x000000007104dc20 in d_r_write_lock (rw=0xfffd528fa1d4) at /home/burdukov/dynamorio_fake_tag/core/utils.c:1249
#4 0x000000007106d988 in executable_areas_lock () at /home/burdukov/dynamorio_fake_tag/core/vmareas.c:3853
#5 0x000000007102cc78 in flush_fragments_in_region_start (dcontext=0xfffd5527f080, base=0x0, size=0, own_initexit_lock=false, free_futures=false, exec_invalid=false, force_synchall=true)
at /home/burdukov/dynamorio_fake_tag/core/fragment.c:6816
#6 0x000000007102d110 in flush_fragments_from_region (dcontext=0xfffd5527f080, base=0x0, size=0, force_synchall=true, flush_completion_callback=0x0, user_data=0x0) at /home/burdukov/dynamorio_fake_tag/core/fragment.c:6868
#7 0x0000000071047b4c in dispatch_enter_dynamorio (dcontext=0xfffd5527f080) at /home/burdukov/dynamorio_fake_tag/core/dispatch.c:938
#8 0x000000007104539c in d_r_dispatch (dcontext=0xfffd5527f080) at /home/burdukov/dynamorio_fake_tag/core/dispatch.c:161
#9 0x0000ffff967693fc in ?? ()
The first strange thing is that base=0x0, size=0. Is that okay? Why do we do flush_fragments_from_region only for AARCH64? And where do these addresses come from? We can't understand this line:
app_pc begin = (app_pc)dcontext->local_state->spill_space.r2;
Second, it looks like executable_area_overlap_bounds should call d_r_read_unlock if it can't find the first overlapping region:
bool
executable_area_overlap_bounds(app_pc start, app_pc end, app_pc *overlap_start /*OUT*/,
                               app_pc *overlap_end /*OUT*/, uint frag_flags, bool contig)
{
    int start_index, end_index; /* must be signed */
    int i;                      /* must be signed */
    ASSERT(overlap_start != NULL && overlap_end != NULL);
    d_r_read_lock(&executable_areas->lock);
    /* Find first overlapping region */
    if (!binary_search(executable_areas, start, end, NULL, &start_index, true /*first*/))
        return false; /* the reported problem: returns without releasing the read lock */
Hi @derekbruening, could you share the https://dynamorio.org/dynamorio_docs/using.html#sec_64bit_reach page? It looks like it is not available now. We have an issue where vmcode_heap is fully occupied and an OOM error happens. Why is there a 2G limitation? Is this a system limitation? I run 64-bit processes on a 64-bit machine. Thanks, Kirill
Remove the dynamorio_docs: https://dynamorio.org/using.html#sec_64bit_reach Where did you find the old URL? vmcode has a 32-bit-offset-reachable model. But vmheap has no limit. Are you sure it's out of vmcode address space? It is unlikely and we have never seen it before: that would be an awful lot of application code. We have seen machines run out of physical pages.
Where did you find the old URL? In Google Groups: https://groups.google.com/g/dynamorio-users/c/xLudSjVFEcQ/m/2nZHrMBvBQAJ
Are you sure it's out of vmcode address space?
yes, I added a log in the vmm_heap_commit function in heap.c:
res = os_heap_commit(p, size, prot, error_code);
if (!res) {
    dr_fprintf(STDERR, "MY_OOM %d %s\n", res, vmh->name);
}
vmh->name was 'vmcode'. Kirill
No, that is not address space: os_heap_commit fails when there are no physical pages (it is used after space has already been reserved earlier). Sounds like this is not related to the 2GB space limit but is the same type of thing that is discussed above about overcommit: https://github.com/DynamoRIO/dynamorio/issues/3733#issuecomment-937917653
I added info about free blocks on the OOM issue:
dr_fprintf(STDERR, "ERROR vmm_heap_commit %s: size=%d p=" PFX " prot=%x blocks=%d free_blocks=%d\n",
           vmh->name, size, p, prot, vmh->num_blocks, vmh->num_free_blocks);
ERROR vmm_heap_commit vmcode: size=4096 p=0x0000000059d5a000 prot=7 blocks=524288 free_blocks=451433
mprotect_syscall return -12
Out of memory in extend - still alive after emergency free.<Application /home/spark/usr/local/jdk/bin/java.or (217650). Out of memory. Program aborted. Source C, type 0x0000000000000022, code 0x000000000000000c.>
"free_blocks=451433" is about 1.7 GB. It looks like almost all of vmcode_heap is free (just ~300 MB was occupied), BUT we still get the OOM error. Kirill
Sounds like this is not related to the 2GB space limit but is the same type of thing that is discussed above about overcommit: #3733 (comment)
Do you mean we allocated 2 GB in heap init and use 300 MB, but when we call mprotect_syscall the kernel does not let us commit the already-allocated block? Something like this? Kirill
Yes, we "reserve" 2GB of space as no-access contiguous pages, and "commit" by mprotect-ing individual pages incrementally to be accessible and writable. See the OOM comment above where mprotect returns ENOMEM. This seems like an issue with the kernel and the system rather than DR. In the past this was seen on systems where overcommit was disabled.
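The reserve-then-commit pattern described above can be reproduced outside DR with plain mmap/mprotect; a toy Linux-only illustration via ctypes (not DR's actual heap code):

```python
import ctypes, ctypes.util, mmap

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]

PROT_NONE, PROT_READ, PROT_WRITE = 0, 1, 2
MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20
PAGE = mmap.PAGESIZE

# "Reserve": one contiguous no-access mapping -- consumes address space
# (one /proc/<pid>/maps entry) but no physical pages.
reserve = 64 * PAGE
base = libc.mmap(None, reserve, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
assert base not in (None, 2**64 - 1), "mmap failed"

# "Commit": mprotect individual pages to r/w as the heap grows.  Each page
# whose protection differs from its neighbors splits the VMA, so a
# commit-heavy process can hit vm.max_map_count and get ENOMEM (-12) from
# mprotect even with plenty of reserved space left.
rc = libc.mprotect(base + 4 * PAGE, PAGE, PROT_READ | PROT_WRITE)
assert rc == 0
ctypes.memset(base + 4 * PAGE, 0xAB, PAGE)  # committed page is now usable
```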
This seems like an issue with the kernel and the system rather than DR.
my values:
cat /proc/sys/vm/overcommit_memory
0
Do you suggest setting it to 1? Thx, Kirill
Could try 1. Should check memory usage on the machine. Maybe it doesn't have enough swap; maybe w/o DR it's using nearly everything.
Hi, @derekbruening. I've caught the system issue. The address was in the heap range. Swap was not occupied at all.
My Java process exhausted the limit on the number of memory map areas a process may have.
https://www.kernel.org/doc/Documentation/sysctl/vm.txt
max_map_count:
This file contains the maximum number of memory map areas a process
may have. Memory map areas are used as a side-effect of calling
malloc, directly by mmap, mprotect, and madvise, and also when loading
shared libraries.
While most applications need less than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation.
The default value is 65536.
I've increased this value 10x and the OOM issue disappeared.
sysctl -w vm.max_map_count=655300
It would probably be good to add this case to the DRIO documentation or to the OOM message. :)
Thanks, Kirill
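The headroom against this limit is easy to monitor; a minimal sketch reading the standard Linux /proc paths:

```python
import os

# Compare a process's current map count against vm.max_map_count.
def map_count(pid):
    with open("/proc/%d/maps" % pid) as f:
        return sum(1 for _ in f)

def max_map_count():
    with open("/proc/sys/vm/max_map_count") as f:
        return int(f.read())

pid = os.getpid()
print("%d of %d map entries used" % (map_count(pid), max_map_count()))
```

Watching this ratio for the java process while the workload runs would show the count climbing toward the limit before the ENOMEM appears.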
Interesting: glad you figured it out. If you have ideas on where to add such information, it would be great if you submitted a pull request.
In general, I suggest something simple, like a more informative error message with tips:
diff --git a/core/win32/events.mc b/core/win32/events.mc
index 5b5a9e33..a1d8ddd5 100644
--- a/core/win32/events.mc
+++ b/core/win32/events.mc
@@ -256,7 +256,7 @@ Severity = Error
Facility = DRCore
SymbolicName = MSG_OUT_OF_MEMORY
Language=English
-Application %1!s! (%2!s!). Out of memory. Program aborted. Source %3!s!, type %3!s!, code %4!s!.
+Application %1!s! (%2!s!). Out of memory. Program aborted. Source %3!s!, type %3!s!, code %4!s!. Consider raising -vm_size, -vmheap_size or kernel parameter vm.max_map_count - the number of mmap() ranges.
.
MessageId =
Hi @derekbruening. I am trying the same runs for AArch64 now, and I see strange behaviour that we didn't see on x86. Under some conditions all Java threads (service and workload) hang on the futex syscall. We checked the stacks for every thread. In the hang case all threads are in native JVM code (not dynamically generated) and call some synchronisation (pthread_join, pthread_mutex_lock, pthread_cond_wait). setbt64 dumps the stacks, and each of these threads could legitimately be in the state it is in, BUT they are all in it at the same time :( One example of the strange behaviour is pthread_mutex_lock failing to lock a free mutex.
Register dump, where x8 = 0x62 is the futex syscall number:
tid pc x8 x0 x1 x2 x3
--------------------------------------------------------------------------------------------------------
19677 0xffff132e1948 0x62 0xfff90c1ef188 0x80 0x0 0xffff93b884c8
19676 0xffff132e1948 0x62 0xfff90c1c938c 0x80 0x0 0x0
19637 0xffff132e1948 0x62 0xfff90c1bdc8c 0x80 0x0 0xffff93f89208
19634 0xffff132e1948 0x62 0xfff90c1bb188 0x80 0x0 0xffff9418a288
19633 0xffff132e1948 0x62 0xffff97019f80 0x189 0x0 0x0
19501 0xffff132e1948 0x62 0xfff90c16408c 0x80 0x0 0x0
19495 0xffff132e1948 0x62 0xfff90c15e88c 0x80 0x0 0x0
19431 0xffff132e1948 0x62 0xfff90c150688 0x80 0x0 0xffff9498aff8
19161 0xffff132e1948 0x62 0xfff90c00ce88 0x80 0x0 0x0
19102 0xffff132e1948 0x62 0xffff958c52b0 0x0 0x4ad9 0x0
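For reference, the futex op word in x1 can be decoded with the constant values from Linux's include/uapi/linux/futex.h (a small sketch, not DR code):

```python
# Decode the futex op word (register x1 in the dump above).
FUTEX_PRIVATE_FLAG = 0x80
FUTEX_CLOCK_REALTIME = 0x100
FUTEX_OPS = {0: "FUTEX_WAIT", 1: "FUTEX_WAKE",
             9: "FUTEX_WAIT_BITSET", 10: "FUTEX_WAKE_BITSET"}

def decode_futex_op(op_word):
    flags = [name for bit, name in ((FUTEX_PRIVATE_FLAG, "PRIVATE"),
                                    (FUTEX_CLOCK_REALTIME, "CLOCK_REALTIME"))
             if op_word & bit]
    base = op_word & ~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)
    return FUTEX_OPS.get(base, hex(base)), flags

print(decode_futex_op(0x80))   # ('FUTEX_WAIT', ['PRIVATE'])
print(decode_futex_op(0x189))  # ('FUTEX_WAIT_BITSET', ['PRIVATE', 'CLOCK_REALTIME'])
```

So the x1 = 0x80 rows are private FUTEX_WAITs and the 0x189 row is a private timed wait-bitset, consistent with threads blocked on pthread mutexes and condition variables.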
$pc for all threads
(gdb) thread apply all x/i $pc
Thread 10 (LWP 19677):
=> 0xffff132e1948: svc #0x0
Thread 9 (LWP 19676):
=> 0xffff132e1948: svc #0x0
Thread 8 (LWP 19637):
=> 0xffff132e1948: svc #0x0
Thread 7 (LWP 19634):
=> 0xffff132e1948: svc #0x0
Thread 6 (LWP 19633):
=> 0xffff132e1948: svc #0x0
Thread 5 (LWP 19501):
=> 0xffff132e1948: svc #0x0
Thread 4 (LWP 19495):
=> 0xffff132e1948: svc #0x0
Thread 3 (LWP 19431):
=> 0xffff132e1948: svc #0x0
Thread 2 (LWP 19161):
=> 0xffff132e1948: svc #0x0
Thread 1 (LWP 19102):
=> 0xffff132e1948: svc #0x0
0xffff132e1948 is do_syscall in some emit code.
Maybe you have some ideas here? What minimal logs could we collect? (We need to keep the size down because of the big overhead on Java workloads.) Is it possible to see the full stack with DynamoRIO frames to see the full picture? Currently setbt64 shows just the JVM frames.
It would be great to get some ideas, suggestions, or help here. We have spent a lot of time but have not caught the reason for this behaviour.
Also, if we disable JIT-ting or try to reproduce on static binaries, everything works fine and there are no hangs.
For tiny Java workloads we see about 1 failure in 100 runs, but if we run something like SPECjvm we get a hang on every run.
Thanks, Kirill
0xffff132e1948 is do_syscall in some emit code.
DR doesn't consider SYS_futex to need intercepting so we would expect it to be inlined inside blocks in the code cache, rather than handled in do_syscall, so that is one odd thing. Debug logs would say whether it failed to find the number or not; is the number moved into x8 in a completely separate prior block or something and DR doesn't see it?
Is it possible to see full stack with dynamorio frames to see fulle picture? currently setbt64 has just jvm frames.
If the thread is running app code there will not be any DR frames: DR has an empty callstack while in the code cache.
So no thread is in DR code and this is purely app code? So this is DR messing up the app's locks somehow?
Exclusive monitors come to mind: the -ldstex2cas feature, or what it's trying to fix, is one possible cause.
See https://dynamorio.org/page_ldstex.html
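To illustrate why rewriting a load-exclusive/store-exclusive pair as compare-and-swap is not always equivalent, here is a toy simulation (pure illustration, not DR code): a store-exclusive fails if the location was written at all since the load-exclusive, while CAS only compares values, so an A -> B -> A interference sequence slips through (the ABA problem):

```python
class Cell:
    """Toy memory cell exposing both LL/SC and CAS semantics."""
    def __init__(self, value):
        self.value = value
        self.version = 0          # bumped on every store

    def store(self, value):
        self.value = value
        self.version += 1

    def ll(self):
        # Load-exclusive: returns the value plus a monitor snapshot.
        return self.value, self.version

    def sc(self, snapshot_version, new):
        # Store-exclusive: fails if ANY store happened since the LL.
        if self.version != snapshot_version:
            return False
        self.store(new)
        return True

    def cas(self, expected, new):
        # Compare-and-swap: only looks at the current value.
        if self.value != expected:
            return False
        self.store(new)
        return True

cell = Cell("A")
val, ver = cell.ll()
cell.store("B"); cell.store("A")   # interference: A -> B -> A
assert cell.sc(ver, "C") is False  # LL/SC detects the intervening writes
assert cell.cas("A", "C") is True  # CAS does not: ABA slips through
```

Whether this semantic gap actually matters for the JVM's lock code, or whether the hangs come from a bug in the transformation, is exactly the A-vs-B question raised later in this thread.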
Xref #5245, #5247: try -no_ldstex2cas?
Xref #4279
DR doesn't consider SYS_futex to need intercepting so we would expect it to be inlined inside blocks in the code cache, rather than handled in do_syscall, so that is one odd thing. Debug logs would say whether it failed to find the number or not; is the number moved into x8 in a completely separate prior block or something and DR doesn't see it?
That was probably an experiment with the '-no_ignore_syscalls' option enabled.
Xref #5245, #5247: try -no_ldstex2cas?
Looks like that helps us. We checked a set of tiny workloads. We are going to run a heavy benchmark and share the result here. Thanks, Kirill
If that makes a big difference, we need to understand why. Is it A) emulating with compare-and-swap fundamentally breaks the synchronization used in the JVM or is it B) the conversion is fine in theory but there is a bug in handling some particular sequence (such as #5245). Naturally A is much harder to deal with.
I would suggest a binary search over which blocks use -ldstex2cas to zero in on the app code where it matters (this is how #5245 was identified). IIRC I modified DR so a runtime option selected the block count at which -ldstex2cas was no longer applied, for easier binary searching from the command line.
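That binary search can be driven by a tiny script. A sketch, where run_with_cutoff is a hypothetical wrapper (not an existing DR tool) that launches the app with -ldstex2cas applied only to the first cutoff blocks and returns True if the hang reproduces:

```python
def find_first_bad_block(num_blocks, run_with_cutoff):
    """Smallest cutoff at which the failure appears, assuming the failure
    is monotonic in the cutoff (cutoff 0 passes, num_blocks fails)."""
    lo, hi = 0, num_blocks          # invariant: lo passes, hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_with_cutoff(mid):    # True -> hang reproduced
            hi = mid
        else:
            lo = mid
    return hi

# Toy stand-in: pretend transforming block 1234 introduces the hang.
assert find_first_bad_block(100000, lambda cutoff: cutoff >= 1234) == 1234
```

With an intermittent hang, run_with_cutoff would need to repeat each trial several times before declaring a cutoff "good", but the log-number-of-runs shape stays the same.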
Hi, @derekbruening. Continuing our experiments on AArch64: we see a strange thing on I/O workloads. They take an awful amount of time. I compared the same run on x86 and it is fine.
ReadWrite.java, the simplest single-threaded Java application (MyFile.txt is a 350 MB text file):
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.FileReader;
import java.io.IOException;

public class ReadWrite {
    public static void main(String[] args) {
        try {
            FileReader reader = new FileReader("MyFile.txt");
            FileWriter writer = new FileWriter("MyFile2.txt", true);
            BufferedReader bufferedReader = new BufferedReader(reader);
            BufferedWriter bufferedWriter = new BufferedWriter(writer);
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                bufferedWriter.write(line);
                bufferedWriter.newLine();
            }
            reader.close();
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The original run without DynamoRIO took ~2 secs. x86 with DynamoRIO took about 3 secs, BUT AArch64 with DRIO took 14 minutes (420x overhead) :((
perf hotspots: (screenshot not reproduced)
Opcode statistics:
Top 15 opcode execution counts in 64-bit AArch64 mode:
90371625 : movz
109692827 : cbz
119677065 : ldrb
141852747 : ldp
149865875 : stp
150764846 : sub
186207835 : ubfm
257289620 : str
311057147 : orr
372462682 : ldrsb
373357880 : strh
379415317 : strb
736110803 : ldrh
760824735 : ldr
2145521407 : subs
2182566199 : bcond
2473296771 : add
I've also tried a native C++ version on AArch64 and it looks fine:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main() {
    string line;
    ifstream myfile("MyFile.txt");
    ofstream myfile2("MyFile2.txt");
    if (myfile.is_open() && myfile2.is_open()) {
        while (getline(myfile, line)) {
            myfile2 << line << '\n';
        }
        myfile.close();
        myfile2.close();
    } else
        cout << "Unable to open file";
    return 0;
}
Opcode statistics:
Top 15 opcode execution counts in 64-bit AArch64 mode:
38881409 : sub
41657277 : bl
43253273 : cbz
44942292 : cbnz
48018122 : br
54446025 : ret
63293933 : ldur
64014752 : movz
64051241 : adrp
99217398 : stp
124038866 : bcond
127033323 : subs
127577743 : ldp
148219600 : str
211563739 : add
222393285 : orr
353538698 : ldr
Maybe you have ideas about what could be wrong here? Why do we have such a difference between x86 and AArch64 on such Java workloads? Thanks, Kirill
What was the verdict on -no_ldstex2cas?
I would suggest filing a separate issue on the aarch64 slowness on the small Java program with the info in https://github.com/DynamoRIO/dynamorio/issues/3733#issuecomment-1024102586. I think in general not much tuning has been done on that platform. Xref #4279.
-no_ldstex2cas
It was OK on small workloads, but we have crashes on heavy ones. Currently we run the collection with no tools.
To have more stable results we added one more option to skip splitting such regions: -unsafe_build_ldstex.
So, using -unsafe_build_ldstex -no_ldstex2cas allowed us to have successful runs.
I've checked on the 2 simplest HiBench benchmarks; they finished successfully (just a big overhead on I/O operations: they took 2 hours against 2.5 minutes on x86).
We implemented a tool that counted how many instructions we skip in this case. It is less than 1% of the total instruction count. Kirill
I would suggest filing a separate issue on the aarch64 slowness on the small Java program with the info in #3733 (comment). I think in general not much tuning has been done on that platform. Xref #4279.
Moved to separate issue
Thx, Kirill
-no_ldstex2cas
It was ok on small workloads but we have crashes on heavy ones. Currently we run collection with no tools.
If possible it would be great to figure out what the problems are with -ldstex2cas, since that is the only good solution we have right now to this problem. As mentioned in https://github.com/DynamoRIO/dynamorio/issues/3733#issuecomment-1016995607, if there is an app where -no_ldstex2cas is better than -ldstex2cas, would it be possible to drill down on the exact application code sequence causing the problem? Is it a bug in the -ldstex2cas transformation, or something more fundamental like CAS not being sufficient?
To have more stable results we added one more option to skip splitting such regions: -unsafe_build_ldstex. So, using -unsafe_build_ldstex -no_ldstex2cas allowed us to have successful runs. I've checked on the 2 simplest HiBench benchmarks; they finished successfully (just a big overhead on I/O operations: they took 2 hours against 2.5 minutes on x86).
This does not make sense to me: if there's no tool and it's plain DR, -unsafe_build_ldstex should not provide any significant benefit. The only loads/stores in between a ldex and stex in separate blocks would be on the very first execution while it builds the second block, so on the 2nd iteration of the app's loop it should succeed. Unless there's an indirect branch in between them and it loads from DR's indirect branch hashtable, which seems quite unlikely. It would be good to understand this. @AssadHashmi any ideas here?
@abhinav92003 points out that to link far-away blocks beyond the reach of AArch64 direct branches, DR is using a stp and ldr to link through the stub: https://dynamorio.org/page_aarch64_far.html This is likely the explanation: for apps with enough code that far links are needed, there is actually memory traffic between a ldex and stex in separate blocks.
If possible it would be great to figure out what the problems are with `-ldstex2cas`, since that is the only good solution we have right now to this problem. As mentioned in #3733 (comment), if there is an app where `-no_ldstex2cas` is better than `-ldstex2cas`, would it be possible to drill down on the exact application code sequence causing the problem? Is it a bug in the `-ldstex2cas` transformation, or something more fundamental like CAS not being sufficient?
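To make the "CAS not being sufficient" question concrete, here is a conceptual sketch (assumed, not DR's actual transformation code) of what an ldstex-to-CAS rewrite amounts to: the app's load-exclusive becomes a plain acquire load and the store-exclusive becomes a compare-and-swap against the observed value. The function name is illustrative only.

```c
#include <assert.h>
#include <stdatomic.h>

/* Conceptual sketch (not DR's actual code) of an ldstex-to-CAS rewrite:
 * an app retry loop built from ldaxr/.../stlxr becomes a CAS against the
 * value the load observed.  One fundamental difference: a real
 * store-exclusive fails after any intervening write clears the exclusive
 * monitor, while CAS only compares the value, so it cannot detect an
 * intervening write that restored the old value (the ABA problem). */
static int increment_via_cas(_Atomic int *addr)
{
    int observed, desired;
    do {
        observed = atomic_load(addr); /* stands in for the ldaxr */
        desired = observed + 1;       /* app code between ldex and stex */
    } while (!atomic_compare_exchange_weak(addr, &observed, desired));
    return desired;                   /* stands in for a successful stlxr */
}
```

For a plain counter increment like this the CAS form is equivalent; the ABA caveat only matters for algorithms that rely on the monitor detecting any intervening write rather than a changed value.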
Hi. What could be the way to do that? Just logs? Is it possible to enable logs for a release build? Also, we could share a small reproducer like the one we have for the I/O Java workload overhead. Thx, Kirill
Already in https://github.com/DynamoRIO/dynamorio/issues/3733#issuecomment-1016995607:

I would suggest a binary search over which blocks use `-ldstex2cas` to zero in on the app code where it matters (this is how #5245 was identified). IIRC I modified DR so a runtime option selected the block count at which `-ldstex2cas` was no longer applied, for easier binary searching from the command line. E.g., I had `-os_aslr_version` pick the value of a global counter, added and incremented in `mangle_exclusive_monitor_op`, at which the code just returned from that function up front, and iterated on that value.

Or since this issue might be relevant to our work, we might be able to make time to try a reproducer if you have one.
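The bisection step itself can be sketched as ordinary binary search over the threshold. In this sketch the function and predicate names are hypothetical; in practice `run_fails` would be a real drrun invocation with the hacked option rather than a callback.

```c
#include <assert.h>

/* Illustrative sketch of the binary search described above: given a
 * monotonic predicate run_fails(threshold) (in practice, running the app
 * under DR with the co-opted option set to that threshold), find the
 * smallest threshold at which the run first fails, i.e. the index of the
 * block whose mangling introduces the problem. */
static int find_first_failing_threshold(int lo, int hi, int (*run_fails)(int))
{
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (run_fails(mid))
            hi = mid;     /* fails at mid: culprit block is at or below mid */
        else
            lo = mid + 1; /* passes at mid: culprit block is above mid */
    }
    return lo;
}

/* Stand-in predicate: pretend mangling block 37 introduces the bug. */
static int fake_run_fails(int threshold) { return threshold >= 37; }
```

Each probe is one full app run, so the search needs only log2(block count) runs instead of one per block.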
Hi, @derekbruening.
Sorry, I'm a little confused here :(
Is there an option to limit the size of the exclusive region for mangling? Something like `drrun -ldstex2cas -ldstex2cas_size=5 -- /bin/ls`? What is the option name?
In this case I would set the region mangling limit to 5 and leave the others (>5) as is. Is that correct?
I could not see limitation checks inside `mangle_exclusive_monitor_op`.
Thanks, Kirill
I hacked it: I modified the DR source code, co-opting an existing option that isn't used for anything on Linux: `-os_aslr_version`. I added code at the top of `mangle_exclusive_monitor_op`, something like:

```c
static int count;
if (count++ > DYNAMO_OPTION(os_aslr_version))
    return NULL;
```

I think I also dumped the `ilist` (e.g., `instrlist_disassemble(dcontext, NULL, ilist, STDERR)`) for the block at the threshold of the bug (or for every block that reaches there: there probably aren't that many) to easily view the culprit ldex/stex code.
Hi. If I add such conditions to `mangle_exclusive_monitor_op`, I get hangs on a simple Java run; all threads are on the futex again.

```c
static int count = 0;
if (count > DYNAMO_OPTION(os_aslr_version)) {
    instrlist_disassemble(dcontext, instr, ilist, STDERR);
    return NULL;
}
if (instr_is_exclusive_store(instr))
    count++;
```
Looks like I have a lot of exclusive load/store regions with branches inside:

```
TAG  0x0000fff90ebe6d18
 +0    L3 @0x0000fff90ebe6b58  f9472a80   ldr    +0x0e50(%x20)[8byte] -> %x0
 +4    L3 @0x0000fff90ebf0d80  52800021   movz   $0x0001 lsl $0x00 -> %w1
 +8    L3 @0x0000fff90ebe6d18  885ffc02   ldaxr  (%x0)[4byte] -> %w2
 +12   L3 @0x0000fff90ebe6858  7100005f   subs   %w2 $0x0000 lsl $0x00 -> %wzr
 +16   L3 @0x0000fff90ebe89d8  54000061   b.ne   $0x0000ffff92783b10
 +20   L4 @0x0000fff90ebe7e98  14000000   b      $0x0000ffff92783b08
END 0x0000fff90ebe6d18
```
So, I mangle the store in one bb and don't mangle the load in another. Is there a way to disable splitting monitor regions? Kirill
Sorry, this got filtered out! My (late) response is to confirm @abhinav92003's explanation that it's probably due to generating code for far-link memory accesses between load/store exclusives.
We are seeing that SPECjvm 2008 runs won't even start the warm-up phase when launched with drrun. Typically SPECjvm runs may look like this:
With drrun we never get to this first message. I do see two threads running for a short period, but I'm not convinced the run is successful since it never gets to the warm-up and execution phases of the test, although memory utilization is roughly 11 GB, which is quite high for sparse.small.
Attached: debug level 3 log for the java pid java.log.zip
java.0.59824.zip