You are showing a SIGSEGV: is that mistakenly pasted from #4278?
What phase of the application is this? Is it running a ton of new code (hundreds of thousands of blocks, millions of blocks)? How many blocks are built over what time span? In release this might be something like (gdb) p d_r_stats->num_bbs_pair.value, but if this is not process init it would be best to compare to the value prior to this slow phase.
Time in sched_yield: that is unusual; is it from a synchall for reset?
Do these runtime options change anything: -disable_traces -no_enable_reset -vm_size 2G?
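For reference, one way to pass them, modeled on the drrun invocations used elsewhere in this thread (paths and app name are placeholders):
/media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/bin64/drrun -root /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1 -disable_traces -no_enable_reset -vm_size 2G -t drcov -logdir /media/mp000/dynamorio -dump_binary -- myapp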
Does running without drcov change anything?
Note that some others are also seeing some AArch64 issues scaling to larger apps but I don't think the symptoms were a match for this.
I only wanted to copy the command line. Corrected it.
The three options don't make a difference.
About (gdb) p d_r_stats->num_bbs_pair.value: in my test run it was 247490 about 3 seconds before it got stuck. Then it was 289916 and remained at that value for several minutes.
A snapshot with perf top:
Samples: 1M of event 'cpu-clock:pppH', Event count (approx.): 117614980150
Overhead Shared Object Symbol
34.29% [kernel] [k] _raw_spin_unlock_irq
19.00% libdynamorio.so [.] d_r_read_lock
10.57% [kernel] [k] cpuidle_enter_state
6.93% [kernel] [k] el0_svc_naked
4.23% [kernel] [k] _raw_spin_unlock_irqrestore
3.85% libdynamorio.so [.] dynamorio_syscall
1.04% libdynamorio.so [.] d_r_mutex_lock
0.91% libdynamorio.so [.] d_r_get_thread_id
0.49% [kernel] [k] sys_sched_yield
0.47% [kernel] [k] sched_text_start
0.46% [kernel] [k] fpsimd_load_state
0.35% [kernel] [k] delay
0.30% libsse.so [.] FFT_Float
0.26% [kernel] [k] schedule
0.21% [kernel] [k] __local_bh_enable_ip
0.20% [kernel] [k] _raw_spin_lock_irqsave
0.17% [kernel] [k] ret_fast_syscall
0.17% perf [.] 0x00000000000a2260
0.17% perf [.] 0x0000000000134924
0.17% perf [.] 0x0000000000095108
0.16% [kernel] [k] do_notify_resume
About the sched_yield issue, here is an excerpt from strace (tracing all events):
[pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0
Any chance of a callstack for a sched_yield instance?
[pid 3025] <... sched_yield resumed> ) = 0
> /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(dynamorio_syscall+0x2c) [0x12fd08]
^C > /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(d_r_read_lock+0xa8) [0x5b2a8]
> /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(fragment_lookup_fine_and_coarse+0x98) [0x48730]
> /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(d_r_dispatch+0xc50) [0x57d80]
> /lib/libpthread-2.25.so(pthread_cond_timedwait+0x204) [0xd2fc]
[pid 3019] <... sched_yield resumed> ) = 0
> /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(dynamorio_syscall+0x2c) [0x12fd08]
> /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(d_r_read_lock+0xa8) [0x5b2a8]
> /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(fragment_lookup_fine_and_coarse+0x98) [0x48730]
> /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(d_r_dispatch+0xc50) [0x57d80]
> /lib/libpthread-2.25.so(pthread_cond_timedwait+0x204) [0xd2fc]
Note: The program has multiple message queue threads that wait for new messages via pthread_cond_wait or pthread_cond_timedwait. Since there is a lot of messaging, some of these condition variables are signaled very often.
Is it possible to run the same application on x86 to try to see whether this is specific to arm?
Does -no_spin_yield_rwlock make a difference?
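To illustrate why the straces above show d_r_read_lock ending up in sched_yield, here is a minimal sketch in plain C (not DR's actual implementation) of a spin-yield reader/writer lock of the kind that option presumably toggles: readers back off with sched_yield while a writer is pending, and the writer yields until every in-flight reader drains.

#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int readers;            /* number of readers currently inside */
    atomic_bool writer_waiting;    /* a writer is waiting for, or holds, the lock */
} sketch_rwlock_t;                 /* initialize to all-zero */

static void sketch_read_lock(sketch_rwlock_t *rw)
{
    for (;;) {
        atomic_fetch_add(&rw->readers, 1);
        if (!atomic_load(&rw->writer_waiting))
            return;                /* fast path: no kernel entry at all */
        atomic_fetch_sub(&rw->readers, 1);
        sched_yield();             /* this is the call that floods the strace */
    }
}

static void sketch_read_unlock(sketch_rwlock_t *rw)
{
    atomic_fetch_sub(&rw->readers, 1);
}

/* Writer side (writer-vs-writer exclusion omitted for brevity). */
static void sketch_write_lock(sketch_rwlock_t *rw)
{
    atomic_store(&rw->writer_waiting, true);
    while (atomic_load(&rw->readers) != 0)
        sched_yield();             /* block builder stuck behind a stream of readers */
}

static void sketch_write_unlock(sketch_rwlock_t *rw)
{
    atomic_store(&rw->writer_waiting, false);
}

With a steady stream of readers (the fragment lookups in dispatch), the single writer building a new block can be starved, and every backing-off reader shows up as another sched_yield call.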
Building a ton of fragments is slow, but typically we see the decoding, encoding, etc. as the bottleneck, not locks, partly because most application code bursts we dealt with in the past were up-front initialization or phases where one thread goes and runs most of the new code on its own. There are probably opportunities to reduce lock contention, but it seems best to first understand the high-level picture.
The first post shows one block being created, and other threads waiting on locks in dispatch: the question there is, why are these other threads in dispatch? Is there a problem where they are exiting the cache when they shouldn't, or are they all running the new code in lockstep? Maybe look at debug build stats on cache exits and hope debug build behavior looks similar to release? Or hack the code to enable the exit-time stats and code in release.
I tried -no_spin_yield_rwlock. It did not change much in regard to perf top:
Samples: 1M of event 'cpu-clock:pppH', Event count (approx.): 95085246825
Overhead Shared Object Symbol
35.92% [kernel] [k] _raw_spin_unlock_irq
19.72% libdynamorio.so [.] d_r_read_lock
9.66% [kernel] [k] cpuidle_enter_state
6.98% [kernel] [k] el0_svc_naked
4.63% [kernel] [k] _raw_spin_unlock_irqrestore
4.09% libdynamorio.so [.] dynamorio_syscall
1.01% libdynamorio.so [.] d_r_get_thread_id
0.57% [kernel] [k] fpsimd_load_state
0.56% [kernel] [k] sched_text_start
0.46% [kernel] [k] sys_sched_yield
0.30% [kernel] [k] schedule
0.30% libsse.so [.] FFT_Float
0.24% [kernel] [k] _raw_spin_lock_irqsave
0.22% [kernel] [k] __local_bh_enable_ip
0.21% [kernel] [k] delay
0.18% [kernel] [k] ret_fast_syscall
0.17% perf [.] 0x00000000000a2260
0.15% libsse.so [.] StarRec_PAL_CplxFltMultSum
0.14% libc-2.25.so [.] memcpy
0.14% [kernel] [k] __arm_smccc_smc
0.13% libc-2.25.so [.] strcmp
I attached strace again. A lot of the sched_yield calls seem to be caused in part by exception handling:
[pid 2989] <... sched_yield resumed> ) = 0
/media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(dynamorio_syscall+0x2c) [0x12fd08]
/media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(d_r_read_lock+0xa8) [0x5b2a8]
/media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(fragment_lookup_type+0x80) [0x3cde8]
/media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(fragment_add_ibl_target+0x2d8) [0x48600]
/media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(d_r_dispatch+0x1310) [0x58440]
/usr/bin/myapp(_ZTv0_n32_NK5boost16exception_detail10clone_implINS0_19error_info_injectorINSt8ios_base7failureB5cxx11EEEE7rethrowEv+0x2be14) [0x2bc4394]
/usr/bin/myapp(_ZTv0_n32_NK5boost16exception_detail10clone_implINS0_19error_info_injectorINSt8ios_base7failureB5cxx11EEEE7rethrowEv+0x2d754) [0x2bc5cd4]
/usr/lib/libprotobuf.so.10.0.0(_ZN6google8protobuf11MessageLite20ParseFromCodedStreamEPNS0_2io16CodedInputStreamE+0x38) [0xf1880]
Since we are tracing code paths, and exceptions are a bit less predictable than regular control flow, could there be extra overhead associated with them?
Greetings,
since there hasn't been much progress on this issue for a while, I have a question: Most Linux mutex implementations make it so that the call doesn't go into the kernel if the Mutex isn't contended. Is it possible that dynamorio makes more syscalls than the native implementation would?
DR is the same: an atomic instruction tries to acquire the lock, and SYS_futex is only used on contention. There are also options to spin first. So I would not expect any more syscalls, no.
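For illustration, a minimal sketch of that standard Linux pattern (assuming the classic futex-based mutex; this is not DR's actual source): the uncontended acquire and release are single atomic operations, and SYS_futex is reached only when the lock is contended.

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 0 = free, 1 = held, 2 = held with waiters; initialize to 0. */
typedef struct { atomic_int state; } sketch_lock_t;

static void sketch_lock(sketch_lock_t *l)
{
    int expected = 0;
    /* Fast path: a single atomic compare-and-swap, no kernel entry. */
    if (atomic_compare_exchange_strong(&l->state, &expected, 1))
        return;
    /* Slow path: mark contended and sleep in the kernel until woken. */
    while (atomic_exchange(&l->state, 2) != 0)
        syscall(SYS_futex, &l->state, FUTEX_WAIT, 2, NULL, NULL, 0);
}

static void sketch_unlock(sketch_lock_t *l)
{
    /* Only issue the wake syscall if a waiter recorded contention. */
    if (atomic_exchange(&l->state, 0) == 2)
        syscall(SYS_futex, &l->state, FUTEX_WAKE, 1, NULL, NULL, 0);
}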
Unfortunately, with version 9.0.0 this problem has not yet been solved. Another output from perf top:
Samples: 1M of event 'cpu-clock:pppH', Event count (approx.): 106272578188
Overhead Shared Object Symbol
26.49% [kernel] [k] _raw_spin_unlock_irq
14.01% [kernel] [k] cpuidle_enter_state
11.36% libdynamorio.so [.] atomic_add_exchange_int
7.99% libdynamorio.so [.] ATOMIC_INC_int
5.54% [kernel] [k] _raw_spin_unlock_irqrestore
5.01% [kernel] [k] el0_svc_naked
2.99% libdynamorio.so [.] dynamorio_syscall
2.56% libdynamorio.so [.] deadlock_avoidance_lock
0.66% libdynamorio.so [.] is_thread_tls_initialized
0.66% libdynamorio.so [.] d_r_read_lock
0.51% [snd_pcm] [k] snd_pcm_stream_unlock_irq
0.49% libdynamorio.so [.] mutex_testlock
0.44% libdynamorio.so [.] get_tls_thread_id
0.44% libdynamorio.so [.] os_thread_yield
0.41% [kernel] [k] fpsimd_load_state
0.39% [kernel] [k] delay
0.37% [kernel] [k] sched_text_start
0.37% [snd_pcm] [k] snd_pcm_stream_unlock_irqrestore
0.31% libdynamorio.so [.] read_thread_register
0.29% [kernel] [k] sys_sched_yield
0.29% libdynamorio.so [.] rwlock_wait_contended_reader
0.29% [kernel] [k] _raw_spin_lock_irqsave
0.28% libsse.so [.] FFT_Float
0.24% [kernel] [k] schedule
0.24% libdynamorio.so [.] get_dr_tls_base_addr
0.22% libdynamorio.so [.] atomic_dec_becomes_zero
0.21% perf [.] 0x00000000000a2260
0.20% libdynamorio.so [.] d_r_get_thread_id
0.20% libc-2.25.so [.] memcpy
0.19% [kernel] [k] ret_fast_syscall
0.14% libc-2.25.so [.] strcmp
0.14% [kernel] [k] __fget
0.14% perf [.] 0x000000000013491c
0.13% perf [.] 0x00000000000e481c
0.13% [kernel] [k] do_notify_resume
deadlock_avoidance_lock only exists in debug build. Debug build does a lot of extra work and is expected to be slow and contended. Please evaluate release build only for performance.
Still hangs without debug flag:
Samples: 2M of event 'cpu-clock:pppH', Event count (approx.): 98419339809
Overhead Shared Object Symbol
34.06% [kernel] [k] _raw_spin_unlock_irq
19.23% libdynamorio.so [.] d_r_read_lock
11.20% [kernel] [k] cpuidle_enter_state
7.52% [kernel] [k] _raw_spin_unlock_irqrestore
6.48% [kernel] [k] el0_svc_naked
3.61% libdynamorio.so [.] dynamorio_syscall
0.98% libdynamorio.so [.] d_r_get_thread_id
0.58% [kernel] [k] fpsimd_load_state
0.45% [kernel] [k] sched_text_start
0.38% [kernel] [k] sys_sched_yield
0.36% [kernel] [k] _raw_spin_lock_irqsave
0.33% [kernel] [k] schedule
0.30% libsse.so [.] FFT_Float
0.26% perf [.] 0x000000000013491c
0.23% [kernel] [k] ret_fast_syscall
0.21% [kernel] [k] delay
0.20% [kernel] [k] local_bh_enable_ip
0.19% [kernel] [k] do_notify_resume
0.17% perf [.] 0x00000000000a2260
0.16% perf [.] 0x0000000000095108
0.15% [kernel] [k] fpsimd_restore_current_state
0.15% [snd_pcm] [k] snd_pcm_stream_unlock_irq
0.13% [kernel] [k] arm_smccc_smc
0.13% libc-2.25.so [.] strcmp
0.13% libsse.so [.] sseAnalysisGetFFTInput
0.12% libc-2.25.so [.] memcpy
0.11% libsse.so [.] StarRec_PAL_CplxFltMultSum
0.11% libdynamorio.so [.] get_dr_tls_base_addr
0.11% perf [.] 0x00000000000e481c
0.09% libsse.so [.] StarRec_PAL_CplxFltMultConjSum
0.09% perf [.] 0x00000000000e3dac
0.09% [kernel] [k] load_balance
0.09% [kernel] [k] __fget
0.09% [snd_pcm] [k] snd_pcm_stream_unlock_irqrestore
0.09% perf [.] 0x00000000000df324
0.09% libspeexdsp.so.1.5.0 [.] speex_resampler_process_int
0.08% [kernel] [k] memcpy
0.08% libpthread-2.25.so [.] pthread_mutex_lock
0.08% [kernel] [k] test_and_clear_bit
0.07% libspeexdsp.so.1.5.0 [.] 0x0000000000006bcc
0.07% libc-2.25.so [.] memset
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -t drcov -logdir /media/mp001/dynamorio -dump_binary -- myapp
Attaching with gdb shows that some threads are in dynamorio_syscall. One is in encode_opndsgen_3d000000_003fffff in instr_inline_api.h:379
During an strace run, a segfault appeared:
DrCov internal crash at PC 0x000000007117198c. Please report this at http://dynamorio.org/issues. Program aborted.
Received SIGSEGV at pc 0x000000007117198c in thread 2753
Base: 0x0000000071000000
Registers: eflags=0x0000000060000000
version 9.0.0, custom build
-no_dynamic_options -client_lib '/media/mp000/DynamoRIO-AArch64-Linux-9.0.0/tools/lib64/release/libdrcov.so;0;"-dump_binary"' -client_lib64 '/media/mp000/DynamoRIO-AArch64-Linux-9.0.0/tools/lib64/release/libdrcov.so;0;"-dump_binary"' -client_lib32 '/media/mp000/DynamoRIO-AArch64-Linux-9.0.0/tools/lib32/release/libdrcov.s
Another strace run. A large number of sched_yield calls is still being produced, and the app apparently does not recover from it.
[pid 3201] <... sched_yield resumed> ) = 0
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(dynamorio_syscall+0x2c) [0x180200]
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(d_r_read_lock+0xac) [0x3946c]
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(fragment_add_ibl_target+0x444) [0x27cd4]
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(d_r_dispatch+0x13d4) [0x3683c]
/usr/lib/libglslcompiler.so(PostEvaluateNode+0x360) [0x4d280]
/usr/lib/libglslcompiler.so(PostEvaluateNode+0x3dc) [0x4d2fc]
/usr/lib/libglslcompiler.so(PostEvaluateNode+0x3dc) [0x4d2fc]
/usr/lib/libglslcompiler.so(ICProcessNodeOperand+0xb8) [0x4d458]
/usr/lib/libglslcompiler.so(ProcessNodeBasicOp+0x6c) [0x4e7fc]
/usr/lib/libglslcompiler.so(ProcessNode+0x2b4) [0x4ed34]
/usr/lib/libglslcompiler.so(ProcessNode+0xc3c) [0x4f6bc]
/usr/lib/libglslcompiler.so(ICTraverseAST+0xbc) [0x5210c]
/usr/lib/libglslcompiler.so(ICTraverseAST+0xe34) [0x52e84]
/usr/lib/libglslcompiler.so(ICTraverseAST+0xe34) [0x52e84]
/usr/lib/libglslcompiler.so(ICTraverseAST+0xe34) [0x52e84]
/usr/lib/libglslcompiler.so(ICTraverseAST+0x10e4) [0x53134]
/usr/lib/libglslcompiler.so(GenerateICodeProgram+0x11c) [0x551fc]
/usr/lib/libglslcompiler.so(GLSLCompileToIntermediateCode+0x520) [0x5f060]
/usr/lib/libglslcompiler.so(GLSLCompileToUniflex+0x240) [0x9c6a0]
/usr/lib/libGLESv2.so(DoCompileShader+0x2d0) [0x610e0]
/usr/lib/libGLESv2.so(RecompileShader+0x4c) [0x67c9c]
/usr/lib/libGLESv2.so(GetRecompiledShaderVariantInternal+0x2fc) [0x694ec]
/usr/lib/libGLESv2.so(ProgramOptimisationRecompilationTask+0x163c) [0x6cf6c]
/usr/lib/libsrv_um.so(DeferredWorker+0xdc) [0x187fc]
/lib/libpthread-2.25.so(pthread_get_minstack+0x13d0) [0x6ef8]
/lib/libpthread-2.25.so(pthread_get_minstack+0x13d0) [0x6ef8]
[pid 3197] <... sched_yield resumed> ) = 0
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(dynamorio_syscall+0x2c) [0x180200]
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(d_r_read_lock+0xac) [0x3946c]
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(fragment_lookup_fine_and_coarse+0xc4) [0x27ec4]
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(d_r_dispatch+0x584) [0x359ec]
Btw, will DynamoRIO typically increase the amount of stack memory a thread needs? Because I'm tight on that.
Still hangs without debug flag:
For an actual hang, getting symbolic callstacks for every thread is best to understand what's going on (deadlock? which lock? etc.).
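For example, something like this from gdb (illustrative; substitute the application's pid):
gdb -p <pid>
(gdb) thread apply all bt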
During an strace run, a segfault appeared:
Symbolic callstack is again best, and once the crash details are known, if there is no existing issue, best to file a new issue.
Btw, will DynamoRIO typically increase the amount of stack memory a thread needs? Because I'm tight on that.
No, DR's stacks are statically sized. -stack_size controls the size.
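For example (the 64K value is purely illustrative, and this assumes the option accepts the same size suffixes as -vm_size above), added to the drrun line used elsewhere in this thread:
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -stack_size 64K -t drcov -logdir /media/mp001/dynamorio -dump_binary -- /usr/bin/myapp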
Still hangs without debug flag:
Xref things like #5245, #5247
Does -no_ldstex2cas make a difference?
-no_ldstex2cas didn't work.
I made another thread apply all bt full dump in gdb, but I had to remove our symbols.
Ok, it does progress marginally further on average when using the flag.
The final state is hard to capture because the system slows down until it is unusable.
aarch64 (6 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          31.76   0.04    45.27     7.55    0.00  15.38
Device:     tps   Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
mmcblk0     12.31     504.73        7.82   1372371     21256
mmcblk0p3    0.12       1.26        0.90      3418      2448
mmcblk0p4    0.07       0.40        0.41      1083      1104
mmcblk0p5    0.01       0.14        0.04       386        96
mmcblk0p6    5.61     327.86        0.06    891450       168
mmcblk0p8    6.49     175.05        6.41    475954     17440
dm-0         0.06       0.13        0.01       352        28
sda          0.64       5.86        0.02     15934        64
sdb          6.58      97.33        0.02    264645        48
sdb1         6.58      97.33        0.02    264637        48
Is an iowait of 7.55% normal?
It also has phases where perf top shows _raw_spin_unlock_irq at nearly 100%.
Edit: I restricted it to 4 CPUs now, so the system remains usable and the app completely hogs ALL of them.
newdynamoriologging.txt looks like the very first comment in this issue: one thread trying to get the fragment hashtable write lock and waiting for a bunch of readers (with a whole bunch of readers running debug-build-only code).
4, 5, 11, 43, 44, 49, 53, 61, 67, 74, 81, 87, 93, 101, 105, 111, 117, 121, 132, 141, 144, 164, 165, 168, 169: read fragment lookup: debug-only
12, 14, 29, 39, 62: read fragment lookup: dispatch
170: write lock: fragment_add for build_basic_block_fragment
180: unlock enter_couldbelinking
So we're back to the comment above https://github.com/DynamoRIO/dynamorio/issues/4279#issuecomment-625914990 where I would re-suggest checking the exit reasons in the debug stats as this seems like a lot of time spent in dispatch if this is a period of little block building and it should be steady-state. Maybe there is some problem causing threads to exit the cache when they shouldn't and there's always a reader, blocking the rare writers who are building new blocks.
Look for the Fcache exits stats, e.g. at the end of a thread log file:
Fcache exits, total (thread): 1422
Fcache exits, system call executions (thread): 108
Fcache exits, from traces (thread): 160
Fcache exits, from BBs (thread): 1154
Fcache exits, total indirect branches (thread): 548
Fcache exits, non-trace indirect branches (thread): 446
Fcache exits, ind target not in cache (thread): 69
...
I would also run release build with -prof_pcs to see where time is spent, to confirm it's a problem with exits to dispatch. The log dir will have output files with a summary and then a detailed list. The summary looks like:
ITIMER distribution (20):
70.0% of time in INDIRECT BRANCH LOOKUP (14)
30.0% of time in FRAGMENT CACHE (6)
Will a short run suffice or do I have to go to the very end? During the late stages, the machine doesn't really respond anymore.
Btw, I made a run with the same app on an x64 Linux desktop and it works without problems.
Edit: I don't use the -debug flag. Why does it still run debug code? Can you tell me again the name of the function that indicates a debug build?
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -debug -loglevel 2 -prof_pcs -prof_pcs_heap_size 240000 -t drcov -logdir /media/mp001/dynamorio -dump_binary -- /usr/bin/myapp (code=exited, status=0/SUCCESS)
Aborted program after 2 minutes or so.
I had to be careful with proprietary info and only included the "Fcache exits" entries.
I have another run in the pipeline for the ITIMER distribution but it may take a while because the program slows down.
Ok, here we are: pcsamples.0.3804.html.zip
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -prof_pcs -prof_pcs_heap_size 240000 -t drcov -logdir /media/mp001/dynamorio -dump_binary -- /usr/bin/myapp
In steady state you want at least 90-95% of time in FRAGMENT CACHE, so this profile indicates massive overhead:
ITIMER distribution (11161):
11.8% of time in INTERPRETER (1317)
80.8% of time in DISPATCH (9013)
0.1% of time in MONITOR (6)
0.5% of time in SYSCALL HANDLER (55)
0.2% of time in CONTEXT SWITCH (22)
1.7% of time in INDIRECT BRANCH LOOKUP (186)
5.0% of time in FRAGMENT CACHE (557)
0.0% of time in UNKNOWN (5)
That's for ~280K basic blocks I assume based on the stats from https://github.com/DynamoRIO/dynamorio/issues/4279#issuecomment-622561977
11161 hits @ 10ms each is 112 seconds.
During periods of a lot of new code being executed you would expect overhead and INTERPRETER time, but not this massive DISPATCH time.
Block building is not super fast: IIRC DR takes about 1s to build 100K blocks. So ~300K blocks might take at least 3s to build. But the rest here seems to be the blocks being shared among many threads right from the start?
I would try -thread_private and see what happens.
Another thing to look at is KSTATS to understand where time is going. E.g., from a debug -checklevel 0 (to speed up debug build) run's log file:
Process KSTATS:
thread_measured:26504473804 totc, 17 num, 212466228 minc,1559086694 avg,23016297728 maxc, 16944740 self,26487529064 sub,
8319 ms, 0 ms out,total measured and propagated in thread
bb_building:23023119868 totc, 140634 num, 15652 minc, 163709 avg, 1901178 maxc,2302651016 self,20720468852 sub,
7226 ms, 0 ms out,in bb building
bb_decoding:2600946572 totc, 140634 num, 1872 minc, 18494 avg, 1459506 maxc,2600946572 self, 0 sub,
816 ms, 0 ms out,in bb decoding
bb_emit:15430386644 totc, 140634 num, 8546 minc, 109720 avg, 1852416 maxc, 80126850 self,15350259794 sub,
4843 ms, 0 ms out,in emitting BB
mangling:2689135636 totc, 140634 num, 944 minc, 19121 avg, 1004816 maxc,2689135636 self, 0 sub,
844 ms, 0 ms out,in mangling
emit:15350510724 totc, 140639 num, 7896 minc, 109148 avg, 1851802 maxc,15350510724 self, 0 sub,
4818 ms, 0 ms out,in emit
trace_building: 941456 totc, 19 num, 10082 minc, 49550 avg, 151324 maxc, 366008 self, 575448 sub,
0 ms, 0 ms out,in trace building
...
That's for ~280K basic blocks I assume based on the stats from #4279 (comment)
I cannot exactly reproduce the situation from an earlier run: I have to guess when to stop, since it doesn't even get past startup and I have to cancel it to get usable stats.
Run with thread_private:
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -thread_private -prof_pcs -prof_pcs_heap_size 240000 -t drcov -logdir /media/mp001/dynamorio -dump_binary -- /usr/bin/myapp
Last output from perf top before the system slows down to a crawl:
Samples: 916K of event 'cpu-clock:pppH', Event count (approx.): 101452777118
Overhead Shared Object Symbol
31.14% [kernel] [k] cpuidle_enter_state
14.65% libdynamorio.so [.] d_r_mutex_lock
6.06% [kernel] [k] _raw_spin_unlock_irq
1.65% [kernel] [k] _raw_spin_unlock_irqrestore
1.15% libdynamorio.so [.] d_r_read_lock
0.76% [kernel] [k] el0_svc_naked
0.69% libdynamorio.so [.] instr_writes_to_exact_reg
0.52% [snd_pcm] [k] snd_pcm_stream_unlock_irq
0.45% libdynamorio.so [.] hashtable_fragment_lookup.isra.0
0.40% libdynamorio.so [.] instr_reg_in_src
0.39% perf [.] 0x0000000000095108
0.37% [kernel] [k] arm_smccc_smc
0.34% libdynamorio.so [.] dynamorio_syscall
0.33% libjvm.so [.] 0x00000000009d2e08
0.25% libdrreg.so [.] drreg_event_bb_analysis
0.25% libdynamorio.so [.] opnd_uses_reg
0.25% libc-2.25.so [.] memcpy
0.25% [kernel] [k] delay
0.24% libsse.so [.] FFT_Float
0.23% perf [.] 0x0000000000095104
0.23% libdrreg.so [.] drvector_set_entry
0.23% libdynamorio.so [.] acquire_recursive_lock
0.22% libdynamorio.so [.] instr_reads_from_reg
0.21% libdynamorio.so [.] decode_common
I will try to produce logs after a successful abort.
I ran the app with the following command line:
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -debug -checklevel 0 -thread_private -prof_pcs -prof_pcs_heap_size 240000 -t drcov -logdir /media/mp001/dynamorio -dump_binary -- /usr/bin/myapp
But none of the logs seems to include a KSTATS section. pcsamples is even empty. As usual, it was very hard to end the program; it took 15 minutes after my Ctrl-C.
The pcsamples and the process-wide KSTATS are printed at the end. They should still be printed on any app exit (including crash) though; probably not on a DR crash since it assumes it can't trust its own internal state.
The other thing is that there are lock contention stats in the debug logs which can help figure out which locks are having problems:
25 lock 0x00007f589ef38420: name=bb_building_lock(mutex)@/home/bruening/dr/git/src/core/arch/interp.c:119
rank=5 owner=0 owning_dc=0x0000000000000000 contended_event=0xffffffff prev=0x00007f569e9a46c0
lock count_times_acquired= 641 count_times_contended= 43 0 count_times_spin_only= 30 max_contended_requests= 1+2 bb_building_lock(mutex)@/home/bruening/dr/git/src/core/arch/interp.c:119
I would look at the pcsamples for -thread_private and would expect the dispatch problem to have disappeared, and that lock contention has also disappeared as a problem, maybe replaced by memory usage and duplication of work building the same blocks for each thread.
I thought there was an explicit issue on possibly splitting the bb_building_lock up but I can't find it now. When we first created thread-shared caches we started with simple global locks and split up the ones that had contention issues, and were surprised to learn that keeping a single global bb_building_lock worked well for all our workloads. Maybe that is not the case here: but it seems best to prove to ourselves that this is indeed the problem. Adding the lock contention stats to the above callstacks piling up on the lock would help there, along with the thread_private stats. Having thread_private also perform horribly does not help the theory that it's the global locks: unless we can prove that the problems with thread_private are something unique to that mode (otherwise there could be some other issue that is shared across private and shared causing all the problems).
From the callstacks it wasn't necessarily the bb_building_lock contention: it was the shared block hashtable. Though if the underlying problematic behavior here is that all threads run the exact same large set of new code at the same time, then once they get through their hashtable lookups they'll all want to build the same blocks next. But maybe removing this builder waiting for all the readers would still be an improvement, even if the underlying issue of widely shared initialization code will always have some overhead.
Re: the prior comment: but -thread_private should get rid of that: so understanding why -thread_private has just as much overhead seems key to forward progress.
A screenshot of a perf trace: it seems to spend most of the time in d_r_dispatch itself. Can we be sure it enters that loop only once most of the time?
I assume you're still looking at thread-shared. Please analyze thread-private per the discussion, as it being slow is confusing. Repeating earlier questions:
No, the recent runs all use thread-private. I have problems producing another pcsamples output because the program slows down so much that I cannot shut it down properly.
I believe that stack trace in https://github.com/DynamoRIO/dynamorio/issues/4279#issuecomment-1020278239 with fragment_lookup_type acquiring a lock is impossible for -thread_private: look at the code at https://github.com/DynamoRIO/dynamorio/blob/master/core/fragment.c#L2625. -shared_traces and -shared_bbs are false for -thread_private. If you could verify that -thread_private is actually set for that run, and if so, please debug what is going on, since it seems impossible.
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -thread_private -prof_pcs -prof_pcs_heap_size 240000 -t drcov -logdir /media/mp001/dynamorio -dump_binary -- /usr/bin/myapp
This command line was used for #4279.
I set a breakpoint in gdb at fragment.c:2625 and it stops regularly at fragment.c:2718, probably because fragment_lookup_type is static. Can you name a line at which it should absolutely not stop if thread_private does what it should?
Edit: When I attach with gdb I almost immediately get a segfault here:
https://github.com/DynamoRIO/dynamorio/blob/master/core/utils.c#L1109
at /home/runner/work/dynamorio/dynamorio/core/utils.c:1109
No locals.
lookup_flags=15,
tag=0x7da03a83d4 <error: Cannot access memory at address 0x7da03a83d4>)
at /home/runner/work/dynamorio/dynamorio/core/fragment.c:2700
f = <optimized out>
tag=0x7da03a83d4 <error: Cannot access memory at address 0x7da03a83d4>,
dcontext=0x7da0808080)
at /home/runner/work/dynamorio/dynamorio/core/fragment.c:2625
f = <optimized out>
tag=0x7da03a83d4 <error: Cannot access memory at address 0x7da03a83d4>,
dcontext=0x7da0808080)
at /home/runner/work/dynamorio/dynamorio/core/fragment.c:2718
No locals.
tag=0x7da03a83d4 <error: Cannot access memory at address 0x7da03a83d4>,
wrapper=wrapper@entry=0x7da082efc8, last_exit=0x7daf3bd758)
at /home/runner/work/dynamorio/dynamorio/core/fragment.c:8066
res = <optimized out>
at /home/runner/work/dynamorio/dynamorio/core/dispatch.c:1453
next_f = <optimized out>
last_f = <optimized out>
coarse_f = <error reading variable coarse_f (Cannot access memory at address 0x7da082efc8)>
at /home/runner/work/dynamorio/dynamorio/core/dispatch.c:880
wherewasi = DR_WHERE_FCACHE
at /home/runner/work/dynamorio/dynamorio/core/dispatch.c:160
targetf = <optimized out>
coarse_f = <error reading variable coarse_f (Cannot access memory at address 0x7da082efc8)>
No symbol table info available.
No symbol table info available.
As mentioned, -shared_traces and -shared_bbs are off, so it shouldn't go into the if statements in fragment_lookup_type at lines 2682 or 2696. It shouldn't call d_r_read_lock at all from fragment_lookup_type. That's the whole point of thread-private: there are no locks for routine operations on private code caches.
See the edit to my last comment.
Did you sanity-check the options? Print dynamo_options.thread_private and dynamo_options.shared_bbs. Is this the same pid as the drrun process, or some child or grandchild? Is there a config file for the children overriding the parent options?
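For example, from an attached gdb session (same style as the stats query earlier in this thread):
(gdb) p dynamo_options.thread_private
(gdb) p dynamo_options.shared_bbs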
The application uses pthreads with LWPs; each thread has its own LWP. I did not change any of the DynamoRIO config files, if that is what you mean. I shall do another run to read out the variables.
BTW: is it possible that DynamoRIO has problems with gdb? Valgrind does.
Ok, this looks interesting: Setting a breakpoint in fragment.c:2718 indicates both dynamo_options.thread_private and dynamo_options.shared_bbs being true.
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -thread_private -t drcov -logdir /media/mp001/dynamorio -dump_binary -- /usr/bin/myapp
/* PR 361894: if no TLS available, we fall back to thread-private */
PC_OPTION_DEFAULT(bool, shared_bbs, IF_HAVE_TLS_ELSE(true, false), "use thread-shared basic blocks")
Are you sure shared_bbs is false by default? My AArch64 CPU has TLS.
Edit: Even when adding -no_shared_bbs, dynamo_options.shared_bbs is still true at 2718.
-shared_bbs is true by default, but -thread_private turns it off.
I see this issue: #1884. It looks like nobody ever added private cache support to ARM or AArch64. I assume debug build gives an error message, and all the above runs trying to enable -thread_private were release build, which corrects some options and continues (a legacy of the old days of running Windows services, where having a fatal error in release build bring down the machine over a misspelled option was considered unacceptable; maybe that should be changed now).
Do you have an explanation for the segfaults that occur only when I use gdb?
Did you rule out safe reads and other expected behavior? https://dynamorio.org/page_debugging.html#autotoc_md140
Yeah, that's probably it.
Recently there were several fixes that might have an impact here, especially those adding missing load-acquire barriers in mutex spinlock code and in release-build statistics code, which could explain slowdowns and hangs. Could you try your app with PR #5367, PR #5370, and PR #5375?
I'm out of office for the weekend, but I'll happily try on Monday. Can you tell me when the next weekly build will arrive?
Can you tell me when the next weekly build will arrive?
Friday evenings -- so conveniently tonight's will include all 3 of those fixes.
I tested DynamoRIO-AArch64-Linux-8.0.0-1 and drcov with a large application of about 150 threads on Yocto Linux. The program seems to get stuck at certain points. Attaching with gdb shows that the program still runs but contends on a mutex that has to do with fragment writing/reading:
DynamoRIO-AArch64-Linux-8.0.0-1
The readers (dozens):
The writer who slows everything down:
Also, strace shows an above average time spent in sched_yield.
Derek Bruening told me to do some profiling in https://github.com/DynamoRIO/dynamorio/issues/4278. Please tell me what you need to know.