You are showing a SIGSEGV: is that mistakenly pasted from #4278?
What phase of the application is this? Is it running a ton of new code (hundreds of thousands of blocks, millions of blocks)? How many blocks are built over what time span? In release this might be something like (gdb) p d_r_stats->num_bbs_pair.value, but if this is not process init it would be best to compare to the value prior to this slow phase.
Time in sched_yield: that is unusual; is it from a synchall for reset?
Do these runtime options change anything: -disable_traces -no_enable_reset -vm_size 2G?
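For reference, one way to pass them, modeled on the drrun invocations used elsewhere in this thread (paths and app name are placeholders):
/media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/bin64/drrun -root /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1 -disable_traces -no_enable_reset -vm_size 2G -t drcov -logdir /media/mp000/dynamorio -dump_binary -- myapp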
Does running without drcov change anything?
Note that some others are also seeing some AArch64 issues scaling to larger apps but I don't think the symptoms were a match for this.
I only wanted to copy the command line. Corrected it.
The three options don't make a difference.
About (gdb) p d_r_stats->num_bbs_pair.value: in my test run it was 247490 about 3 seconds before it got stuck. Then it was 289916 and remained at that value for several minutes.
A snapshot with perf top:
Samples: 1M of event 'cpu-clock:pppH', Event count (approx.): 117614980150
Overhead Shared Object Symbol
34.29% [kernel] [k] _raw_spin_unlock_irq
19.00% libdynamorio.so [.] d_r_read_lock
10.57% [kernel] [k] cpuidle_enter_state
6.93% [kernel] [k] el0_svc_naked
4.23% [kernel] [k] _raw_spin_unlock_irqrestore
3.85% libdynamorio.so [.] dynamorio_syscall
1.04% libdynamorio.so [.] d_r_mutex_lock
0.91% libdynamorio.so [.] d_r_get_thread_id
0.49% [kernel] [k] sys_sched_yield
0.47% [kernel] [k] sched_text_start
0.46% [kernel] [k] fpsimd_load_state
0.35% [kernel] [k] delay
0.30% libsse.so [.] FFT_Float
0.26% [kernel] [k] schedule
0.21% [kernel] [k] __local_bh_enable_ip
0.20% [kernel] [k] _raw_spin_lock_irqsave
0.17% [kernel] [k] ret_fast_syscall
0.17% perf [.] 0x00000000000a2260
0.17% perf [.] 0x0000000000134924
0.17% perf [.] 0x0000000000095108
0.16% [kernel] [k] do_notify_resume
About the sched_yield issue, here is an excerpt from strace (tracing all events):
[pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0 [pid 2510] sched_yield() = 0
Any chance of a callstack for a sched_yield instance?
[pid 3025] <... sched_yield resumed> ) = 0
> /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(dynamorio_syscall+0x2c) [0x12fd08]
^C > /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(d_r_read_lock+0xa8) [0x5b2a8]
> /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(fragment_lookup_fine_and_coarse+0x98) [0x48730]
> /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(d_r_dispatch+0xc50) [0x57d80]
> /lib/libpthread-2.25.so(pthread_cond_timedwait+0x204) [0xd2fc]
[pid 3019] <... sched_yield resumed> ) = 0
> /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(dynamorio_syscall+0x2c) [0x12fd08]
> /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(d_r_read_lock+0xa8) [0x5b2a8]
> /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(fragment_lookup_fine_and_coarse+0x98) [0x48730]
> /media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(d_r_dispatch+0xc50) [0x57d80]
> /lib/libpthread-2.25.so(pthread_cond_timedwait+0x204) [0xd2fc]
Note: The program has multiple message queue threads that wait for new messages via pthread_cond_wait or pthread_cond_timedwait. Since there is a lot of messaging, some of these condition variables are signaled very often.
Is it possible to run the same application on x86 to try to see whether this is specific to arm?
Does -no_spin_yield_rwlock make a difference?
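To illustrate why the straces above show d_r_read_lock ending up in sched_yield, here is a minimal sketch in plain C (not DR's actual implementation) of a spin-yield reader/writer lock of the kind that option presumably toggles: readers back off with sched_yield while a writer is pending, and the writer yields until every in-flight reader drains.

#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int readers;            /* number of readers currently inside */
    atomic_bool writer_waiting;    /* a writer is waiting for, or holds, the lock */
} sketch_rwlock_t;                 /* initialize to all-zero */

static void sketch_read_lock(sketch_rwlock_t *rw)
{
    for (;;) {
        atomic_fetch_add(&rw->readers, 1);
        if (!atomic_load(&rw->writer_waiting))
            return;                /* fast path: no kernel entry at all */
        atomic_fetch_sub(&rw->readers, 1);
        sched_yield();             /* this is the call that floods the strace */
    }
}

static void sketch_read_unlock(sketch_rwlock_t *rw)
{
    atomic_fetch_sub(&rw->readers, 1);
}

/* Writer side (writer-vs-writer exclusion omitted for brevity). */
static void sketch_write_lock(sketch_rwlock_t *rw)
{
    atomic_store(&rw->writer_waiting, true);
    while (atomic_load(&rw->readers) != 0)
        sched_yield();             /* block builder stuck behind a stream of readers */
}

static void sketch_write_unlock(sketch_rwlock_t *rw)
{
    atomic_store(&rw->writer_waiting, false);
}

With a steady stream of readers (the fragment lookups in dispatch), the single writer building a new block can be starved, and every backing-off reader shows up as another sched_yield call.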
Building a ton of fragments is slow, but typically we see the decoding, encoding, etc. as the bottleneck, not locks, partly because most application code bursts we dealt with in the past were up-front initialization or phases where one thread goes and runs most of the new code on its own. There are probably opportunities to reduce lock contention, but it seems best to first understand the high-level picture.
The first post shows one block being created, and other threads waiting on locks in dispatch: the question there is, why are these other threads in dispatch? Is there a problem where they are exiting the cache when they shouldn't, or are they all running the new code in lockstep? Maybe look at debug build stats on cache exits and hope debug build behavior looks similar to release? Or hack the code to enable the exit-time stats and code in release.
I tried -no_spin_yield_rwlock. It did not change much in regard to perf top:
Samples: 1M of event 'cpu-clock:pppH', Event count (approx.): 95085246825
Overhead Shared Object Symbol
35.92% [kernel] [k] _raw_spin_unlock_irq
19.72% libdynamorio.so [.] d_r_read_lock
9.66% [kernel] [k] cpuidle_enter_state
6.98% [kernel] [k] el0_svc_naked
4.63% [kernel] [k] _raw_spin_unlock_irqrestore
4.09% libdynamorio.so [.] dynamorio_syscall
1.01% libdynamorio.so [.] d_r_get_thread_id
0.57% [kernel] [k] fpsimd_load_state
0.56% [kernel] [k] sched_text_start
0.46% [kernel] [k] sys_sched_yield
0.30% [kernel] [k] schedule
0.30% libsse.so [.] FFT_Float
0.24% [kernel] [k] _raw_spin_lock_irqsave
0.22% [kernel] [k] __local_bh_enable_ip
0.21% [kernel] [k] delay
0.18% [kernel] [k] ret_fast_syscall
0.17% perf [.] 0x00000000000a2260
0.15% libsse.so [.] StarRec_PAL_CplxFltMultSum
0.14% libc-2.25.so [.] memcpy
0.14% [kernel] [k] __arm_smccc_smc
0.13% libc-2.25.so [.] strcmp
I attached strace again. A lot of the sched_yield calls seem to be caused in part by exception handling:
[pid 2989] <... sched_yield resumed> ) = 0
/media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(dynamorio_syscall+0x2c) [0x12fd08]
/media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(d_r_read_lock+0xa8) [0x5b2a8]
/media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(fragment_lookup_type+0x80) [0x3cde8]
/media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(fragment_add_ibl_target+0x2d8) [0x48600]
/media/mp000/DynamoRIO-AArch64-Linux-8.0.0-1/lib64/release/libdynamorio.so(d_r_dispatch+0x1310) [0x58440]
/usr/bin/myapp(_ZTv0_n32_NK5boost16exception_detail10clone_implINS0_19error_info_injectorINSt8ios_base7failureB5cxx11EEEE7rethrowEv+0x2be14) [0x2bc4394]
/usr/bin/myapp(_ZTv0_n32_NK5boost16exception_detail10clone_implINS0_19error_info_injectorINSt8ios_base7failureB5cxx11EEEE7rethrowEv+0x2d754) [0x2bc5cd4]
/usr/lib/libprotobuf.so.10.0.0(_ZN6google8protobuf11MessageLite20ParseFromCodedStreamEPNS0_2io16CodedInputStreamE+0x38) [0xf1880]
Since we are tracing code paths, and exceptions are a bit less predictable than regular control flow, could there be extra overhead associated with them?
Greetings,
since there hasn't been much progress on this issue for a while, I have a question: Most Linux mutex implementations make it so that the call doesn't go into the kernel if the Mutex isn't contended. Is it possible that dynamorio makes more syscalls than the native implementation would?
DR is the same: an atomic instruction tries to acquire the lock, and SYS_futex is only used on contention. There are also options to spin first. So I would not expect any more syscalls, no.
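For illustration, a minimal sketch of that standard Linux pattern (assuming the classic futex-based mutex; this is not DR's actual source): the uncontended acquire and release are single atomic operations, and SYS_futex is reached only when the lock is contended.

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 0 = free, 1 = held, 2 = held with waiters; initialize to 0. */
typedef struct { atomic_int state; } sketch_lock_t;

static void sketch_lock(sketch_lock_t *l)
{
    int expected = 0;
    /* Fast path: a single atomic compare-and-swap, no kernel entry. */
    if (atomic_compare_exchange_strong(&l->state, &expected, 1))
        return;
    /* Slow path: mark contended and sleep in the kernel until woken. */
    while (atomic_exchange(&l->state, 2) != 0)
        syscall(SYS_futex, &l->state, FUTEX_WAIT, 2, NULL, NULL, 0);
}

static void sketch_unlock(sketch_lock_t *l)
{
    /* Only issue the wake syscall if a waiter recorded contention. */
    if (atomic_exchange(&l->state, 0) == 2)
        syscall(SYS_futex, &l->state, FUTEX_WAKE, 1, NULL, NULL, 0);
}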
Unfortunately, with version 9.0.0 this problem has not yet been solved. Another output from perf top:
Samples: 1M of event 'cpu-clock:pppH', Event count (approx.): 106272578188
Overhead Shared Object Symbol
26.49% [kernel] [k] _raw_spin_unlock_irq
14.01% [kernel] [k] cpuidle_enter_state
11.36% libdynamorio.so [.] atomic_add_exchange_int
7.99% libdynamorio.so [.] ATOMIC_INC_int
5.54% [kernel] [k] _raw_spin_unlock_irqrestore
5.01% [kernel] [k] el0_svc_naked
2.99% libdynamorio.so [.] dynamorio_syscall
2.56% libdynamorio.so [.] deadlock_avoidance_lock
0.66% libdynamorio.so [.] is_thread_tls_initialized
0.66% libdynamorio.so [.] d_r_read_lock
0.51% [snd_pcm] [k] snd_pcm_stream_unlock_irq
0.49% libdynamorio.so [.] mutex_testlock
0.44% libdynamorio.so [.] get_tls_thread_id
0.44% libdynamorio.so [.] os_thread_yield
0.41% [kernel] [k] fpsimd_load_state
0.39% [kernel] [k] delay
0.37% [kernel] [k] sched_text_start
0.37% [snd_pcm] [k] snd_pcm_stream_unlock_irqrestore
0.31% libdynamorio.so [.] read_thread_register
0.29% [kernel] [k] sys_sched_yield
0.29% libdynamorio.so [.] rwlock_wait_contended_reader
0.29% [kernel] [k] _raw_spin_lock_irqsave
0.28% libsse.so [.] FFT_Float
0.24% [kernel] [k] schedule
0.24% libdynamorio.so [.] get_dr_tls_base_addr
0.22% libdynamorio.so [.] atomic_dec_becomes_zero
0.21% perf [.] 0x00000000000a2260
0.20% libdynamorio.so [.] d_r_get_thread_id
0.20% libc-2.25.so [.] memcpy
0.19% [kernel] [k] ret_fast_syscall
0.14% libc-2.25.so [.] strcmp
0.14% [kernel] [k] __fget
0.14% perf [.] 0x000000000013491c
0.13% perf [.] 0x00000000000e481c
0.13% [kernel] [k] do_notify_resume
deadlock_avoidance_lock only exists in debug build. Debug build does a lot of extra work and is expected to be slow and contended. Please evaluate release build only for performance.
Still hangs without debug flag:
Samples: 2M of event 'cpu-clock:pppH', Event count (approx.): 98419339809
Overhead Shared Object Symbol
34.06% [kernel] [k] _raw_spin_unlock_irq
19.23% libdynamorio.so [.] d_r_read_lock
11.20% [kernel] [k] cpuidle_enter_state
7.52% [kernel] [k] _raw_spin_unlock_irqrestore
6.48% [kernel] [k] el0_svc_naked
3.61% libdynamorio.so [.] dynamorio_syscall
0.98% libdynamorio.so [.] d_r_get_thread_id
0.58% [kernel] [k] fpsimd_load_state
0.45% [kernel] [k] sched_text_start
0.38% [kernel] [k] sys_sched_yield
0.36% [kernel] [k] _raw_spin_lock_irqsave
0.33% [kernel] [k] schedule
0.30% libsse.so [.] FFT_Float
0.26% perf [.] 0x000000000013491c
0.23% [kernel] [k] ret_fast_syscall
0.21% [kernel] [k] delay
0.20% [kernel] [k] local_bh_enable_ip
0.19% [kernel] [k] do_notify_resume
0.17% perf [.] 0x00000000000a2260
0.16% perf [.] 0x0000000000095108
0.15% [kernel] [k] fpsimd_restore_current_state
0.15% [snd_pcm] [k] snd_pcm_stream_unlock_irq
0.13% [kernel] [k] arm_smccc_smc
0.13% libc-2.25.so [.] strcmp
0.13% libsse.so [.] sseAnalysisGetFFTInput
0.12% libc-2.25.so [.] memcpy
0.11% libsse.so [.] StarRec_PAL_CplxFltMultSum
0.11% libdynamorio.so [.] get_dr_tls_base_addr
0.11% perf [.] 0x00000000000e481c
0.09% libsse.so [.] StarRec_PAL_CplxFltMultConjSum
0.09% perf [.] 0x00000000000e3dac
0.09% [kernel] [k] load_balance
0.09% [kernel] [k] __fget
0.09% [snd_pcm] [k] snd_pcm_stream_unlock_irqrestore
0.09% perf [.] 0x00000000000df324
0.09% libspeexdsp.so.1.5.0 [.] speex_resampler_process_int
0.08% [kernel] [k] memcpy
0.08% libpthread-2.25.so [.] pthread_mutex_lock
0.08% [kernel] [k] test_and_clear_bit
0.07% libspeexdsp.so.1.5.0 [.] 0x0000000000006bcc
0.07% libc-2.25.so [.] memset
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -t drcov -logdir /media/mp001/dynamorio -dump_binary -- myapp
Attaching with gdb shows that some threads are in dynamorio_syscall. One is in encode_opndsgen_3d000000_003fffff in instr_inline_api.h:379
During an strace run, a segfault appeared:
DrCov internal crash at PC 0x000000007117198c. Please report this at http://dynamorio.org/issues. Program aborted.
Received SIGSEGV at pc 0x000000007117198c in thread 2753
Base: 0x0000000071000000
Registers: eflags=0x0000000060000000
version 9.0.0, custom build
-no_dynamic_options -client_lib '/media/mp000/DynamoRIO-AArch64-Linux-9.0.0/tools/lib64/release/libdrcov.so;0;"-dump_binary"' -client_lib64 '/media/mp000/DynamoRIO-AArch64-Linux-9.0.0/tools/lib64/release/libdrcov.so;0;"-dump_binary"' -client_lib32 '/media/mp000/DynamoRIO-AArch64-Linux-9.0.0/tools/lib32/release/libdrcov.s
Another strace run. A large number of sched_yield calls is still being produced, and the app apparently does not recover from it.
[pid 3201] <... sched_yield resumed> ) = 0
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(dynamorio_syscall+0x2c) [0x180200]
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(d_r_read_lock+0xac) [0x3946c]
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(fragment_add_ibl_target+0x444) [0x27cd4]
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(d_r_dispatch+0x13d4) [0x3683c]
/usr/lib/libglslcompiler.so(PostEvaluateNode+0x360) [0x4d280]
/usr/lib/libglslcompiler.so(PostEvaluateNode+0x3dc) [0x4d2fc]
/usr/lib/libglslcompiler.so(PostEvaluateNode+0x3dc) [0x4d2fc]
/usr/lib/libglslcompiler.so(ICProcessNodeOperand+0xb8) [0x4d458]
/usr/lib/libglslcompiler.so(ProcessNodeBasicOp+0x6c) [0x4e7fc]
/usr/lib/libglslcompiler.so(ProcessNode+0x2b4) [0x4ed34]
/usr/lib/libglslcompiler.so(ProcessNode+0xc3c) [0x4f6bc]
/usr/lib/libglslcompiler.so(ICTraverseAST+0xbc) [0x5210c]
/usr/lib/libglslcompiler.so(ICTraverseAST+0xe34) [0x52e84]
/usr/lib/libglslcompiler.so(ICTraverseAST+0xe34) [0x52e84]
/usr/lib/libglslcompiler.so(ICTraverseAST+0xe34) [0x52e84]
/usr/lib/libglslcompiler.so(ICTraverseAST+0x10e4) [0x53134]
/usr/lib/libglslcompiler.so(GenerateICodeProgram+0x11c) [0x551fc]
/usr/lib/libglslcompiler.so(GLSLCompileToIntermediateCode+0x520) [0x5f060]
/usr/lib/libglslcompiler.so(GLSLCompileToUniflex+0x240) [0x9c6a0]
/usr/lib/libGLESv2.so(DoCompileShader+0x2d0) [0x610e0]
/usr/lib/libGLESv2.so(RecompileShader+0x4c) [0x67c9c]
/usr/lib/libGLESv2.so(GetRecompiledShaderVariantInternal+0x2fc) [0x694ec]
/usr/lib/libGLESv2.so(ProgramOptimisationRecompilationTask+0x163c) [0x6cf6c]
/usr/lib/libsrv_um.so(DeferredWorker+0xdc) [0x187fc]
/lib/libpthread-2.25.so(pthread_get_minstack+0x13d0) [0x6ef8]
/lib/libpthread-2.25.so(pthread_get_minstack+0x13d0) [0x6ef8]
[pid 3197] <... sched_yield resumed> ) = 0
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(dynamorio_syscall+0x2c) [0x180200]
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(d_r_read_lock+0xac) [0x3946c]
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(fragment_lookup_fine_and_coarse+0xc4) [0x27ec4]
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/lib64/release/libdynamorio.so(d_r_dispatch+0x584) [0x359ec]
Btw, will DynamoRIO typically increase the amount of stack memory a thread needs? Because I'm tight on that.
Still hangs without debug flag:
For an actual hang, getting symbolic callstacks for every thread is best to understand what's going on (deadlock? which lock? etc.).
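For example, something like this from gdb (illustrative; substitute the application's pid):
gdb -p <pid>
(gdb) thread apply all bt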
During an strace run, a segfault appeared:
Symbolic callstack is again best, and once the crash details are known, if there is no existing issue, best to file a new issue.
Btw, will DynamoRIO typically increase the amount of stack memory a thread needs? Because I'm tight on that.
No, DR's stacks are statically sized. -stack_size controls the size.
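For example (the 64K value is purely illustrative, and this assumes the option accepts the same size suffixes as -vm_size above), added to the drrun line used elsewhere in this thread:
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -stack_size 64K -t drcov -logdir /media/mp001/dynamorio -dump_binary -- /usr/bin/myapp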
Still hangs without debug flag:
Xref things like #5245, #5247
Does -no_ldstex2cas make a difference?
-no_ldstex2cas didn't work.
I made another thread apply all bt full dump in gdb, but I had to remove our symbols.
Ok, it does progress marginally further on average when using the flag.
The final state is hard to capture because the system slows down until it is unusable.
aarch64 (6 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          31.76   0.04    45.27     7.55    0.00  15.38
Device:     tps   Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
mmcblk0     12.31     504.73        7.82   1372371     21256
mmcblk0p3    0.12       1.26        0.90      3418      2448
mmcblk0p4    0.07       0.40        0.41      1083      1104
mmcblk0p5    0.01       0.14        0.04       386        96
mmcblk0p6    5.61     327.86        0.06    891450       168
mmcblk0p8    6.49     175.05        6.41    475954     17440
dm-0         0.06       0.13        0.01       352        28
sda          0.64       5.86        0.02     15934        64
sdb          6.58      97.33        0.02    264645        48
sdb1         6.58      97.33        0.02    264637        48
Is an iowait of 7.55% normal?
It also has phases where perf top shows _raw_spin_unlock_irq at nearly 100%.
Edit: I restricted it to 4 CPUs now, so the system remains usable and the app completely hogs ALL of them.
newdynamoriologging.txt looks like the very first comment in this issue: one thread trying to get the fragment hashtable write lock and waiting for a bunch of readers (with a whole bunch of readers running debug-build-only code).
4, 5, 11, 43, 44, 49, 53, 61, 67, 74, 81, 87, 93, 101, 105, 111, 117, 121, 132, 141, 144, 164, 165, 168, 169: read fragment lookup: debug-only
12, 14, 29, 39, 62: read fragment lookup: dispatch
170: write lock: fragment_add for build_basic_block_fragment
180: unlock enter_couldbelinking
So we're back to the comment above https://github.com/DynamoRIO/dynamorio/issues/4279#issuecomment-625914990 where I would re-suggest checking the exit reasons in the debug stats as this seems like a lot of time spent in dispatch if this is a period of little block building and it should be steady-state. Maybe there is some problem causing threads to exit the cache when they shouldn't and there's always a reader, blocking the rare writers who are building new blocks.
Look for the Fcache exits stats, e.g. at the end of a thread log file:
Fcache exits, total (thread): 1422
Fcache exits, system call executions (thread): 108
Fcache exits, from traces (thread): 160
Fcache exits, from BBs (thread): 1154
Fcache exits, total indirect branches (thread): 548
Fcache exits, non-trace indirect branches (thread): 446
Fcache exits, ind target not in cache (thread): 69
...
I would also run release build with -prof_pcs to see where time is spent, to confirm it's a problem with exits to dispatch. The log dir will have output files with a summary and then a detailed list. The summary looks like:
ITIMER distribution (20):
70.0% of time in INDIRECT BRANCH LOOKUP (14)
30.0% of time in FRAGMENT CACHE (6)
Will a short run suffice or do I have to go to the very end? During the late stages, the machine doesn't really respond anymore.
Btw, I made a run with the same app on an x64 Linux desktop and it works without problems.
Edit: I don't use the -debug flag. Why does it still run debug code? Can you tell me again the name of the function that indicates a debug build?
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -debug -loglevel 2 -prof_pcs -prof_pcs_heap_size 240000 -t drcov -logdir /media/mp001/dynamorio -dump_binary -- /usr/bin/myapp (code=exited, status=0/SUCCESS)
Aborted program after 2 minutes or so.
I had to be careful with proprietary info and only included the "Fcache exits" entries.
I have another run in the pipeline for the ITIMER distribution but it may take a while because the program slows down.
Ok, here we are: pcsamples.0.3804.html.zip
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -prof_pcs -prof_pcs_heap_size 240000 -t drcov -logdir /media/mp001/dynamorio -dump_binary -- /usr/bin/myapp
In steady state you want at least 90-95% of time in FRAGMENT CACHE, so this profile indicates massive overhead:
ITIMER distribution (11161):
11.8% of time in INTERPRETER (1317)
80.8% of time in DISPATCH (9013)
0.1% of time in MONITOR (6)
0.5% of time in SYSCALL HANDLER (55)
0.2% of time in CONTEXT SWITCH (22)
1.7% of time in INDIRECT BRANCH LOOKUP (186)
5.0% of time in FRAGMENT CACHE (557)
0.0% of time in UNKNOWN (5)
That's for ~280K basic blocks I assume based on the stats from https://github.com/DynamoRIO/dynamorio/issues/4279#issuecomment-622561977
11161 hits @ 10ms each is 112 seconds.
During periods of a lot of new code being executed you would expect overhead and INTERPRETER time, but not this massive DISPATCH time.
Block building is not super fast: IIRC DR takes about 1s to build 100K blocks. So ~300K blocks might take at least 3s to build. But the rest here seems to be the blocks being shared among many threads right from the start?
I would try -thread_private and see what happens.
Another thing to look at is KSTATS to understand where time is going. E.g., from a debug -checklevel 0 (to speed up debug build) run's log file:
Process KSTATS:
thread_measured:26504473804 totc, 17 num, 212466228 minc,1559086694 avg,23016297728 maxc, 16944740 self,26487529064 sub,
8319 ms, 0 ms out,total measured and propagated in thread
bb_building:23023119868 totc, 140634 num, 15652 minc, 163709 avg, 1901178 maxc,2302651016 self,20720468852 sub,
7226 ms, 0 ms out,in bb building
bb_decoding:2600946572 totc, 140634 num, 1872 minc, 18494 avg, 1459506 maxc,2600946572 self, 0 sub,
816 ms, 0 ms out,in bb decoding
bb_emit:15430386644 totc, 140634 num, 8546 minc, 109720 avg, 1852416 maxc, 80126850 self,15350259794 sub,
4843 ms, 0 ms out,in emitting BB
mangling:2689135636 totc, 140634 num, 944 minc, 19121 avg, 1004816 maxc,2689135636 self, 0 sub,
844 ms, 0 ms out,in mangling
emit:15350510724 totc, 140639 num, 7896 minc, 109148 avg, 1851802 maxc,15350510724 self, 0 sub,
4818 ms, 0 ms out,in emit
trace_building: 941456 totc, 19 num, 10082 minc, 49550 avg, 151324 maxc, 366008 self, 575448 sub,
0 ms, 0 ms out,in trace building
...
That's for ~280K basic blocks I assume based on the stats from #4279 (comment)
I cannot exactly reproduce the situation from an earlier run: I have to guess when to stop, since it doesn't even get past startup and I have to cancel it to get usable stats.
Run with thread_private:
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -thread_private -prof_pcs -prof_pcs_heap_size 240000 -t drcov -logdir /media/mp001/dynamorio -dump_binary -- /usr/bin/myapp
Last output from perf top before the system slows down to a crawl:
Samples: 916K of event 'cpu-clock:pppH', Event count (approx.): 101452777118
Overhead Shared Object Symbol
31.14% [kernel] [k] cpuidle_enter_state
14.65% libdynamorio.so [.] d_r_mutex_lock
6.06% [kernel] [k] _raw_spin_unlock_irq
1.65% [kernel] [k] _raw_spin_unlock_irqrestore
1.15% libdynamorio.so [.] d_r_read_lock
0.76% [kernel] [k] el0_svc_naked
0.69% libdynamorio.so [.] instr_writes_to_exact_reg
0.52% [snd_pcm] [k] snd_pcm_stream_unlock_irq
0.45% libdynamorio.so [.] hashtable_fragment_lookup.isra.0
0.40% libdynamorio.so [.] instr_reg_in_src
0.39% perf [.] 0x0000000000095108
0.37% [kernel] [k] arm_smccc_smc
0.34% libdynamorio.so [.] dynamorio_syscall
0.33% libjvm.so [.] 0x00000000009d2e08
0.25% libdrreg.so [.] drreg_event_bb_analysis
0.25% libdynamorio.so [.] opnd_uses_reg
0.25% libc-2.25.so [.] memcpy
0.25% [kernel] [k] delay
0.24% libsse.so [.] FFT_Float
0.23% perf [.] 0x0000000000095104
0.23% libdrreg.so [.] drvector_set_entry
0.23% libdynamorio.so [.] acquire_recursive_lock
0.22% libdynamorio.so [.] instr_reads_from_reg
0.21% libdynamorio.so [.] decode_common
I will try to produce logs after a successful abort.
I ran the app with the following command line:
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -debug -checklevel 0 -thread_private -prof_pcs -prof_pcs_heap_size 240000 -t drcov -logdir /media/mp001/dynamorio -dump_binary -- /usr/bin/myapp
But none of the logs seems to include a KSTATS section. pcsamples is even empty. As usual, it was very hard to end the program; it took 15 minutes after my Ctrl-C.
The pcsamples and the process-wide KSTATS are printed at the end. They should still be printed on any app exit (including crash) though; probably not on a DR crash since it assumes it can't trust its own internal state.
The other thing is that there are lock contention stats in the debug logs which can help figure out which locks are having problems:
25 lock 0x00007f589ef38420: name=bb_building_lock(mutex)@/home/bruening/dr/git/src/core/arch/interp.c:119
rank=5 owner=0 owning_dc=0x0000000000000000 contended_event=0xffffffff prev=0x00007f569e9a46c0
lock count_times_acquired= 641 count_times_contended= 43 0 count_times_spin_only= 30 max_contended_requests= 1+2 bb_building_lock(mutex)@/home/bruening/dr/git/src/core/arch/interp.c:119
I would look at the pcsamples for -thread_private and would expect the dispatch problem to have disappeared, and that lock contention has also disappeared as a problem, maybe replaced by memory usage and duplication of work building the same blocks for each thread.
I thought there was an explicit issue on possibly splitting the bb_building_lock up but I can't find it now. When we first created thread-shared caches we started with simple global locks and split up the ones that had contention issues, and were surprised to learn that keeping a single global bb_building_lock worked well for all our workloads. Maybe that is not the case here: but it seems best to prove to ourselves that this is indeed the problem. Adding the lock contention stats to the above callstacks piling up on the lock would help there, along with the thread_private stats. Having thread_private also perform horribly does not help the theory that it's the global locks: unless we can prove that the problems with thread_private are something unique to that mode (otherwise there could be some other issue that is shared across private and shared causing all the problems).
From the callstacks it wasn't necessarily the bb_building_lock contention: it was the shared block hashtable. Though if the underlying problematic behavior here is that all threads run the exact same large set of new code at the same time, then once they get through their hashtable lookups they'll all want to build the same blocks next. But maybe removing this builder waiting for all the readers would still be an improvement, even if the underlying issue of widely shared initialization code will always have some overhead.
Re: the prior comment: but -thread_private should get rid of that: so understanding why -thread_private has just as much overhead seems key to forward progress.
A screenshot of a perf trace: it seems to spend most of the time in d_r_dispatch itself. Can we be sure it enters that loop only once most of the time?
I assume you're still looking at thread-shared. Please analyze thread-private per the discussion, as it being slow is confusing. Repeating earlier questions:
No, the recent runs all use thread-private. I have problems producing another pcsamples output because the program slows down so much that I cannot shut it down properly.
I believe that stack trace in https://github.com/DynamoRIO/dynamorio/issues/4279#issuecomment-1020278239 with fragment_lookup_type acquiring a lock is impossible for -thread_private: look at the code at https://github.com/DynamoRIO/dynamorio/blob/master/core/fragment.c#L2625. -shared_traces and -shared_bbs are false for -thread_private. If you could verify that -thread_private is actually set for that run, and if so, please debug what is going on, since it seems impossible.
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -thread_private -prof_pcs -prof_pcs_heap_size 240000 -t drcov -logdir /media/mp001/dynamorio -dump_binary -- /usr/bin/myapp
This command line was used for #4279.
I set a breakpoint in gdb at fragment.c:2625 and it stops regularly at fragment.c:2718, probably because fragment_lookup_type is static. Can you name a line at which it should absolutely not stop if thread_private does what it should?
Edit: When I attach with gdb I almost immediately get a segfault here:
https://github.com/DynamoRIO/dynamorio/blob/master/core/utils.c#L1109
at /home/runner/work/dynamorio/dynamorio/core/utils.c:1109
No locals.
lookup_flags=15,
tag=0x7da03a83d4 <error: Cannot access memory at address 0x7da03a83d4>)
at /home/runner/work/dynamorio/dynamorio/core/fragment.c:2700
f = <optimized out>
tag=0x7da03a83d4 <error: Cannot access memory at address 0x7da03a83d4>,
dcontext=0x7da0808080)
at /home/runner/work/dynamorio/dynamorio/core/fragment.c:2625
f = <optimized out>
tag=0x7da03a83d4 <error: Cannot access memory at address 0x7da03a83d4>,
dcontext=0x7da0808080)
at /home/runner/work/dynamorio/dynamorio/core/fragment.c:2718
No locals.
tag=0x7da03a83d4 <error: Cannot access memory at address 0x7da03a83d4>,
wrapper=wrapper@entry=0x7da082efc8, last_exit=0x7daf3bd758)
at /home/runner/work/dynamorio/dynamorio/core/fragment.c:8066
res = <optimized out>
at /home/runner/work/dynamorio/dynamorio/core/dispatch.c:1453
next_f = <optimized out>
last_f = <optimized out>
coarse_f = <error reading variable coarse_f (Cannot access memory at address 0x7da082efc8)>
at /home/runner/work/dynamorio/dynamorio/core/dispatch.c:880
wherewasi = DR_WHERE_FCACHE
at /home/runner/work/dynamorio/dynamorio/core/dispatch.c:160
targetf = <optimized out>
coarse_f = <error reading variable coarse_f (Cannot access memory at address 0x7da082efc8)>
No symbol table info available.
No symbol table info available.
As mentioned, -shared_traces and -shared_bbs are off, so it shouldn't go into the if statements in fragment_lookup_type at lines 2682 or 2696. It shouldn't call d_r_read_lock at all from fragment_lookup_type. That's the whole point of thread-private: there are no locks for routine operations on private code caches.
See the edit to my last comment.
Did you sanity-check the options? Print dynamo_options.thread_private and dynamo_options.shared_bbs. Is this the same pid as the drrun process, or some child or grandchild? Is there a config file for the children overriding the parent options?
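For example, from an attached gdb session (same style as the stats query earlier in this thread):
(gdb) p dynamo_options.thread_private
(gdb) p dynamo_options.shared_bbs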
The application uses pthreads with LWPs; each thread has its own LWP. I did not change any of the DynamoRIO config files, if that is what you mean. I shall do another run to read out the variables.
BTW: is it possible that DynamoRIO has problems with gdb? Valgrind does.
Ok, this looks interesting: Setting a breakpoint in fragment.c:2718 indicates both dynamo_options.thread_private and dynamo_options.shared_bbs being true.
/media/mp001/DynamoRIO-AArch64-Linux-9.0.0/bin64/drrun -root /media/mp001/DynamoRIO-AArch64-Linux-9.0.0 -thread_private -t drcov -logdir /media/mp001/dynamorio -dump_binary -- /usr/bin/myapp
/* PR 361894: if no TLS available, we fall back to thread-private */
PC_OPTION_DEFAULT(bool, shared_bbs, IF_HAVE_TLS_ELSE(true, false), "use thread-shared basic blocks")
Are you sure shared_bbs is false by default? My AArch64 CPU has TLS.
Edit: Even when adding -no_shared_bbs, dynamo_options.shared_bbs is still true at 2718.
-shared_bbs is true by default, but -thread_private turns it off.
I see this issue: #1884. It looks like nobody ever added private cache support to ARM or AArch64. I assume debug build gives an error message, and all the above runs trying to enable -thread_private were release build, which corrects some options and continues (a legacy of the old days of running Windows services, where having a fatal error in release build bring down the machine over a misspelled option was considered unacceptable; maybe that should be changed now).
Do you have an explanation for the segfaults that occur only when I use gdb?
Did you rule out safe reads and other expected behavior? https://dynamorio.org/page_debugging.html#autotoc_md140
Yeah, that's probably it.
Recently there were several fixes that might have an impact here, especially those adding missing load-acquire barriers in mutex spinlock code and in release-build statistics code, which could explain slowdowns and hangs. Could you try your app with PR #5367, PR #5370, and PR #5375?
I'm out of office for the weekend, but I'll happily try on Monday. Can you tell me when the next weekly build will arrive?
Can you tell me when the next weekly build will arrive?
Friday evenings -- so conveniently tonight's will include all 3 of those fixes.
I tested DynamoRIO-AArch64-Linux-8.0.0-1 and drcov with a large application of about 150 threads on Yocto Linux. The program seems to get stuck at certain points. Attaching with gdb shows that the program still runs but contends on a mutex that has to do with fragment writing/reading:
DynamoRIO-AArch64-Linux-8.0.0-1
The readers (dozens):
The writer who slows everything down:
Also, strace shows an above average time spent in sched_yield.
Derek Bruening told me to do some profiling in https://github.com/DynamoRIO/dynamorio/issues/4278. Please tell me what you need to know.