DynamoRIO / drmemory

Memory Debugger for Windows, Linux, Mac, and Android

[drheap] online symbolization is a huge performance bottleneck for large apps #1447

Open derekbruening opened 9 years ago

derekbruening commented 9 years ago

From bruen...@google.com on February 20, 2014 16:30:57

* TODO why is drheapstat so much slower than drmem full mode?

After adding online symbolization (issue #1308), drheap has gotten slow: xref issue #1309, where it hit the app_suite timeout. Here it's very slow on chromium tests.

% /usr/bin/time ~/drmemory/git/build_drheap_rel/bin/drheapstat.pl -dr d:/derek/dr/git/exports -- ./net_unittests.exe --gtest_filter=BitmapTest.Basics --ui-test-action-timeout=12000000 --ui-test-action-max-timeout=28000000 --ui-test-terminate-timeout=12000000

0.15user 0.35system 1:48.26elapsed 0%CPU (0avgtext+0avgdata 9764864maxresident)k

vs drmemory:

* -light w/o symcache: 0.00user 0.01system 0:12.07elapsed 0%CPU (0avgtext+0avgdata 262656maxresident)k
* -light w/ symcache: 0.00user 0.00system 0:02.15elapsed 0%CPU (0avgtext+0avgdata 262912maxresident)k
* full mode w/ symcache: 0.00user 0.01system 0:10.57elapsed 0%CPU (0avgtext+0avgdata 262912maxresident)k

back to drheap:

* w/ symcache: 0.16user 0.24system 1:27.72elapsed 0%CPU (0avgtext+0avgdata 9764864maxresident)k
* -no_staleness: 0.19user 0.21system 1:20.49elapsed 0%CPU (0avgtext+0avgdata 9764864maxresident)k
* -no_check_leaks -no_staleness: 0.18user 0.28system 1:19.61elapsed 0%CPU (0avgtext+0avgdata 9764864maxresident)k
* -callstack_max_frames 4 -no_check_leaks -no_staleness: 0.20user 0.33system 0:20.12elapsed 2%CPU (0avgtext+0avgdata 9764864maxresident)k

* DONE analyze in CodeAnalyst => it's all in symbolizing the (unique) malloc callstacks CLOSED: [2014-02-20 Thu 16:02]

data on -callstack_max_frames 4 -no_check_leaks -no_staleness:

Somehow there are two net_unittests.exe processes: ok, it's b/c I don't have RUNNING_ON_VALGRIND=1 set.

| Process Name | 64-bit | Timer samples |
| --- | --- | --- |
| dbghelp.dll | | 66.93 |
| msvcrt.dll | | 11.69 |
| unknown module pid (68884) | 1 | 8.77 |
| ntoskrnl.exe | 1 | 5.09 |
| drheapstat.dll | | 4 |
| ntdll.dll | | 0.95 |

6 modules, Total: 21995 samples, 97.43% of shown samples

dbghelp:

| CS:EIP | Symbol + Offset | Timer samples |
| --- | --- | --- |
| 0x6622d650 | MiniDumpReadDumpStream | 98.16 |
| 0x661fcd70 | GetTimestampForLoadedLibrary | 1.09 |
| 0x66210a40 | ImagehlpApiVersionEx | 0.2 |
| 0x6620c8d0 | SymEnumSourceFilesW | 0.04 |
| 0x661ebb8d | NO SYMBOL | 0.03 |
| 0x66206d00 | SymGetLineFromAddrW64 | 0.03 |

6 functions, 1315 instructions, Total: 15159 samples, 99.55% of shown samples, 1.00% of total session samples

msvcrt.dll: all in bsearch, called from MiniDumpReadDumpStream

10449 callstacks

removing the packed_callstack_print() line:

* -callstack_max_frames 4 -no_check_leaks -no_staleness: 0.19user 0.27system 0:02.39elapsed 19%CPU (0avgtext+0avgdata 9764864maxresident)k
* -no_check_leaks -no_staleness: 0.16user 0.25system 0:03.27elapsed 13%CPU (0avgtext+0avgdata 9830400maxresident)k
* -no_staleness: 0.10user 0.30system 0:03.49elapsed 11%CPU (0avgtext+0avgdata 9830400maxresident)k
* defaults: 0.15user 0.30system 0:06.15elapsed 7%CPU (0avgtext+0avgdata 9830400maxresident)k

So symbolizing those 10K callstacks takes 81 seconds. Wow.
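(Back of the envelope from the timings above: 1:27.72 elapsed with symbolization vs. 0:06.15 with the packed_callstack_print() call removed is ~81.5 s of symbolization work, i.e. roughly 81 s / 10449 ≈ 8 ms per callstack.)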

* TODO breakdown w/o packed_callstack_print()

this is -no_check_leaks -no_staleness

| Process Name | 64-bit | Timer samples |
| --- | --- | --- |
| dynamorio.dll | | 49.27 |
| drheapstat.dll | | 27.96 |
| ntoskrnl.exe | 1 | 11.09 |
| unknown module pid (16424) | 1 | 5.72 |
| dbghelp.dll | | 2.25 |
| ntdll.dll | | 1.59 |
| fltmgr.sys | 1 | 0.84 |

7 modules, Total: 3294 samples, 98.71% of shown samples

dynamorio.dll:

| CS:EIP | Symbol + Offset | Timer samples |
| --- | --- | --- |
| 0x65a91a80 | ulong_to_str | 35.71 |
| 0x65a767b0 | mutex_lock | 7.79 |
| 0x65a92030 | our_vsnprintf | 7.6 |
| 0x65a747f0 | release_recursive_lock | 4.32 |
| 0x65ac1781 | safe_read_asm_mid | 2.8 |
| 0x65a74740 | mutex_unlock | 2.25 |

6 functions, 115 instructions, Total: 994 samples, 60.46% of shown samples, 0.45% of total session samples

drheapstat.dll:

| CS:EIP | Symbol + Offset | Timer samples |
| --- | --- | --- |
| 0x73812f30 | find_next_fp | 55.09 |
| 0x73812480 | is_in_module | 7.07 |
| 0x73815e20 | crc32_whole_and_half | 6.65 |
| 0x73801550 | copy_snapshot | 4.5 |
| 0x73815c10 | rb_in_node | 2.25 |
| 0x73812800 | address_to_frame | 2.04 |
| 0x73813810 | print_callstack | 2.04 |
| 0x7382e920 | drmgr_bb_event | 1.93 |
| 0x738123e0 | module_lookup | 1.29 |

9 functions, 118 instructions, Total: 773 samples, 82.85% of shown samples, 0.35% of total session samples

CSS:

| Name | Address | Self | Children | Total |
| --- | --- | --- | --- | --- |
| D:...\build_drheap_rel\drheapstat.dll | 0x73800000 | 99 | 1134 | 1233 |
| client_add_malloc_pre | 0x738019d0 | 0 | 99 | 99 |
| notify_client_alloc | 0x7380bf70 | 0 | 98 | 98 |
| packed_callstack_record | 0x73813da0 | 0 | 88 | 88 |
| replace_alloc_common | 0x7380cd10 | 0 | 88 | 88 |
| replace_malloc | 0x7380f0d0 | 0 | 82 | 82 |
| print_callstack | 0x73813810 | 1 | 80 | 81 |
| dump_snapshot | 0x73801150 | 2 | 77 | 79 |

7 functions, Total: 3 self samples, 612 child samples

ulong_to_str: called from our_vsnprintf from dump_snapshot

binary format would speed things up => issue #1332 (rough sketch below)

* TODO try caching lookups?

opposite direction of symcache
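Back to the binary format idea (issue #1332): here is a rough sketch of the contrast, with made-up struct and function names (this is not the actual drheapstat code). The text path does a printf-style conversion per frame, which is exactly the ulong_to_str/our_vsnprintf work showing up in the profile, while a binary snapshot just writes the raw PCs and defers formatting and symbolization to a postprocessing step.

```c
/* Hypothetical illustration only: struct and function names are made up. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_FRAMES 20

typedef struct {
    uint32_t num_frames;
    uint32_t frames[MAX_FRAMES];   /* raw PCs (32-bit app) */
} packed_callstack_t;

/* Text snapshot: one formatted conversion per frame -- the kind of work
 * the profile attributes to ulong_to_str via our_vsnprintf. */
static void dump_snapshot_text(FILE *f, const packed_callstack_t *pcs)
{
    for (uint32_t i = 0; i < pcs->num_frames; i++)
        fprintf(f, "(0x%08" PRIx32 ")\n", pcs->frames[i]);
}

/* Binary snapshot (the issue #1332 direction): write the raw addresses
 * and let a postprocessing step do the formatting and symbolization. */
static void dump_snapshot_binary(FILE *f, const packed_callstack_t *pcs)
{
    fwrite(&pcs->num_frames, sizeof(pcs->num_frames), 1, f);
    fwrite(pcs->frames, sizeof(pcs->frames[0]), pcs->num_frames, f);
}
```

The tradeoff is that the raw log is no longer human-readable without that postprocess step.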

* TODO switch back to offline symbolization?

* TODO or have a sideline thread symbolize, instead of the old separate-process idea

Though note that here the app would finish in 6 seconds and then wait over a minute for the sideline thread to finish. We could parallelize across multiple symbolization threads, at the risk of more impact on the app.

From a usage point of view, if I have to wait a full minute, it's nice to know what the tool is doing, so a separate step that I invoke explicitly seems better. Offline symbolization can also be parallelized more easily and more aggressively. Plus, here the user is going to launch a visualizer anyway, so there's already a postprocessing step, unlike Dr. Memory.
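For illustration, here is a minimal sketch of the sideline-thread shape, assuming a simple fixed-size queue and pthreads; the names and sizes are made up, and the real tool would use its own threading primitives and the actual dbghelp calls rather than the printf stand-in.

```c
/* Hypothetical sketch only: names, queue size, and the pthreads usage are
 * illustrative, not the actual tool code. */
#include <inttypes.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define QUEUE_CAP  4096
#define MAX_FRAMES 20

typedef struct {
    uint32_t  num_frames;
    uintptr_t frames[MAX_FRAMES];
} packed_callstack_t;

static packed_callstack_t queue[QUEUE_CAP];
static size_t head, tail;              /* monotonically increasing */
static bool shutting_down;             /* set by the tool at exit */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

/* App-side: cheap -- just copy the raw PCs, no symbolization here. */
void enqueue_callstack(const packed_callstack_t *pcs)
{
    pthread_mutex_lock(&lock);
    if (tail - head < QUEUE_CAP)       /* drop on overflow, for brevity */
        queue[tail++ % QUEUE_CAP] = *pcs;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

/* Sideline thread: all of the slow address->symbol work (the dbghelp
 * queries in the real tool) happens here, off the app's critical path. */
void *sideline_symbolize(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail && !shutting_down)
            pthread_cond_wait(&nonempty, &lock);
        if (head == tail) {            /* drained and shutting down */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        packed_callstack_t pcs = queue[head++ % QUEUE_CAP];
        pthread_mutex_unlock(&lock);
        for (uint32_t i = 0; i < pcs.num_frames; i++)
            printf("0x%" PRIxPTR "\n", pcs.frames[i]); /* symbolize here */
    }
}
```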

Original issue: http://code.google.com/p/drmemory/issues/detail?id=1447

derekbruening commented 9 years ago

From bruen...@google.com on February 20, 2014 13:46:27

* TODO try caching lookups?

opposite direction of symcache

on this net_unittests run, we have:

grep -E -o '\(0x........ ' logs/DrHeapstat-net_unittests.exe.14048.000/callstack.log | wc

376123 376123 4889599

grep -E -o '\(0x........ ' logs/DrHeapstat-net_unittests.exe.14048.000/callstack.log | sort | uniq -c | wc

11059 22118 232239

grep -E -o '\(0x........ ' logs/DrHeapstat-net_unittests.exe.14048.000/callstack.log | sort | uniq -c | sort -n

  1 (0x00d11212 
  1 (0x00d11370 
  1 (0x00d120b3 
  1 (0x00d1302f 
  1 (0x00d13079 
  1 (0x00d1720f 
...

 1700 (0x017443f8
 1706 (0x0129ae52
 2060 (0x017f2957
 2273 (0x0174af92
 2273 (0x0174c574
 2532 (0x0174447c
 2532 (0x01744cd9
 2532 (0x01744cfc
 2532 (0x01744d0d
 2532 (0x0174af77
 2532 (0x0174c5a9
 3297 (0x0174c537
 3360 (0x01745128
 3419 (0x00e23474
 4858 (0x0174c5d7
 8582 (0x01745121
 9345 (0x00d1266e
 9887 (0x00d12561
11949 (0x0174c652
19343 (0x00d121cf
31915 (0x017f28f9
33593 (0x7062233b
33736 (0x7380f0d0
33979 (0x7674336a
33979 (0x77879f45
33979 (0x77879f72

There is a long tail: a lot of singletons.

So the majority would be hits in the cache, but maybe not a high enough rate.

If we stored everything in an in-memory non-persistent cache, we'd go from 376K to 11K queries => 81 s × 11059/376123 ≈ 2.4 s instead of 81 s, plus overhead for the cache itself. Say 3 s. That's on top of a 6 s runtime: still 50% overhead. Can we live w/ that, or just go w/ offline symbolization?
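For concreteness, a minimal sketch of such an in-memory, non-persistent cache, assuming a simple open-addressed table keyed on the raw PC; names, sizes, and the stubbed slow lookup are hypothetical, not the actual implementation:

```c
/* Hypothetical sketch of the per-run, in-memory cache discussed above:
 * key on the raw PC, store the formatted symbol string, and only fall
 * through to the expensive lookup on a miss.  With ~11K unique addresses
 * out of ~376K queries, nearly every lookup would hit.  All names and
 * sizes here are made up for illustration. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_SLOTS (64 * 1024)   /* power of two, comfortably > 11K PCs */

typedef struct {
    uintptr_t pc;                 /* 0 == empty slot */
    char sym[256];                /* "module!function+0xoff" style text */
} sym_cache_entry_t;

static sym_cache_entry_t cache[CACHE_SLOTS];

/* Stand-in for the real, slow query (dbghelp's SymFromAddr +
 * SymGetLineFromAddr64 in the Windows case); stubbed out here. */
static void slow_symbol_lookup(uintptr_t pc, char *buf, size_t bufsz)
{
    snprintf(buf, bufsz, "<unsymbolized 0x%" PRIxPTR ">", pc);
}

const char *lookup_symbol_cached(uintptr_t pc)
{
    size_t idx = (size_t)(pc * 2654435761u) & (CACHE_SLOTS - 1); /* cheap hash */
    /* Linear probing; assumes the table stays well below capacity. */
    while (cache[idx].pc != 0 && cache[idx].pc != pc)
        idx = (idx + 1) & (CACHE_SLOTS - 1);
    if (cache[idx].pc == 0) {     /* miss: do the expensive query once */
        cache[idx].pc = pc;
        slow_symbol_lookup(pc, cache[idx].sym, sizeof(cache[idx].sym));
    }
    return cache[idx].sym;
}
```

A real version would also have to invalidate entries when a module is unloaded or reloaded at a different base, since the key here is an absolute address.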