DynamoRIO / drmemory

Memory Debugger for Windows, Linux, Mac, and Android

[drheap] online symbolization is a huge performance bottleneck for large apps #1447

Open derekbruening opened 9 years ago

derekbruening commented 9 years ago

From bruen...@google.com on February 20, 2014 16:30:57

* TODO why is drheapstat so much slower than drmem full mode?

After adding online symbolization (issue #1308), drheap has gotten slow: xref issue #1309, where it hit the app_suite timeout. Here it's very slow on chromium tests.

% /usr/bin/time ~/drmemory/git/build_drheap_rel/bin/drheapstat.pl -dr d:/derek/dr/git/exports -- ./net_unittests.exe --gtest_filter=BitmapTest.Basics --ui-test-action-timeout=12000000 --ui-test-action-max-timeout=28000000 --ui-test-terminate-timeout=12000000

0.15user 0.35system 1:48.26elapsed 0%CPU (0avgtext+0avgdata 9764864maxresident)k

vs drmemory:

* -light w/o symcache: 0.00user 0.01system 0:12.07elapsed 0%CPU (0avgtext+0avgdata 262656maxresident)k
* -light w/ symcache: 0.00user 0.00system 0:02.15elapsed 0%CPU (0avgtext+0avgdata 262912maxresident)k
* full mode w/ symcache: 0.00user 0.01system 0:10.57elapsed 0%CPU (0avgtext+0avgdata 262912maxresident)k

back to drheap:

* w/ symcache: 0.16user 0.24system 1:27.72elapsed 0%CPU (0avgtext+0avgdata 9764864maxresident)k
* -no_staleness: 0.19user 0.21system 1:20.49elapsed 0%CPU (0avgtext+0avgdata 9764864maxresident)k
* -no_check_leaks -no_staleness: 0.18user 0.28system 1:19.61elapsed 0%CPU (0avgtext+0avgdata 9764864maxresident)k
* -callstack_max_frames 4 -no_check_leaks -no_staleness: 0.20user 0.33system 0:20.12elapsed 2%CPU (0avgtext+0avgdata 9764864maxresident)k

* DONE analyze in CodeAnalyst => it's all in symbolizing the (unique) malloc callstacks CLOSED: [2014-02-20 Thu 16:02]

data on -callstack_max_frames 4 -no_check_leaks -no_staleness:

Somehow there are two net_unittests.exe processes: ok, it's b/c I don't have RUNNING_ON_VALGRIND=1 set.

| Process Name | 64-bit | Timer samples |
| --- | --- | --- |
| dbghelp.dll | | 66.93 |
| msvcrt.dll | | 11.69 |
| unknown module pid (68884) | 1 | 8.77 |
| ntoskrnl.exe | 1 | 5.09 |
| drheapstat.dll | | 4 |
| ntdll.dll | | 0.95 |

6 modules, Total: 21995 samples, 97.43% of shown samples

dbghelp:

| CS:EIP | Symbol + Offset | Timer samples |
| --- | --- | --- |
| 0x6622d650 | MiniDumpReadDumpStream | 98.16 |
| 0x661fcd70 | GetTimestampForLoadedLibrary | 1.09 |
| 0x66210a40 | ImagehlpApiVersionEx | 0.2 |
| 0x6620c8d0 | SymEnumSourceFilesW | 0.04 |
| 0x661ebb8d | NO SYMBOL | 0.03 |
| 0x66206d00 | SymGetLineFromAddrW64 | 0.03 |

6 functions, 1315 instructions, Total: 15159 samples, 99.55% of shown samples, 1.00% of total session samples

msvcrt.dll: all in bsearch, called from MiniDumpReadDumpStream

10449 callstacks

removing the packed_callstack_print() line:

* -callstack_max_frames 4 -no_check_leaks -no_staleness: 0.19user 0.27system 0:02.39elapsed 19%CPU (0avgtext+0avgdata 9764864maxresident)k
* -no_check_leaks -no_staleness: 0.16user 0.25system 0:03.27elapsed 13%CPU (0avgtext+0avgdata 9830400maxresident)k
* -no_staleness: 0.10user 0.30system 0:03.49elapsed 11%CPU (0avgtext+0avgdata 9830400maxresident)k
* defaults: 0.15user 0.30system 0:06.15elapsed 7%CPU (0avgtext+0avgdata 9830400maxresident)k

So symbolizing those 10K callstacks takes 81 seconds. Wow.
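(Back of the envelope from the timings above: 1:27.72 elapsed with symbolization vs. 0:06.15 with the packed_callstack_print() call removed is ~81.5 s of symbolization work, i.e. roughly 81 s / 10449 ≈ 8 ms per callstack.)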

* TODO breakdown w/o packed_callstack_print()

this is -no_check_leaks -no_staleness

| Process Name | 64-bit | Timer samples |
| --- | --- | --- |
| dynamorio.dll | | 49.27 |
| drheapstat.dll | | 27.96 |
| ntoskrnl.exe | 1 | 11.09 |
| unknown module pid (16424) | 1 | 5.72 |
| dbghelp.dll | | 2.25 |
| ntdll.dll | | 1.59 |
| fltmgr.sys | 1 | 0.84 |

7 modules, Total: 3294 samples, 98.71% of shown samples

dynamorio.dll:

| CS:EIP | Symbol + Offset | Timer samples |
| --- | --- | --- |
| 0x65a91a80 | ulong_to_str | 35.71 |
| 0x65a767b0 | mutex_lock | 7.79 |
| 0x65a92030 | our_vsnprintf | 7.6 |
| 0x65a747f0 | release_recursive_lock | 4.32 |
| 0x65ac1781 | safe_read_asm_mid | 2.8 |
| 0x65a74740 | mutex_unlock | 2.25 |

6 functions, 115 instructions, Total: 994 samples, 60.46% of shown samples, 0.45% of total session samples

drheapstat.dll:

| CS:EIP | Symbol + Offset | Timer samples |
| --- | --- | --- |
| 0x73812f30 | find_next_fp | 55.09 |
| 0x73812480 | is_in_module | 7.07 |
| 0x73815e20 | crc32_whole_and_half | 6.65 |
| 0x73801550 | copy_snapshot | 4.5 |
| 0x73815c10 | rb_in_node | 2.25 |
| 0x73812800 | address_to_frame | 2.04 |
| 0x73813810 | print_callstack | 2.04 |
| 0x7382e920 | drmgr_bb_event | 1.93 |
| 0x738123e0 | module_lookup | 1.29 |

9 functions, 118 instructions, Total: 773 samples, 82.85% of shown samples, 0.35% of total session samples

CSS:

| Name | Address | Self | Children | Total |
| --- | --- | --- | --- | --- |
| D:...\build_drheap_rel\drheapstat.dll | 0x73800000 | 99 | 1134 | 1233 |
| client_add_malloc_pre | 0x738019d0 | 0 | 99 | 99 |
| notify_client_alloc | 0x7380bf70 | 0 | 98 | 98 |
| packed_callstack_record | 0x73813da0 | 0 | 88 | 88 |
| replace_alloc_common | 0x7380cd10 | 0 | 88 | 88 |
| replace_malloc | 0x7380f0d0 | 0 | 82 | 82 |
| print_callstack | 0x73813810 | 1 | 80 | 81 |
| dump_snapshot | 0x73801150 | 2 | 77 | 79 |

7 functions, Total: 3 self samples, 612 child samples

ulong_to_str: called from our_vsnprintf from dump_snapshot

binary format would speed things up => issue #1332 (rough sketch below)

* TODO try caching lookups?

opposite direction of symcache
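Back to the binary format idea (issue #1332): here is a rough sketch of the contrast, with made-up struct and function names (this is not the actual drheapstat code). The text path does a printf-style conversion per frame, which is exactly the ulong_to_str/our_vsnprintf work showing up in the profile, while a binary snapshot just writes the raw PCs and defers formatting and symbolization to a postprocessing step.

```c
/* Hypothetical illustration only: struct and function names are made up. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_FRAMES 20

typedef struct {
    uint32_t num_frames;
    uint32_t frames[MAX_FRAMES];   /* raw PCs (32-bit app) */
} packed_callstack_t;

/* Text snapshot: one formatted conversion per frame -- the kind of work
 * the profile attributes to ulong_to_str via our_vsnprintf. */
static void dump_snapshot_text(FILE *f, const packed_callstack_t *pcs)
{
    for (uint32_t i = 0; i < pcs->num_frames; i++)
        fprintf(f, "(0x%08" PRIx32 ")\n", pcs->frames[i]);
}

/* Binary snapshot (the issue #1332 direction): write the raw addresses
 * and let a postprocessing step do the formatting and symbolization. */
static void dump_snapshot_binary(FILE *f, const packed_callstack_t *pcs)
{
    fwrite(&pcs->num_frames, sizeof(pcs->num_frames), 1, f);
    fwrite(pcs->frames, sizeof(pcs->frames[0]), pcs->num_frames, f);
}
```

The tradeoff is that the raw log is no longer human-readable without that postprocess step.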

* TODO switch back to offline symbolization?

* TODO or have a sideline thread symbolize, instead of the old separate-process idea

Though note that here the app would finish in 6 seconds and then wait over a minute for the sideline thread to finish. We could parallelize across multiple symbolization threads, at the risk of more impact on the app.

From a usage point of view, if I have to wait a full minute, it's nice to know what the tool is doing, so a separate step that I invoke explicitly seems better. Offline symbolization can also be parallelized more easily and more aggressively. Plus, here the user is going to launch a visualizer anyway, so there's already a postprocessing step, unlike Dr. Memory.
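For illustration, here is a minimal sketch of the sideline-thread shape, assuming a simple fixed-size queue and pthreads; the names and sizes are made up, and the real tool would use its own threading primitives and the actual dbghelp calls rather than the printf stand-in.

```c
/* Hypothetical sketch only: names, queue size, and the pthreads usage are
 * illustrative, not the actual tool code. */
#include <inttypes.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define QUEUE_CAP  4096
#define MAX_FRAMES 20

typedef struct {
    uint32_t  num_frames;
    uintptr_t frames[MAX_FRAMES];
} packed_callstack_t;

static packed_callstack_t queue[QUEUE_CAP];
static size_t head, tail;              /* monotonically increasing */
static bool shutting_down;             /* set by the tool at exit */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

/* App-side: cheap -- just copy the raw PCs, no symbolization here. */
void enqueue_callstack(const packed_callstack_t *pcs)
{
    pthread_mutex_lock(&lock);
    if (tail - head < QUEUE_CAP)       /* drop on overflow, for brevity */
        queue[tail++ % QUEUE_CAP] = *pcs;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

/* Sideline thread: all of the slow address->symbol work (the dbghelp
 * queries in the real tool) happens here, off the app's critical path. */
void *sideline_symbolize(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail && !shutting_down)
            pthread_cond_wait(&nonempty, &lock);
        if (head == tail) {            /* drained and shutting down */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        packed_callstack_t pcs = queue[head++ % QUEUE_CAP];
        pthread_mutex_unlock(&lock);
        for (uint32_t i = 0; i < pcs.num_frames; i++)
            printf("0x%" PRIxPTR "\n", pcs.frames[i]); /* symbolize here */
    }
}
```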

Original issue: http://code.google.com/p/drmemory/issues/detail?id=1447

derekbruening commented 9 years ago

From bruen...@google.com on February 20, 2014 13:46:27

* TODO try caching lookups?

opposite direction of symcache

on this net_unittests run, we have:

grep -E -o '\(0x........ ' logs/DrHeapstat-net_unittests.exe.14048.000/callstack.log | wc

376123 376123 4889599

grep -E -o '\(0x........ ' logs/DrHeapstat-net_unittests.exe.14048.000/callstack.log | sort | uniq -c | wc

11059 22118 232239

grep -E -o '\(0x........ ' logs/DrHeapstat-net_unittests.exe.14048.000/callstack.log | sort | uniq -c | sort -n

  1 (0x00d11212 
  1 (0x00d11370 
  1 (0x00d120b3 
  1 (0x00d1302f 
  1 (0x00d13079 
  1 (0x00d1720f 
...

 1700 (0x017443f8
 1706 (0x0129ae52
 2060 (0x017f2957
 2273 (0x0174af92
 2273 (0x0174c574
 2532 (0x0174447c
 2532 (0x01744cd9
 2532 (0x01744cfc
 2532 (0x01744d0d
 2532 (0x0174af77
 2532 (0x0174c5a9
 3297 (0x0174c537
 3360 (0x01745128
 3419 (0x00e23474
 4858 (0x0174c5d7
 8582 (0x01745121
 9345 (0x00d1266e
 9887 (0x00d12561
11949 (0x0174c652
19343 (0x00d121cf
31915 (0x017f28f9
33593 (0x7062233b
33736 (0x7380f0d0
33979 (0x7674336a
33979 (0x77879f45
33979 (0x77879f72

There is a long tail: a lot of singletons.

So the majority would be hits in the cache, but maybe not a high enough rate.

If we stored everything in an in-memory non-persistent cache, we'd go from 376K to 11K queries => 81 s × 11059/376123 ≈ 2.4 s instead of 81 s, plus overhead for the cache itself. Say 3 s. That's on top of a 6 s runtime: still 50% overhead. Can we live w/ that, or just go w/ offline symbolization?
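For concreteness, a minimal sketch of such an in-memory, non-persistent cache, assuming a simple open-addressed table keyed on the raw PC; names, sizes, and the stubbed slow lookup are hypothetical, not the actual implementation:

```c
/* Hypothetical sketch of the per-run, in-memory cache discussed above:
 * key on the raw PC, store the formatted symbol string, and only fall
 * through to the expensive lookup on a miss.  With ~11K unique addresses
 * out of ~376K queries, nearly every lookup would hit.  All names and
 * sizes here are made up for illustration. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_SLOTS (64 * 1024)   /* power of two, comfortably > 11K PCs */

typedef struct {
    uintptr_t pc;                 /* 0 == empty slot */
    char sym[256];                /* "module!function+0xoff" style text */
} sym_cache_entry_t;

static sym_cache_entry_t cache[CACHE_SLOTS];

/* Stand-in for the real, slow query (dbghelp's SymFromAddr +
 * SymGetLineFromAddr64 in the Windows case); stubbed out here. */
static void slow_symbol_lookup(uintptr_t pc, char *buf, size_t bufsz)
{
    snprintf(buf, bufsz, "<unsymbolized 0x%" PRIxPTR ">", pc);
}

const char *lookup_symbol_cached(uintptr_t pc)
{
    size_t idx = (size_t)(pc * 2654435761u) & (CACHE_SLOTS - 1); /* cheap hash */
    /* Linear probing; assumes the table stays well below capacity. */
    while (cache[idx].pc != 0 && cache[idx].pc != pc)
        idx = (idx + 1) & (CACHE_SLOTS - 1);
    if (cache[idx].pc == 0) {     /* miss: do the expensive query once */
        cache[idx].pc = pc;
        slow_symbol_lookup(pc, cache[idx].sym, sizeof(cache[idx].sym));
    }
    return cache[idx].sym;
}
```

A real version would also have to invalidate entries when a module is unloaded or reloaded at a different base, since the key here is an absolute address.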