improve performance of pointer scan for leak reachability analysis

derekbruening commented 10 years ago

From derek.br...@gmail.com on December 10, 2010 17:58:04

PR 475518

my initial implementation of scanning for pointers for reachability analysis produces a noticeable 1-second delay at app exit (and at nudge time for PR 428709). this case covers trying to speed that up. it's not identifying the defined region that's slow, it's the pointer scan itself.

adding to this case a FIXME from my code that may improve perf noticeably: skipping regions that have been read-only since being loaded (.text, .rodata, etc.). we'll need extra tracking since today we can't tell whether they were made writable and then changed back to read-only. we could also skip writable regions that have not been written to since being loaded since they can't point into the heap. technically the app could write a heap pointer to a file and then mmap it in but we'll ignore that.

my initial imp is going to skip r-x regions. I'm not going to skip all non-writable though. this case covers adding modified-since-load info to both avoid false neg w/ r-x and to skip more non-writable.

it does take substantial time on large apps. we should probably fix up -no_count_leaks and document it, and perhaps make it default so that nudges can be used for just updating the error summary w/o waiting 8 minutes? just turning it off today => bugs: maybe somebody else needs op_record_allocs in alloc.c. we still record for pre_us.

for PR 485354 I have the leak scan on windows skipping read-only image regions: once this case checks for history that can be done for *nix too

Server: perforce-panda.eng.vmware.com:1985

PR 520916: leak-check-only mode without losing accuracy

added -leaks_only option that does not do any shadowing
added -shadowing option as an internal debugging method for disabling everything except malloc instrumentation. it could be more thorough (e.g., don't need known_table, etc.) but should prove useful. this is what my original -leaks_only experiment used so I kept the code under -no_shadowing when I added stack zeroing.
split osshared{pre,post}_syscall() from osshadow{pre,post}_syscall() for more than just options.shadowing: also for PR 536878 where drheapstat will need fork-following
to rule out beyond-TOS, needs xsp in each thread at exit time: for that I needed PR 536058 in DR (diff sent earlier, included in this tree since needed to test)
to rule out stale stack entries, added zeroing on stack allocation by adapting drmem's adjust_esp shadow code
for locating live mallocs inside an unmapped heap arena, which is done using shadow info for regular drmem, I'm brute-force looking up every 8 bytes: but only on Windows since I'm assuming glibc malloc never unmaps an arena that contains live mallocs. xref PR 535568's attempt to replace the malloc table w/ an interval tree: too expensive!
added a -leaks-only test

PR 485354: # possible leaks nondet on nudge test on Windows My leaks-only test hit PR 485354 so I investigated:

it's coming from .pdata RVA entries
I have windows scans skipping read-only image sections now: PR 475518 covers doing that properly and on both plaforms by monitoring history

The beyond-TOS + zero-on-stack-alloc seems to be working in practice as far as I can tell (apps aren't exactly deterministic but results seem to match pretty well w/ full shadowing). Zeroing is not 100% transparent but it's close enough for me. We'll need more data on more apps but I think it's going to work out, and the overhead should be a strict subset of my original shadow-writes proposal. Unfortunately the overhead is still significant: 2.2x on crafty (vs 1.6x for DrHeapstat).

xref PR 536878: add leak checking feature to Dr. Heapstat a la Dr. Memory -leaks_only

future work under PR 539395: improve accuracy of -leaks_only:

issues w/ ignoring beyond-TOS w/ alternate stacks:
- if currently on alt sigstack will look beyond TOS on main stack
- if currently on main stack will look at whole alt stack could intercept signal + sigreturn.
are there uninit fields in kernel stack allocs for signal/Ki?
use an interval tree of mallocs (despite perf cost: see PR 535568) or some other way to improve accuracy and efficiency of locating live mallocs in an unmapped arena

Original issue: http://code.google.com/p/drmemory/issues/detail?id=151

derekbruening commented 10 years ago

From bruen...@google.com on January 12, 2012 11:37:45

wow looks like I pasted too much stuff into the initial entry text above

xref issue #568 (parallelize leak scan) xref issue #711 (improve callstack walk)

\ INFO breakdown of -leaks_only -no_zero_stack on ui_tests

this is NPAPITesterBase.NPObjectProxy

Process Name 64-bit Timer samples
ntoskrnl.exe 1 45.35
dynamorio.dll 22.47
drmemorylib.dll 12.31
dbghelp.dll 10.52
ntdll.dll 3.04
wow64.dll 1 2.89
wow64cpu.dll 1 2.26
Parity.sys 1 0.4
unknown module pid (6804) 1 0.27
9 modules, Total: 19377 samples, 99.51% of shown samples

CS:EIP Symbol + Offset Timer samples
0x2903a74 PsReturnProcessPagedPoolQuota 50.04
0x289b228 PsIsProtectedProcess 24.45
0x28cb490 KeSynchronizeExecution 10.79
0x2baa350 ObReferenceObjectByHandle 8.04
0x2916c24 KeQueryPriorityThread 1.12
0x28c65f0 memmove 0.88

CS:EIP Symbol + Offset Timer samples
0x6fe9b780 mutex_unlock 15.34
0x6fe9b740 mutex_trylock 14.65
0x6fee2470 dr_mutex_lock 12.55
0x6fee24d0 dr_mutex_unlock 11.06
0x6fe9d460 mutex_lock 10.17
0x6fe611f4 dynamorio_syscall_wow64 6.17
0x6fee7290 get_thread_private_dcontext 2.9
0x6fef70b0 query_virtual_memory 2.67
0x6fea24c0 common_heap_alloc 1.51
0x6fee7b70 query_memory_ex 1.28

CS:EIP Symbol + Offset Timer samples
0x6ffe7340 rb_in_node 59.78
0x6ffe5160 check_reachability_pointer 7.55
0x6ffee6c0 hashtable_lookup 7.05
0x6ffee4b0 hash_key 6.05
0x6ffd90c0 heap_region_bounds 6.01
0x6ffe5310 check_reachability_helper 3.75
0x6ffd6cf0 get_decoded_ptr 2.21
0x6ffee00a dr_mutex_unlock 1.67
0x6ffee010 dr_mutex_lock 1.63

CS:EIP Symbol + Offset Timer samples
0x54d730 MiniDumpReadDumpStream 99.51

analysis:

rb_in_node does highlight tree lookup as potential issue
however, rb_in_node is 60% of 12% => 7% total. kernel is 45%. I'm not sure if the symbols in the shown breakdown are useful b/c don't have private syms. TLB miss is handled in hardware for x86 so would only be in kernel on page miss. I'm wondering if most of this is page handling during the scan.

parallelizing seems most promising for large enough apps: no synch is needed on malloc lookup since read-only at that point. synch is needed when updating leak data structs but common case is a non-heap-pointer during scan.

Owner: bruen...@google.com
Labels: -Priority-Low -Type-Defect Priority-Medium Type-Enhancement Component-LeakCheck Performance

derekbruening commented 10 years ago

From bruen...@google.com on January 13, 2012 07:25:18

running entire unit_tests suite (minus tools/valgrind exclusions): RUN: -no_check_uninitialized [==========] 3554 tests from 577 test cases ran. (1636274 ms total) [==========] 3554 tests from 577 test cases ran. (1479432 ms total) [==========] 3554 tests from 577 test cases ran. (1518962 ms total) RUN: -no_check_uninitialized -no_leak_scan [==========] 3554 tests from 577 test cases ran. (1370891 ms total) [==========] 3554 tests from 577 test cases ran. (1409782 ms total) [==========] 3554 tests from 577 test cases ran. (1382568 ms total) RUN: -no_check_uninitialized -no_count_leaks [==========] 3554 tests from 577 test cases ran. (1198297 ms total) [==========] 3554 tests from 577 test cases ran. (1209558 ms total) [==========] 3554 tests from 577 test cases ran. (1260151 ms total)

getting the average in mins:

25.7482 -no_check_uninitialized 23.1291 -no_check_uninitialized -no_leak_scan 20.3778 -no_check_uninitialized -no_count_leaks => scan: 2.6 mins rest of leaks: 2.8 mins leak total: 5.4 mins = 26% perf hit

I wonder how many sub-processes there are, for the scan to take that long: bears investigation whether that's all on the final scan.

Suite log dir shows 9 sets of logdirs which makes sense since 3x3 runs. Each run has 10 unit_tests.exe logdirs. So it's 10 scans total that are taking the 2.6 mins. Probably not all equal, but average of 16s each which seems more reasonable.

derekbruening commented 10 years ago

From bruen...@google.com on January 13, 2012 07:33:39

leak overhead %scan %rest ui_tests 40% 78% 22% unit_tests 26% 48% 52%

derekbruening commented 10 years ago

From bruen...@google.com on January 17, 2012 08:25:38

more runs over weekend on windesk for some reason has -no_check_uninitialized slower: RUN: -no_check_uninitialized [==========] 3554 tests from 577 test cases ran. (1815211 ms total) [==========] 3554 tests from 577 test cases ran. (1891468 ms total) [==========] 3554 tests from 577 test cases ran. (1552920 ms total) RUN: -no_check_uninitialized -no_leak_scan [==========] 3554 tests from 577 test cases ran. (1450673 ms total) [==========] 3554 tests from 577 test cases ran. (1442520 ms total) [==========] 3554 tests from 577 test cases ran. (1438121 ms total) RUN: -no_check_uninitialized -no_count_leaks [==========] 3554 tests from 577 test cases ran. (1249905 ms total) [==========] 3554 tests from 577 test cases ran. (1222006 ms total) RUN: -no_check_uninitialized -no_count_leaks -no_use_symcache [==========] 3554 tests from 577 test cases ran. (3377504 ms total) [==========] 3554 tests from 577 test cases ran. (3350393 ms total) [==========] 3554 tests from 577 test cases ran. (3332369 ms total) RUN: -leaks_only [==========] 3554 tests from 577 test cases ran. (1090167 ms total) [==========] 3554 tests from 577 test cases ran. (1099449 ms total) [==========] 3554 tests from 577 test cases ran. (1095470 ms total) RUN: -leaks_only -no_zero_stack [==========] 3554 tests from 577 test cases ran. (1072919 ms total) [==========] 3554 tests from 577 test cases ran. (1077422 ms total) [==========] 3554 tests from 577 test cases ran. (1064477 ms total) RUN: -leaks_only -no_count_leaks [==========] 3554 tests from 577 test cases ran. (865069 ms total) [==========] 3554 tests from 577 test cases ran. (864243 ms total) [==========] 3554 tests from 577 test cases ran. (860628 ms total) RUN: -leaks_only -no_count_leaks -no_trackallocs [==========] 3554 tests from 577 test cases ran. (744878 ms total) [==========] 3554 tests from 577 test cases ran. (769345 ms total) [==========] 3554 tests from 577 test cases ran. (748245 ms total) RUN: native [==========] 3554 tests from 577 test cases ran. (704779 ms total) [==========] 3554 tests from 577 test cases ran. (580533 ms total) [==========] 3554 tests from 577 test cases ran. (572552 ms total) % perl -e 'while(){if(/^RUN: (.)/){$nr=$1;if($n>0){printf"%5.2f %-30s\n",$t/(601000$n),$run;}$run=$nr;$n=0;$t=0;}elsif(/(\d+) ms/){$t+=$1;$n++;}}if($n>0){printf"%5.2f %-30s\n",$t/(60_1000*$n),$run;}' 29.22 -no_check_uninitialized
24.06 -no_check_uninitialized -no_leak_scan 20.60 -no_check_uninitialized -no_count_leaks 55.89 -no_check_uninitialized -no_count_leaks -no_use_symcache 18.25 -leaks_only
17.86 -leaks_only -no_zero_stack
14.39 -leaks_only -no_count_leaks
12.57 -leaks_only -no_count_leaks -no_track_allocs 10.32 native

derekbruening commented 10 years ago

From bruen...@google.com on January 17, 2012 08:30:15

I ran like so:

for j in "-no_check_uninitialized" "-no_check_uninitialized -no_leak_scan" "-no_check_uninitialized -no_count_leaks" "-no_check_uninitialized -no_count_leaks -no_use_symcache" "-leaks_only" "-leaks_only -no_zero_stack" "-leaks_only -no_count_leaks" "-leaks_only -no_count_leaks -no_track_allocs"; do echo "RUN: $j" | tee -a OUT for ((i=0; i<3; i++)); do ~/drmemory/git/build_drmem_rel/bin/drmemory.exe $j -suppress 'e:\src\chromium\src\tools\valgrind\DrMemory\suppressions.txt' -quiet -batch -no_results_to_stderr -dr e:/src/dr/git/exports -- ./unit_tests.exe --gtest_filter="-P2PTransportImplTest.FAILS_SendDataTcp:ServiceProcessStateTest.FAILS_ForceShutdown:FirefoxImporterTest.FirefoxNSS3Decryptor:DownloadManagerTest.FLAKY_StartDownload:P2PTransportImplTest.Create:DownloadManagerTest.StartDownload:DownloadManagerTest.FAILS_StartDownload:P2PTransportImplTest.FLAKY_SendDataUdp:VisitedLinkRelayTest.Basics:ConnectionTesterTest.FLAKY_RunAllTests:MultiProcessLockTest.FAILS_RecursiveLock:SignedSettingsTest.FLAKY_StorePolicyNoPolicyData:ConnectionTesterTest.FAILS_RunAllTests:PredictorTest.FAILS_MassiveConcurrentLookupTest:P2PTransportImplTest.FAILS_ConnectTcp:RenderViewTest.ImeComposition:VisitedLinkEventsTest.FLAKY_Coalescense:VideoCaptureHostTest.:P2PTransportImplTest.FLAKY_SendDataTcp:P2PTransportImplTest.FAILS_ConnectUdp:ProcessWatcherTest.FLAKY_ImmediateTermination:P2PTransportImplTest.FAILS_Create:ProcessWatcherTest.ImmediateTermination:AudioRendererHostTest.:P2PTransportImplTest.ConnectUdp:RenderViewTest.FLAKY_ImeComposition:P2PTransportImplTest.FLAKY_ConnectUdp:RenderViewTest.FAILS_ImeComposition:P2PTransportImplTest.ConnectTcp:SignedSettingsTest.StorePolicyNoPolicyData:VisitedLinkRelayTest.FAILS_Basics:PredictorTest.FLAKY_MassiveConcurrentLookupTest:VisitedLinkEventsTest.FAILS_Coalescense:FirefoxImporterTest.FLAKY_FirefoxNSS3Decryptor:VisitedLinkEventsTest.Coalescense:P2PTransportImplTest.FLAKY_ConnectTcp:P2PTransportImplTest.SendDataTcp:P2PTransportImplTest.FLAKY_Create:P2PTransportImplTest.FAILS_SendDataUdp:ServiceProcessStateTest.FLAKY_ForceShutdown:PredictorTest.MassiveConcurrentLookupTest:MultiProcessLockTest.FLAKY_RecursiveLock:VisitedLinkRelayTest.FLAKY_Basics:ConnectionTesterTest.RunAllTests:ServiceProcessStateTest.ForceShutdown:P2PTransportImplTest.SendDataUdp:MultiProcessLockTest.RecursiveLock:ProcessWatcherTest.FAILS_ImmediateTermination:SignedSettingsTest.FAILS_StorePolicyNoPolicyData:FirefoxImporterTest.FAILS_Firefox*NSS3Decryptor" >> OUT done done grep -E '^RUN:|test cases? ran' OUT

derekbruening commented 10 years ago

From bruen...@google.com on December 12, 2012 09:30:57

xref issue #1096

DynamoRIO / drmemory

improve performance of pointer scan for leak reachability analysis #151