DynamoRIO / drmemory

Memory Debugger for Windows, Linux, Mac, and Android
Other
2.45k stars 263 forks source link

improve performance of pointer scan for leak reachability analysis #151

Open derekbruening opened 10 years ago

derekbruening commented 10 years ago

From derek.br...@gmail.com on December 10, 2010 17:58:04

PR 475518

my initial implementation of scanning for pointers for reachability analysis produces a noticeable 1-second delay at app exit (and at nudge time for PR 428709). this case covers trying to speed that up. it's not identifying the defined region that's slow, it's the pointer scan itself.

adding to this case a FIXME from my code that may improve perf noticeably: skipping regions that have been read-only since being loaded (.text, .rodata, etc.). we'll need extra tracking since today we can't tell whether they were made writable and then changed back to read-only. we could also skip writable regions that have not been written to since being loaded since they can't point into the heap. technically the app could write a heap pointer to a file and then mmap it in but we'll ignore that.

my initial imp is going to skip r-x regions. I'm not going to skip all non-writable though. this case covers adding modified-since-load info to both avoid false neg w/ r-x and to skip more non-writable.

it does take substantial time on large apps. we should probably fix up -no_count_leaks and document it, and perhaps make it default so that nudges can be used for just updating the error summary w/o waiting 8 minutes? just turning it off today => bugs: maybe somebody else needs op_record_allocs in alloc.c. we still record for pre_us.

for PR 485354 I have the leak scan on windows skipping read-only image regions: once this case checks for history that can be done for *nix too

Server: perforce-panda.eng.vmware.com:1985

PR 520916: leak-check-only mode without losing accuracy

PR 485354: # possible leaks nondet on nudge test on Windows My leaks-only test hit PR 485354 so I investigated:

The beyond-TOS + zero-on-stack-alloc seems to be working in practice as far as I can tell (apps aren't exactly deterministic but results seem to match pretty well w/ full shadowing). Zeroing is not 100% transparent but it's close enough for me. We'll need more data on more apps but I think it's going to work out, and the overhead should be a strict subset of my original shadow-writes proposal. Unfortunately the overhead is still significant: 2.2x on crafty (vs 1.6x for DrHeapstat).

xref PR 536878: add leak checking feature to Dr. Heapstat a la Dr. Memory -leaks_only

future work under PR 539395: improve accuracy of -leaks_only:

Original issue: http://code.google.com/p/drmemory/issues/detail?id=151

derekbruening commented 10 years ago

From bruen...@google.com on January 12, 2012 11:37:45

wow looks like I pasted too much stuff into the initial entry text above

xref issue #568 (parallelize leak scan) xref issue #711 (improve callstack walk)

\ INFO breakdown of -leaks_only -no_zero_stack on ui_tests

this is NPAPITesterBase.NPObjectProxy

Process Name 64-bit Timer samples
ntoskrnl.exe 1 45.35
dynamorio.dll 22.47
drmemorylib.dll 12.31
dbghelp.dll 10.52
ntdll.dll 3.04
wow64.dll 1 2.89
wow64cpu.dll 1 2.26
Parity.sys 1 0.4
unknown module pid (6804) 1 0.27
9 modules, Total: 19377 samples, 99.51% of shown samples

CS:EIP Symbol + Offset Timer samples
0x2903a74 PsReturnProcessPagedPoolQuota 50.04
0x289b228 PsIsProtectedProcess 24.45
0x28cb490 KeSynchronizeExecution 10.79
0x2baa350 ObReferenceObjectByHandle 8.04
0x2916c24 KeQueryPriorityThread 1.12
0x28c65f0 memmove 0.88

CS:EIP Symbol + Offset Timer samples
0x6fe9b780 mutex_unlock 15.34
0x6fe9b740 mutex_trylock 14.65
0x6fee2470 dr_mutex_lock 12.55
0x6fee24d0 dr_mutex_unlock 11.06
0x6fe9d460 mutex_lock 10.17
0x6fe611f4 dynamorio_syscall_wow64 6.17
0x6fee7290 get_thread_private_dcontext 2.9
0x6fef70b0 query_virtual_memory 2.67
0x6fea24c0 common_heap_alloc 1.51
0x6fee7b70 query_memory_ex 1.28

CS:EIP Symbol + Offset Timer samples
0x6ffe7340 rb_in_node 59.78
0x6ffe5160 check_reachability_pointer 7.55
0x6ffee6c0 hashtable_lookup 7.05
0x6ffee4b0 hash_key 6.05
0x6ffd90c0 heap_region_bounds 6.01
0x6ffe5310 check_reachability_helper 3.75
0x6ffd6cf0 get_decoded_ptr 2.21
0x6ffee00a dr_mutex_unlock 1.67
0x6ffee010 dr_mutex_lock 1.63

CS:EIP Symbol + Offset Timer samples
0x54d730 MiniDumpReadDumpStream 99.51

analysis:

parallelizing seems most promising for large enough apps: no synch is needed on malloc lookup since read-only at that point. synch is needed when updating leak data structs but common case is a non-heap-pointer during scan.

Owner: bruen...@google.com
Labels: -Priority-Low -Type-Defect Priority-Medium Type-Enhancement Component-LeakCheck Performance

derekbruening commented 10 years ago

From bruen...@google.com on January 13, 2012 07:25:18

running entire unit_tests suite (minus tools/valgrind exclusions): RUN: -no_check_uninitialized [==========] 3554 tests from 577 test cases ran. (1636274 ms total) [==========] 3554 tests from 577 test cases ran. (1479432 ms total) [==========] 3554 tests from 577 test cases ran. (1518962 ms total) RUN: -no_check_uninitialized -no_leak_scan [==========] 3554 tests from 577 test cases ran. (1370891 ms total) [==========] 3554 tests from 577 test cases ran. (1409782 ms total) [==========] 3554 tests from 577 test cases ran. (1382568 ms total) RUN: -no_check_uninitialized -no_count_leaks [==========] 3554 tests from 577 test cases ran. (1198297 ms total) [==========] 3554 tests from 577 test cases ran. (1209558 ms total) [==========] 3554 tests from 577 test cases ran. (1260151 ms total)

getting the average in mins:

25.7482 -no_check_uninitialized 23.1291 -no_check_uninitialized -no_leak_scan 20.3778 -no_check_uninitialized -no_count_leaks => scan: 2.6 mins rest of leaks: 2.8 mins leak total: 5.4 mins = 26% perf hit

I wonder how many sub-processes there are, for the scan to take that long: bears investigation whether that's all on the final scan.

Suite log dir shows 9 sets of logdirs which makes sense since 3x3 runs. Each run has 10 unit_tests.exe logdirs. So it's 10 scans total that are taking the 2.6 mins. Probably not all equal, but average of 16s each which seems more reasonable.

derekbruening commented 10 years ago

From bruen...@google.com on January 13, 2012 07:33:39

leak overhead %scan %rest ui_tests 40% 78% 22% unit_tests 26% 48% 52%

derekbruening commented 10 years ago

From bruen...@google.com on January 17, 2012 08:25:38

more runs over weekend on windesk for some reason has -no_check_uninitialized slower: RUN: -no_check_uninitialized [==========] 3554 tests from 577 test cases ran. (1815211 ms total) [==========] 3554 tests from 577 test cases ran. (1891468 ms total) [==========] 3554 tests from 577 test cases ran. (1552920 ms total) RUN: -no_check_uninitialized -no_leak_scan [==========] 3554 tests from 577 test cases ran. (1450673 ms total) [==========] 3554 tests from 577 test cases ran. (1442520 ms total) [==========] 3554 tests from 577 test cases ran. (1438121 ms total) RUN: -no_check_uninitialized -no_count_leaks [==========] 3554 tests from 577 test cases ran. (1249905 ms total) [==========] 3554 tests from 577 test cases ran. (1222006 ms total) RUN: -no_check_uninitialized -no_count_leaks -no_use_symcache [==========] 3554 tests from 577 test cases ran. (3377504 ms total) [==========] 3554 tests from 577 test cases ran. (3350393 ms total) [==========] 3554 tests from 577 test cases ran. (3332369 ms total) RUN: -leaks_only [==========] 3554 tests from 577 test cases ran. (1090167 ms total) [==========] 3554 tests from 577 test cases ran. (1099449 ms total) [==========] 3554 tests from 577 test cases ran. (1095470 ms total) RUN: -leaks_only -no_zero_stack [==========] 3554 tests from 577 test cases ran. (1072919 ms total) [==========] 3554 tests from 577 test cases ran. (1077422 ms total) [==========] 3554 tests from 577 test cases ran. (1064477 ms total) RUN: -leaks_only -no_count_leaks [==========] 3554 tests from 577 test cases ran. (865069 ms total) [==========] 3554 tests from 577 test cases ran. (864243 ms total) [==========] 3554 tests from 577 test cases ran. (860628 ms total) RUN: -leaks_only -no_count_leaks -no_trackallocs [==========] 3554 tests from 577 test cases ran. (744878 ms total) [==========] 3554 tests from 577 test cases ran. (769345 ms total) [==========] 3554 tests from 577 test cases ran. (748245 ms total) RUN: native [==========] 3554 tests from 577 test cases ran. (704779 ms total) [==========] 3554 tests from 577 test cases ran. (580533 ms total) [==========] 3554 tests from 577 test cases ran. (572552 ms total) % perl -e 'while(){if(/^RUN: (.)/){$nr=$1;if($n>0){printf"%5.2f %-30s\n",$t/(601000$n),$run;}$run=$nr;$n=0;$t=0;}elsif(/(\d+) ms/){$t+=$1;$n++;}}if($n>0){printf"%5.2f %-30s\n",$t/(60_1000*$n),$run;}' 29.22 -no_check_uninitialized
24.06 -no_check_uninitialized -no_leak_scan 20.60 -no_check_uninitialized -no_count_leaks 55.89 -no_check_uninitialized -no_count_leaks -no_use_symcache 18.25 -leaks_only
17.86 -leaks_only -no_zero_stack
14.39 -leaks_only -no_count_leaks
12.57 -leaks_only -no_count_leaks -no_track_allocs 10.32 native

derekbruening commented 10 years ago

From bruen...@google.com on January 17, 2012 08:30:15

I ran like so:

for j in "-no_check_uninitialized" "-no_check_uninitialized -no_leak_scan" "-no_check_uninitialized -no_count_leaks" "-no_check_uninitialized -no_count_leaks -no_use_symcache" "-leaks_only" "-leaks_only -no_zero_stack" "-leaks_only -no_count_leaks" "-leaks_only -no_count_leaks -no_track_allocs"; do echo "RUN: $j" | tee -a OUT for ((i=0; i<3; i++)); do ~/drmemory/git/build_drmem_rel/bin/drmemory.exe $j -suppress 'e:\src\chromium\src\tools\valgrind\DrMemory\suppressions.txt' -quiet -batch -no_results_to_stderr -dr e:/src/dr/git/exports -- ./unit_tests.exe --gtest_filter="-P2PTransportImplTest.FAILS_SendDataTcp:ServiceProcessStateTest.FAILS_ForceShutdown:FirefoxImporterTest.FirefoxNSS3Decryptor:DownloadManagerTest.FLAKY_StartDownload:P2PTransportImplTest.Create:DownloadManagerTest.StartDownload:DownloadManagerTest.FAILS_StartDownload:P2PTransportImplTest.FLAKY_SendDataUdp:VisitedLinkRelayTest.Basics:ConnectionTesterTest.FLAKY_RunAllTests:MultiProcessLockTest.FAILS_RecursiveLock:SignedSettingsTest.FLAKY_StorePolicyNoPolicyData:ConnectionTesterTest.FAILS_RunAllTests:PredictorTest.FAILS_MassiveConcurrentLookupTest:P2PTransportImplTest.FAILS_ConnectTcp:RenderViewTest.ImeComposition:VisitedLinkEventsTest.FLAKY_Coalescense:VideoCaptureHostTest.:P2PTransportImplTest.FLAKY_SendDataTcp:P2PTransportImplTest.FAILS_ConnectUdp:ProcessWatcherTest.FLAKY_ImmediateTermination:P2PTransportImplTest.FAILS_Create:ProcessWatcherTest.ImmediateTermination:AudioRendererHostTest.:P2PTransportImplTest.ConnectUdp:RenderViewTest.FLAKY_ImeComposition:P2PTransportImplTest.FLAKY_ConnectUdp:RenderViewTest.FAILS_ImeComposition:P2PTransportImplTest.ConnectTcp:SignedSettingsTest.StorePolicyNoPolicyData:VisitedLinkRelayTest.FAILS_Basics:PredictorTest.FLAKY_MassiveConcurrentLookupTest:VisitedLinkEventsTest.FAILS_Coalescense:FirefoxImporterTest.FLAKY_FirefoxNSS3Decryptor:VisitedLinkEventsTest.Coalescense:P2PTransportImplTest.FLAKY_ConnectTcp:P2PTransportImplTest.SendDataTcp:P2PTransportImplTest.FLAKY_Create:P2PTransportImplTest.FAILS_SendDataUdp:ServiceProcessStateTest.FLAKY_ForceShutdown:PredictorTest.MassiveConcurrentLookupTest:MultiProcessLockTest.FLAKY_RecursiveLock:VisitedLinkRelayTest.FLAKY_Basics:ConnectionTesterTest.RunAllTests:ServiceProcessStateTest.ForceShutdown:P2PTransportImplTest.SendDataUdp:MultiProcessLockTest.RecursiveLock:ProcessWatcherTest.FAILS_ImmediateTermination:SignedSettingsTest.FAILS_StorePolicyNoPolicyData:FirefoxImporterTest.FAILS_Firefox*NSS3Decryptor" >> OUT done done grep -E '^RUN:|test cases? ran' OUT

derekbruening commented 10 years ago

From bruen...@google.com on December 12, 2012 09:30:57

xref issue #1096