DynamoRIO / drmemory

Memory Debugger for Windows, Linux, Mac, and Android

reduce overhead of malloc tracking #460

Open derekbruening opened 9 years ago

derekbruening commented 9 years ago

From bruen...@google.com on June 16, 2011 10:24:33

below are perf numbers for -leaks_only -no_count_leaks vs native.

-no_count_leaks disables stack zeroing, so this is the cost of malloc tracking (w/o recording callstacks): this is way too high! this is one barrier to higher performance with -no_check_uninitialized.

```
spec2k6cmpave namedres/leaksonly-nocount-ref/ namedres/native-ref/
400.perlbench    3.77  ( 1897 /  503)
401.bzip2        1.10  (  951 /  865)
403.gcc          1.75  (  750 /  429)
429.mcf          1.05  (  393 /  373)
445.gobmk        1.58  ( 1009 /  639)
456.hmmer        1.09  ( 1301 / 1195)
458.sjeng        1.77  ( 1381 /  782)
462.libquantum   1.07  (  793 /  739)
464.h264ref      1.67  ( 1751 / 1049)
471.omnetpp      3.72  ( 1559 /  419)
473.astar        1.32  (  874 /  660)
483.xalancbmk    3.81  ( 1331 /  349)
410.bwaves       1.17  (  875 /  751)
416.gamess       1.16  ( 1610 / 1392)
433.milc         1.14  (  706 /  620)
434.zeusmp       1.14  (  910 /  795)
435.gromacs      1.02  ( 1198 / 1170)
436.cactusADM    1.02  ( 1653 / 1617)
437.leslie3d     1.04  (  851 /  816)
444.namd         1.03  (  761 /  741)
447.dealII       2.17  ( 1299 /  599)
450.soplex       1.21  (  488 /  404)
453.povray       1.74  (  627 /  360)
454.calculix     1.08  ( 1616 / 1497)
459.GemsFDTD     1.05  (  983 /  935)
465.tonto        4.29  ( 3613 /  842)
470.lbm          1.23  (  795 /  648)
481.wrf          1.95  ( 2459 / 1258)
482.sphinx3      1.13  (  840 /  741)
average          1.66
```

back when we first created Dr. Heapstat I spent some time optimizing malloc tracking, including recording callstacks, and I sure thought I had performance better than this (I was looking at spec2000 and some malloc-intensive benchmarks like roboop and cfrac). something could have regressed.

from profiling, some of this cost is from maintaining the hashtables and delay-free queue, much of which could go away w/ malloc replacement b/c the header could be used more freely: or perhaps a larger redzone could be used with more data than just the size stored directly in it. early injection would make that much easier, since today no-redzone allocs must be handled: one reason for the current design.
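For illustration, a minimal sketch of that header/redzone idea under the malloc-replacement model: the size (and whatever else fits) lives in a tool-owned header in front of the app block, so size queries become pointer arithmetic instead of a locked hashtable lookup. All names and sizes below are hypothetical, not Dr. Memory's actual layout:

```c
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical layout: metadata lives in a redzone/header in front of the
 * app block.  This is only sound if *every* allocation has the header --
 * hence the early-injection point above: allocations made before the tool
 * takes over ("pre-us", no redzone) would still need fallback handling.
 */
#define REDZONE_SZ 16u
#define HEADER_MAGIC 0x52445A4Eu  /* "RDZN": marks tool-owned headers */

typedef struct {
    uint32_t magic;
    uint32_t app_size;
    /* remaining redzone bytes: callstack id, alloc type, fill pattern... */
} alloc_header_t;

static void *replaced_malloc(size_t size)
{
    alloc_header_t *hdr = malloc(REDZONE_SZ + size + REDZONE_SZ);
    if (hdr == NULL)
        return NULL;
    hdr->magic = HEADER_MAGIC;
    hdr->app_size = (uint32_t)size;
    return (char *)hdr + REDZONE_SZ;  /* app sees only its own region */
}

static size_t replaced_malloc_size(void *p)
{
    alloc_header_t *hdr = (alloc_header_t *)((char *)p - REDZONE_SZ);
    /* O(1), lock-free, no hashtable */
    return hdr->magic == HEADER_MAGIC ? hdr->app_size : 0;
}

static void replaced_free(void *p)
{
    if (p != NULL)
        free((char *)p - REDZONE_SZ);
}
```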

Original issue: http://code.google.com/p/drmemory/issues/detail?id=460

derekbruening commented 9 years ago

From bruen...@google.com on December 07, 2011 08:43:30

studying cfrac on windows:

symcache has no effect as 2nd run is no faster than 1st:

```
% /usr/bin/time ~/drmemory/git/build_drmem_rel/bin/drmemory.exe -dr c:/src/dr/git/exports -batch -- ./cfrac.exe 41757646344123832613190542166099121
0.01user 0.01system 2:23.43elapsed 0%CPU (0avgtext+0avgdata 238336maxresident)k
0.00user 0.00system 2:25.32elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
```

```
% for i in "-no_count_leaks" "-no_check_uninitialized" \
      "-no_check_uninitialized -no_count_leaks" \
      "-leaks_only -no_count_leaks -no_zero_stack" "-perturb_only" \
      "-leaks_only -no_count_leaks -no_zero_stack -no_track_allocs"; do
    echo $i
    /usr/bin/time ~/drmemory/git/build_drmem_rel/bin/drmemory.exe $i -quiet \
        -dr c:/src/dr/git/exports -batch -- ./cfrac.exe 41757646344123832613190542166099121
  done
-no_count_leaks
0.00user 0.00system 1:18.67elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
-no_check_uninitialized
0.00user 0.01system 1:46.22elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
-no_check_uninitialized -no_count_leaks
0.00user 0.03system 0:45.65elapsed 0%CPU (0avgtext+0avgdata 234240maxresident)k
-leaks_only -no_count_leaks -no_zero_stack
0.00user 0.00system 0:35.09elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
-perturb_only
0.00user 0.03system 0:10.37elapsed 0%CPU (0avgtext+0avgdata 233728maxresident)k
-leaks_only -no_count_leaks -no_zero_stack -no_track_allocs
0.00user 0.01system 0:03.44elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
```

-leaks_only -no_count_leaks -no_zero_stack + empty hook functions (alloc_hook, alloc_syscall_hook, and handle_alloc_post; though I had to give handle_alloc_post a conditional rather than leave it entirely empty, else it crashed b/c of some optimization):

```
0.00user 0.00system 0:06.41elapsed 0%CPU (0avgtext+0avgdata 234240maxresident)k
```

```
% /usr/bin/time ~/dr/git/exports/bin32/drrun.exe -quiet ./cfrac.exe 41757646344123832613190542166099121
0.01user 0.01system 0:02.50elapsed 1%CPU (0avgtext+0avgdata 238592maxresident)k
```

```
% /usr/bin/time ./cfrac.exe 41757646344123832613190542166099121
0.00user 0.01system 0:01.67elapsed 0%CPU (0avgtext+0avgdata 234496maxresident)k
```

I have a series of 10 or so optimizations including DRi#632 that get the 35 seconds down to about 24 seconds.

derekbruening commented 9 years ago

From bruen...@google.com on December 07, 2011 13:12:48

I have it down to 20 seconds now, 14s more than just clean calls; but note that leak scan/callstack time dwarfs that. clean calls seem to incur 2x here. malloc interception was 5x on top of that: now 3x.
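For context, a clean call is DR's mechanism for calling an arbitrary C function from the code cache, with a full machine-state save/restore around it; that fixed cost is paid on every invocation even by an empty callee. A minimal sketch of a client that makes one clean call per basic block (illustrative; it uses the real dr_insert_clean_call API and current client entry point, but is not Dr. Memory's instrumentation):

```c
#include "dr_api.h"

static void at_bb(void)
{
    /* intentionally empty: the cost is the state save/restore to get here */
}

static dr_emit_flags_t event_bb(void *drcontext, void *tag, instrlist_t *bb,
                                bool for_trace, bool translating)
{
    /* insert one clean call before the first instruction of each block */
    dr_insert_clean_call(drcontext, bb, instrlist_first(bb),
                         (void *)at_bb, false /*no fp state*/, 0);
    return DR_EMIT_DEFAULT;
}

DR_EXPORT void dr_client_main(client_id_t id, int argc, const char *argv[])
{
    dr_register_bb_event(event_bb);
}
```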

derekbruening commented 9 years ago

From bruen...@google.com on December 07, 2011 13:21:19

-no_check_uninitialized w/ leak callstacks but no leak scan:

```
0.00user 0.01system 1:32.35elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
```

so the scan takes 14s; callstacks take 47s!

derekbruening commented 9 years ago

From bruen...@google.com on December 07, 2011 13:46:25

strike comment 3: that was measured w/ a dozen different new optimizations, so it needs a new base to compare to. the leak scan is not an issue.

this is with A through L (my internal names for the opts in my notes) so have to re-run base:

-no_check_uninitialized:
```
0.00user 0.03system 1:25.19elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 1:25.75elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 1:19.90elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.01user 0.00system 1:26.43elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
```

-no_check_uninitialized w/ leak callstacks but no leak scan:
```
0.00user 0.01system 1:32.35elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 1:29.23elapsed 0%CPU (0avgtext+0avgdata 234240maxresident)k
0.00user 0.00system 1:32.88elapsed 0%CPU (0avgtext+0avgdata 234240maxresident)k
```

-no_check_uninitialized -no_count_leaks:
```
0.00user 0.00system 0:27.48elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
```

strange, getting slower times w/o the scan. but the point here is that it's the callstacks that matter, not the scan. the callstacks cost a full minute, over a 3x slowdown. that's not what this case is about: but after hitting the low-hanging fruit here it may not be worth further optimization before optimizing callstacks (if possible); plus further opts here may require -early_injection.

-leaks_only -no_zero_stack (so malloc intercept + callstacks):
```
0.00user 0.00system 1:10.29elapsed 0%CPU (0avgtext+0avgdata 233728maxresident)k
0.00user 0.00system 1:14.33elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 1:11.55elapsed 0%CPU (0avgtext+0avgdata 234240maxresident)k
```

-leaks_only -no_count_leaks (so just malloc intercept):
```
0.00user 0.00system 0:18.65elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 0:18.31elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.03system 0:18.57elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
```

derekbruening commented 9 years ago

From bruen...@google.com on December 07, 2011 14:18:34

-leaks_only -no_zero_stack -callstack_max_frames 6:
```
0.00user 0.01system 0:59.23elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.03system 1:03.00elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.01user 0.00system 0:56.94elapsed 0%CPU (0avgtext+0avgdata 234240maxresident)k
```

-leaks_only -no_zero_stack -callstack_max_frames 3:
```
0.00user 0.01system 0:35.73elapsed 0%CPU (0avgtext+0avgdata 233728maxresident)k
0.00user 0.00system 0:35.27elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 0:34.64elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
```

-leaks_only -no_zero_stack -delay_frees 0:
```
0.00user 0.00system 1:14.24elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 1:12.00elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 1:14.45elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
```
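(-delay_frees 0 disables the delayed-free quarantine, and the times above barely move, so its bookkeeping is evidently not where the time goes here.) For reference, the delay-free queue mentioned earlier is essentially a FIFO quarantine of freed blocks; a minimal sketch, with an arbitrary capacity, not Dr. Memory's implementation:

```c
#include <stdlib.h>

/* Hypothetical sketch of a delay-free ("quarantine") queue: freed blocks
 * are parked in a FIFO so a use-after-free touches stale, tool-owned
 * memory rather than a recycled allocation.  Every free pays for this
 * bookkeeping, which is part of the tracking overhead measured here.
 */
#define DELAY_QUEUE_CAP 2000

static void *delay_queue[DELAY_QUEUE_CAP];
static int delay_head;   /* index of oldest entry */
static int delay_count;

static void delayed_free(void *p)
{
    if (delay_count == DELAY_QUEUE_CAP) {
        /* queue full: really release the oldest block */
        free(delay_queue[delay_head]);
        delay_head = (delay_head + 1) % DELAY_QUEUE_CAP;
        delay_count--;
    }
    delay_queue[(delay_head + delay_count) % DELAY_QUEUE_CAP] = p;
    delay_count++;
}
```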

derekbruening commented 9 years ago

From bruen...@google.com on December 07, 2011 17:53:55

note that the above callstack walking is on cfrac built /Ox so not really fair: haven't looked in detail but it's probably having to scan frequently b/c of FPO. I will re-analyze w/ /Oy-.

I have CodeAnalyst data for a series of optimizations I labeled A through L in my notes and will commit separately. Probably not worth pasting all that in here.

I will paste here at least the two endpoints.

Wall-clock-time-wise: I've cut malloc interception cost in half. Dropped cfrac malloc-intercept-only from 35s to 19s, comparing to the 3.4s no-intercept base:

```
(gdb) p (19.-3.4)/(35.-3.4)
$3 = 0.49367088607594939
```

Prior to optimizations A through L:

```
Process Name 64-bit Timer samples
dynamorio.dll 40.47
drmemorylib.dll 35.57
unknown module pid (6452) 1 18.84
hal.dll 1 2.2
ntdll.dll 1.24
ntoskrnl.exe 1 0.73
6 modules, Total: 34342 samples, 99.04% of shown samples
```

dynamorio:
```
CS:EIP Symbol + Offset Timer samples
0x68c268b0 priv_mcontext_to_dr_mcontext 17.13
0x68c0b740 mutex_trylock 16.46
0x68c0b780 mutex_unlock 14.06
0x68c522c0 dr_mutex_lock 9.07
0x68c57120 get_thread_private_dcontext 6.66
0x68c52320 dr_mutex_unlock 4.48
0x68c52db0 dr_get_tls_field 4.4
0x68c124c0 common_heap_alloc 3.96
0x68c0d460 mutex_lock 3.88
0x68c54b60 dr_get_mcontext 3.59
0x68c0b830 release_recursive_lock 2.34
0x68c12750 common_heap_free 2.27
0x68c26880 dr_mcontext_to_priv_mcontext 1.58
0x68c12bb0 heap_alloc 1.2
0x68bd1287 dr_setjmp 1.18
0x68c0d4c0 acquire_recursive_lock 0.91
0x68c52080 dr_try_stop 0.86
0x68c52050 dr_try_setup 0.67
0x68c52d90 dr_get_current_drcontext 0.66
0x68c54be0 dr_set_mcontext 0.61
0x68c697a0 get_thread_id 0.58
0x68c129c0 global_heap_alloc 0.55
0x68c129e0 global_heap_free 0.48
0x68c12bf0 heap_free 0.46
0x68c52da0 dr_get_thread_id 0.45
0x68c12950 common_global_heap_free 0.32
0x68bd1280 dr_try_start 0.29
0x68c128f0 common_global_heap_alloc 0.26
28 functions, 287 instructions, Total: 13942 samples, 99.36% of shown samples, 9.68% of total session samples
```

drmemorylib:
```
CS:EIP Symbol + Offset Timer samples
0x68b2cf30 hashtable_lookup 19.98
0x68b2cd20 hash_key 6.88
0x68b16db0 handle_alloc_pre_ex 6.1
0x68b2cdb0 keys_equal 5.38
0x68b12a10 get_alloc_entry 4.65
0x68b17240 handle_alloc_post_func 4.13
0x68b17570 alloc_hook 3.91
0x68b17470 handle_alloc_post 3.25
0x68b163a0 handle_free_pre 3.2
0x68b136b0 redzone_size 2.15
0x68b15ad0 handle_malloc_pre 2.08
0x68b15570 enter_heap_routine 1.98
0x68b155b0 check_recursive_same_sequence 1.74
0x68b24f00 safe_read 1.72
0x68b12900 is_alloc_prepost_routine 1.56
0x68b2c740 dr_get_current_drcontext 1.39
0x68b128c0 is_alloc_routine 1.26
0x68b2d1c0 hashtable_add_replace 1.26
0x68b167b0 handle_malloc_post 1.24
0x68b2c84e dr_mutex_unlock 1.18
0x68b12860 is_rtl_routine 1.03
0x68b157e0 record_allocator 0.98
0x68b2c854 dr_mutex_lock 0.97
0x68b14330 malloc_lock_held_by_self 0.91
0x68b166d0 adjust_alloc_result 0.85
0x68b12770 is_free_routine 0.83
0x68b143e0 malloc_add_common 0.83
0x68b2d360 hashtable_remove 0.82
0x68b14360 malloc_lock 0.79
0x68b2c728 dr_get_tls_field 0.79
0x68b154c0 get_retaddr_at_entry 0.74
0x68b2cf10 hashtable_lock 0.71
0x68b15800 handle_free_check_mismatch 0.7
0x68b2cf20 hashtable_unlock 0.69
0x68b156b0 set_auxarg 0.66
0x68b2c73a dr_get_thread_id 0.61
0x68b12b90 is_replace_routine 0.58
0x68b1b120 client_handle_malloc 0.51
0x68b12790 is_malloc_routine 0.5
0x68b1af10 shared_callstack_free 0.46
0x68afe320 client_exiting_heap_routine 0.45
0x68b158d0 handle_free_post 0.45
0x68b14510 malloc_entry_remove 0.44
0x68afe300 client_entering_heap_routine 0.4
0x68b148a0 malloc_size 0.37
0x68b2c76a dr_get_mcontext 0.37
0x68b1b400 client_handle_free 0.36
0x68b25a00 global_free 0.36
0x68b2cfa0 hashtable_check_for_resize 0.36
0x68b14730 malloc_alloc_type 0.35
0x68b15530 malloc_is_native 0.34
0x68b143b0 malloc_lock_if_not_held_by_me 0.33
0x68b136e0 get_size_from_app_routine 0.32
0x68b13bd0 malloc_allocator_type 0.31
0x68b259f0 global_alloc 0.31
0x68b16650 get_alloc_real_size 0.3
0x68b25b40 hashwrap_alloc 0.27
0x68b13980 malloc_hash 0.26
0x68b12730 is_size_routine 0.24
0x68b12ab0 size_func_in_set 0.24
0x68b16250 get_alloc_size 0.24
0x68b127b0 is_realloc_routine 0.19
0x68b13960 malloc_entry_free 0.19
0x68b143d0 malloc_unlock_if_locked_by_me 0.19
0x68b14710 malloc_entry_is_pre_us 0.19
0x68b23b30 leak_handle_alloc 0.19
0x68b14390 malloc_unlock 0.18
0x68b15ac0 size_plus_redzone_overflow 0.16
68 functions, 1009 instructions, Total: 12125 samples, 98.31% of shown samples, 8.42% of total session samples
```

after:

keep in mind that these are percentages (would be nice to have a column of absolute hits as well): so while dr_get_tls_field() was actually cut in half, the percentage here looks higher b/c it's on a smaller total # of samples:

```
Process Name 64-bit Timer samples
drmemorylib.dll 36.53
unknown module pid (7060) 1 32.76
dynamorio.dll 26.37
hal.dll 1 1.45
ntdll.dll 0.74
msvcr90.dll 0.67
ntoskrnl.exe 1 0.52
7 modules, Total: 19219 samples, 99.04% of shown samples
```

dynamorio:
```
CS:EIP Symbol + Offset Timer samples
0x6ea0b740 mutex_trylock 15.59
0x6ea0b780 mutex_unlock 8.5
0x6ea0b830 release_recursive_lock 7.95
0x6ea0b890 read_lock 7.7
0x6ea52f90 dr_get_tls_field 7.52
0x6ea0b9f0 read_unlock 6.14
0x6ea524b0 dr_mutex_lock 6.04
0x6ea572a0 get_thread_private_dcontext 5.24
0x6ea54c60 dr_get_mcontext_priv 4.67
0x6ea124c0 common_heap_alloc 4.57
0x6ea26970 priv_mcontext_to_dr_mcontext 3.48
0x6ea0d460 mutex_lock 3.46
0x6ea52510 dr_mutex_unlock 2.75
0x6ea12750 common_heap_free 2.44
0x6ea698f0 get_thread_id 1.37
0x6ea0d4c0 acquire_recursive_lock 1.29
0x6ea52f70 dr_get_current_drcontext 1.25
0x6ea268d0 dr_mcontext_to_priv_mcontext 1.19
0x6ea129e0 global_heap_free 1.17
0x6ea52f80 dr_get_thread_id 1.15
0x6ea54d70 dr_set_mcontext 1.13
0x6ea128f0 common_global_heap_alloc 1.07
0x6ea54d50 dr_get_mcontext 0.98
23 functions, 238 instructions, Total: 4947 samples, 96.66% of shown samples, 6.19% of total session samples
```
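To make the percentages-vs-absolute caveat concrete, rough arithmetic from the shown figures (the profile only reports percentages, so these absolute counts are estimates):

```
dr_get_tls_field before: 4.4%  of 13942 samples ≈ 613 hits
dr_get_tls_field after:  7.52% of  4947 samples ≈ 372 hits
```

the listed percentage went up even though the absolute hit count went down.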

drmemorylib:
```
CS:EIP Symbol + Offset Timer samples
0x6fedce00 hashtable_lookup 10.51
0x6fec6f70 handle_alloc_pre_ex 9.07
0x6fec75f0 handle_alloc_post 6.04
0x6fedcbf0 hash_key 5.8
0x6fec73c0 handle_alloc_post_func 5.7
0x6fec7710 alloc_hook 4.71
0x6fec6580 handle_free_pre 4.09
0x6fedcc80 keys_equal 3.91
0x6fec5730 check_recursive_same_sequence 3.81
0x6fec56d0 enter_heap_routine 3.36
0x6fec5c40 handle_malloc_pre 3.15
0x6fec3780 redzone_size 2.93
0x6fec5980 record_allocator 2.2
0x6fec6970 handle_malloc_post 1.69
0x6fec46d0 malloc_entry_remove 1.62
0x6fec45a0 malloc_add_common 1.5
0x6fedc600 dr_get_current_drcontext 1.45
0x6fec2990 is_rtl_routine 1.44
0x6fedd090 hashtable_add_replace 1.33
0x6fec4520 malloc_lock 1.31
0x6fec68a0 adjust_alloc_result 1.3
0x6fedd230 hashtable_remove 1.19
0x6fec28a0 is_free_routine 1.11
0x6fec59a0 handle_free_check_mismatch 1.02
0x6fec44f0 malloc_lock_held_by_self 1
0x6fec5850 set_auxarg 0.97
0x6fedc5e8 dr_get_tls_field 0.87
0x6fec48f0 malloc_alloc_type 0.76
0x6fecb470 client_handle_free 0.73
0x6fedcde0 hashtable_lock 0.68
0x6fec4a60 malloc_size 0.66
0x6fecb190 client_handle_malloc 0.65
0x6fedcdf0 hashtable_unlock 0.63
0x6fec4570 malloc_lock_if_not_held_by_me 0.62
0x6fed5880 global_alloc 0.59
0x6fed5890 global_free 0.58
0x6fec5690 malloc_is_native 0.54
0x6fec2c50 is_replace_routine 0.52
38 functions, 726 instructions, Total: 6383 samples, 90.05% of shown samples, 7.99% of total session samples
```

derekbruening commented 9 years ago

From bruen...@google.com on December 07, 2011 17:56:51

List of optimizations and other changes:

derekbruening commented 9 years ago

From bruen...@google.com on December 07, 2011 19:15:36

indeed /Ox is scanning a lot:

```
app mallocs: 10890330, frees: 10890127, large mallocs: 0
unique malloc stacks: 7054500
callstack fp scans: 61058402
callstack is_retaddr: 924470226, backdecode: 153355929, unreadable: 0
```

/Ox /Oy-:

```
app mallocs: 10890330, frees: 10890127, large mallocs: 0
unique malloc stacks: 7050289
callstack fp scans: 0
callstack is_retaddr: 10890130, backdecode: 10890130, unreadable: 0
```

(unfortunately on linux there is only 1 scan, and adding -fno-omit-frame-pointer makes no difference)
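To see why FPO forces scanning: with frame pointers intact, each frame costs two reads; once the chain breaks, the walker must test successive stack slots as candidate return addresses, which is what the "fp scans" / "is_retaddr" / "backdecode" counters above are counting. A hedged sketch of that fallback logic, with assumed is_retaddr/safe_read helpers, not Dr. Memory's actual walker:

```c
#include <stddef.h>
#include <stdbool.h>

/* assumed helpers (hypothetical signatures):
 * is_retaddr: does pc point just after a call?  may require decoding
 *             backward from the candidate ("backdecode" in the stats).
 * safe_read:  copy sz bytes from addr, failing gracefully if unmapped.
 */
extern bool is_retaddr(void *pc);
extern bool safe_read(void *addr, size_t sz, void *out);

static int walk_callstack(void *ebp, void **frames, int max_frames)
{
    int n = 0;
    void **fp = (void **)ebp;
    void **scan = (void **)ebp;  /* cursor for the slow path */
    while (n < max_frames) {
        void *slots[2];          /* {saved ebp, return address} */
        if (fp != NULL && safe_read(fp, sizeof(slots), slots) &&
            is_retaddr(slots[1])) {
            /* fast path: fp chain intact, two reads per frame */
            frames[n++] = slots[1];
            scan = fp + 2;
            fp = (void **)slots[0];
        } else {
            /* slow path (an "fp scan"): test successive stack slots
             * until one looks like a return address */
            void *val;
            do {
                scan++;
                if (!safe_read(scan, sizeof(val), &val))
                    return n;    /* walked off the stack: give up */
            } while (!is_retaddr(val));
            frames[n++] = val;
            fp = NULL;           /* chain is lost; stay on the slow path */
        }
    }
    return n;
}
```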

splitting off any work on callstack walking perf to issue #711. see also in that issue a new breakdown of times for cfrac built /Oy- (huge difference).

derekbruening commented 9 years ago

From bruen...@google.com on December 08, 2011 07:21:00

```
spec2k6cmpave namedres/leaksonly-nocount-i460-A-L-ref/ namedres/native-ref
400.perlbench    3.02  ( 1521 /  503)
401.bzip2        1.10  (  948 /  865)
403.gcc          1.68  (  719 /  429)
429.mcf          1.02  (  380 /  373)
445.gobmk        1.56  ( 1000 /  639)
456.hmmer        1.09  ( 1297 / 1195)
458.sjeng        1.76  ( 1374 /  782)
462.libquantum   1.21  (  894 /  739)
464.h264ref      1.67  ( 1755 / 1049)
471.omnetpp      3.09  ( 1293 /  419)
473.astar        1.31  (  863 /  660)
483.xalancbmk    3.57  ( 1245 /  349)
410.bwaves       1.07  (  803 /  751)
416.gamess       1.15  ( 1598 / 1392)
433.milc         1.22  (  757 /  620)
434.zeusmp       1.08  (  857 /  795)
435.gromacs      1.02  ( 1197 / 1170)
436.cactusADM    1.01  ( 1635 / 1617)
437.leslie3d     1.04  (  851 /  816)
444.namd         1.03  (  764 /  741)
447.dealII       1.93  ( 1156 /  599)
450.soplex       1.14  (  459 /  404)
453.povray       1.73  (  624 /  360)
454.calculix     1.07  ( 1609 / 1497)
459.GemsFDTD     1.00  (  934 /  935)
465.tonto        3.19  ( 2685 /  842)
470.lbm          1.07  (  692 /  648)
481.wrf          1.71  ( 2147 / 1258)
482.sphinx3      1.12  (  830 /  741)
average          1.54
```

vs old 1.66 => 9.3% improvement

direct comparison shows 26% win on tonto, 20% on perlbench:

```
spec2k6cmpave namedres/leaksonly-nocount-i460-A-L-ref/ namedres/leaksonly-nocount-ref/
400.perlbench    0.80  ( 1521 / 1897)
401.bzip2        1.00  (  948 /  951)
403.gcc          0.96  (  719 /  750)
429.mcf          0.97  (  380 /  393)
445.gobmk        0.99  ( 1000 / 1009)
456.hmmer        1.00  ( 1297 / 1301)
458.sjeng        0.99  ( 1374 / 1381)
462.libquantum   1.13  (  894 /  793)
464.h264ref      1.00  ( 1755 / 1751)
471.omnetpp      0.83  ( 1293 / 1559)
473.astar        0.99  (  863 /  874)
483.xalancbmk    0.94  ( 1245 / 1331)
410.bwaves       0.92  (  803 /  875)
416.gamess       0.99  ( 1598 / 1610)
433.milc         1.07  (  757 /  706)
434.zeusmp       0.94  (  857 /  910)
435.gromacs      1.00  ( 1197 / 1198)
436.cactusADM    0.99  ( 1635 / 1653)
437.leslie3d     1.00  (  851 /  851)
444.namd         1.00  (  764 /  761)
447.dealII       0.89  ( 1156 / 1299)
450.soplex       0.94  (  459 /  488)
453.povray       1.00  (  624 /  627)
454.calculix     1.00  ( 1609 / 1616)
459.GemsFDTD     0.95  (  934 /  983)
465.tonto        0.74  ( 2685 / 3613)
470.lbm          0.87  (  692 /  795)
481.wrf          0.87  ( 2147 / 2459)
482.sphinx3      0.99  (  830 /  840)
average          0.96
```

derekbruening commented 9 years ago

From bruen...@google.com on December 08, 2011 12:37:19

unfortunately the optimizations done here so far (A through L) don't have much impact on ui_tests. it seems to spend most of its time on instrumentation, even when there's really not much instrumentation being done:

results on windesk for HEAD (i.e., no issue #460 opts):

```
% ./batch.sh
native
[----------] 1 test from NPAPITesterBase (2814 ms total)
[----------] 1 test from NPAPITesterBase (2485 ms total)
[----------] 1 test from NPAPITesterBase (2506 ms total)
DR
[----------] 1 test from NPAPITesterBase (13527 ms total)
[----------] 1 test from NPAPITesterBase (12923 ms total)
[----------] 1 test from NPAPITesterBase (13342 ms total)
```

```
[----------] 1 test from NPAPITesterBase (544662 ms total)
[----------] 1 test from NPAPITesterBase (414874 ms total)
[----------] 1 test from NPAPITesterBase (430466 ms total)
-no_count_leaks
[----------] 1 test from NPAPITesterBase (329820 ms total)
[----------] 1 test from NPAPITesterBase (388453 ms total)
[----------] 1 test from NPAPITesterBase (379673 ms total)
-no_check_uninitialized
[----------] 1 test from NPAPITesterBase (71159 ms total)
[----------] 1 test from NPAPITesterBase (72336 ms total)
[----------] 1 test from NPAPITesterBase (70873 ms total)
-no_check_uninitialized -no_count_leaks
[----------] 1 test from NPAPITesterBase (49564 ms total)
[----------] 1 test from NPAPITesterBase (51328 ms total)
[----------] 1 test from NPAPITesterBase (51490 ms total)
-leaks_only
[----------] 1 test from NPAPITesterBase (46393 ms total)
[----------] 1 test from NPAPITesterBase (46452 ms total)
[----------] 1 test from NPAPITesterBase (58235 ms total)
-leaks_only -no_zero_stack
[----------] 1 test from NPAPITesterBase (43570 ms total)
[----------] 1 test from NPAPITesterBase (43709 ms total)
[----------] 1 test from NPAPITesterBase (45537 ms total)
-leaks_only -no_count_leaks
[----------] 1 test from NPAPITesterBase (23473 ms total)
[----------] 1 test from NPAPITesterBase (24109 ms total)
[----------] 1 test from NPAPITesterBase (23976 ms total)
-leaks_only -no_count_leaks -no_track_allocs
<crashes: fix issue #706 not in>
```

results for optimized (i.e., drmem has A through L):

```
% ./batch.sh
native
[----------] 1 test from NPAPITesterBase (3059 ms total)
[----------] 1 test from NPAPITesterBase (2349 ms total)
[----------] 1 test from NPAPITesterBase (2401 ms total)
DR
[----------] 1 test from NPAPITesterBase (12905 ms total)
[----------] 1 test from NPAPITesterBase (12950 ms total)
[----------] 1 test from NPAPITesterBase (13154 ms total)
DR -code_api -disable_traces -bb_single_restore_prefix -max_bb_instrs 256
[----------] 1 test from NPAPITesterBase (8144 ms total)
[----------] 1 test from NPAPITesterBase (8221 ms total)
[----------] 1 test from NPAPITesterBase (7747 ms total)
```

```
[----------] 1 test from NPAPITesterBase (481645 ms total)
[----------] 1 test from NPAPITesterBase (323987 ms total)
[----------] 1 test from NPAPITesterBase (162724 ms total)   <--- ?!?
-no_count_leaks
[----------] 1 test from NPAPITesterBase (315300 ms total)
[----------] 1 test from NPAPITesterBase (154741 ms total)
[----------] 1 test from NPAPITesterBase (308583 ms total)
-no_check_uninitialized
[----------] 1 test from NPAPITesterBase (68615 ms total)
[----------] 1 test from NPAPITesterBase (68943 ms total)
[----------] 1 test from NPAPITesterBase (69362 ms total)
-no_check_uninitialized -no_count_leaks
[----------] 1 test from NPAPITesterBase (47861 ms total)
[----------] 1 test from NPAPITesterBase (48193 ms total)
[----------] 1 test from NPAPITesterBase (48197 ms total)
-leaks_only
[----------] 1 test from NPAPITesterBase (44453 ms total)
[----------] 1 test from NPAPITesterBase (44938 ms total)
[----------] 1 test from NPAPITesterBase (45062 ms total)
-leaks_only -no_zero_stack
[----------] 1 test from NPAPITesterBase (42587 ms total)
[----------] 1 test from NPAPITesterBase (42854 ms total)
[----------] 1 test from NPAPITesterBase (44263 ms total)
-leaks_only -no_count_leaks
[----------] 1 test from NPAPITesterBase (23529 ms total)
[----------] 1 test from NPAPITesterBase (23139 ms total)
[----------] 1 test from NPAPITesterBase (23799 ms total)
-leaks_only -no_count_leaks -no_track_allocs
[----------] 1 test from NPAPITesterBase (21472 ms total)
[----------] 1 test from NPAPITesterBase (21593 ms total)
[----------] 1 test from NPAPITesterBase (21742 ms total)
```

splitting off investigation of why -no_track_allocs is still so slow to issue #714

derekbruening commented 9 years ago

From bruen...@google.com on April 12, 2012 09:22:51

xref issue #794, issue #721, issue #793, issue #689