DynamoRIO / drmemory

Memory Debugger for Windows, Linux, Mac, and Android

cache last callstack for further callstack walking performance improvement #1187

Open derekbruening opened 9 years ago

derekbruening commented 9 years ago

From bruen...@google.com on April 16, 2013 19:53:10

Callstack walking for optimized code is complex, and I could cite quite a few issue #s here on tweaks for both functionality and performance. This issue covers going a step beyond the work in issue #1186 and caching the full last callstack, with fp's, to try to avoid find_next_fp() even more. find_next_fp still shows up on perlbench diffmail:

1833.59user 5.72system 30:45.28elapsed 99%CPU (0avgtext+0avgdata 429768maxresident)k
ITIMER distribution (182407):
  0.0% of time in APPLICATION (1)
  3.2% of time in INTERPRETER (5785)
  0.3% of time in DISPATCH (628)
  0.1% of time in SYSCALL HANDLER (103)
  3.0% of time in INDIRECT BRANCH LOOKUP (5421)
  43.8% of time in FRAGMENT CACHE (79891)
  49.7% of time in UNKNOWN (90578)
RES-pcsamples.0.7276.html
  897 get_shadow_table
  1011 add_to_delay_list
  1109 bitmapx2_set
  1262 find_free_list_entry
  1307 packed_callstack_hash
  1438 safe_read
  1662 module_lookup
  2334 shadow_set_range
  2458 address_to_frame
  2999 rb_in_node
  3925 print_callstack
  5895 find_next_fp
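To make the caching idea concrete, here is a minimal sketch of a per-thread cache of the last walked callstack. This is not the actual Dr. Memory code or API; the names (cstack_cache_t, cstack_cache_lookup, CSTACK_CACHE_MAX_FRAMES) are hypothetical, and the real implementation is in the codereview link below. The point is just that each cached frame keeps its fp and return address, so a new walk can splice in the cached tail once it reaches a matching frame instead of scanning with find_next_fp():

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned char *app_pc;   /* stand-in for DynamoRIO's app_pc */
typedef unsigned int uint;

#define CSTACK_CACHE_MAX_FRAMES 64

/* One cached frame: the frame pointer and the return address that
 * sits just above it on the stack. */
typedef struct _cached_frame_t {
    app_pc fp;
    app_pc retaddr;
} cached_frame_t;

/* Hypothetical per-thread cache of the last callstack walked. */
typedef struct _cstack_cache_t {
    uint num_frames;
    cached_frame_t frames[CSTACK_CACHE_MAX_FRAMES];
} cstack_cache_t;

/* Look for a cached frame matching (fp, retaddr).  On a hit the caller
 * can copy frames[*match_idx + 1 ..] into the new callstack and stop
 * walking; on a miss it falls back to the normal scan (find_next_fp). */
static bool
cstack_cache_lookup(const cstack_cache_t *cache, app_pc fp, app_pc retaddr,
                    uint *match_idx /*OUT*/)
{
    uint i;
    for (i = 0; i < cache->num_frames; i++) {
        if (cache->frames[i].fp == fp && cache->frames[i].retaddr == retaddr) {
            *match_idx = i;
            return true;
        }
    }
    return false;
}
```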

Original issue: http://code.google.com/p/drmemory/issues/detail?id=1187

derekbruening commented 9 years ago

From bruen...@google.com on April 17, 2013 08:21:24

** TODO initial implementation: stats look good, but no perf win on cfrac

https://codereview.appspot.com/8789046/

cfrac full mode:

grep ^callstack $(ls -1td logs/D*|head -1)/g*
  callstack fp scans: 1, fp cache hits: 5445061
  callstack walks: 10890123, cache hits: 7033166
  callstack is_retaddr: 10890140, backdecode: 10890127, unreadable: 0
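Those counters correspond to roughly 7033166 / 10890123 ≈ 65% of callstack walks being satisfied from the cache, which is presumably why the stats look good even though the timings below show no net win.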

w/o issue #1186 find_next_fp cache or issue #1187 cstack cache:
51.26user 0.04system 0:51.42elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
ITIMER distribution (5120):
  2.6% of time in INTERPRETER (134)
  1.1% of time in INDIRECT BRANCH LOOKUP (57)
  20.7% of time in FRAGMENT CACHE (1061)
  75.5% of time in UNKNOWN (3868)
RES-pcsamples.0.891.html
  95 address_to_frame
  98 shadow_set_range
  107 get_shadow_table
  114 bitmapx2_byte
  152 find_next_fp
  160 print_callstack
  175 rb_in_node

w/ issue #1186 find_next_fp cache:
38.75user 0.04system 0:38.88elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
ITIMER distribution (3867):
  0.4% of time in INTERPRETER (17)
  1.2% of time in INDIRECT BRANCH LOOKUP (47)
  26.0% of time in FRAGMENT CACHE (1005)
  72.4% of time in UNKNOWN (2798)
RES-pcsamples.0.6680.html
  58 module_lookup
  65 memset
  65 rb_in_node
  76 address_to_frame
  86 shadow_set_range
  183 print_callstack

w/ issue #1187 cstack cache:
45.29user 0.04system 0:45.43elapsed 99%CPU (0avgtext+0avgdata 11840maxresident)k
ITIMER distribution (4523):
  0.0% of time in APPLICATION (1)
  0.6% of time in INTERPRETER (26)
  1.5% of time in INDIRECT BRANCH LOOKUP (70)
  21.6% of time in FRAGMENT CACHE (979)
  76.2% of time in UNKNOWN (3447)
RES-pcsamples.0.11656.html
  74 address_to_frame
  88 shadow_set_range
  95 find_next_fp
  101 rb_in_node
  148 fpra_cache_check
  174 print_callstack

w/ both:
39.23user 0.04system 0:39.36elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
ITIMER distribution (3917):
  1.5% of time in INTERPRETER (58)
  1.8% of time in INDIRECT BRANCH LOOKUP (69)
  26.7% of time in FRAGMENT CACHE (1044)
  70.1% of time in UNKNOWN (2746)
RES-pcsamples.0.17198.html
  47 safe_read
  50 rb_in_node
  57 address_to_frame
  58 __x86.get_pc_thunk.bx
  59 memset
  62 packed_callstack_hash
  113 shadow_set_range
  166 print_callstack
  168 fpra_cache_check

** TODO slight speedup on perlbench

This is the diffmail run, copied over from goobuntu:

/usr/bin/time /work/drmemory/git/build_x86_rel/bin/drmemory.pl -replace_malloc -dr_ops "-msgbox_mask 12 -prof_pcs" -pause_at_assert -dr /work/dr/git/exports -- /extsw/spec2006/v1.2-perlbench/perlbench_base.gcc43-32bit -I./lib diffmail.pl 4 800 10 17 19 300 > diffmail.4.800.10.17.19.300.out

1719.92user 5.65system 28:51.26elapsed 99%CPU (0avgtext+0avgdata 429740maxresident)k
ITIMER distribution (171083):
  0.0% of time in APPLICATION (2)
  2.7% of time in INTERPRETER (4553)
  0.5% of time in DISPATCH (926)
  0.2% of time in SYSCALL HANDLER (345)
  3.1% of time in INDIRECT BRANCH LOOKUP (5318)
  45.7% of time in FRAGMENT CACHE (78125)
  47.8% of time in UNKNOWN (81814)
RES-pcsamples.0.28924.html
  1110 module_lookup
  1139 add_to_delay_list
  1320 packed_callstack_hash
  1335 bitmapx2_set
  1345 safe_read
  1350 find_free_list_entry
  1730 address_to_frame
  1797 rb_in_node
  2263 shadow_set_range
  2575 fpra_cache_check
  3286 find_next_fp
  4000 print_callstack

This is vs the final issue #1186 run, pasted from below:
1833.59user 5.72system 30:45.28elapsed 99%CPU (0avgtext+0avgdata 429768maxresident)k
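That works out to roughly a 6% reduction in user time relative to the issue #1186-only run (1719.92 vs 1833.59 seconds).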

debug run:
  app mallocs: 278012393, frees: 277917410, large mallocs: 10026
  unique malloc stacks: 11512032
  callstack fp scans: 21,553,498, fp cache hits: 851,322,232
  callstack walks: 278,016,823, cache hits: 207,739,067
  callstack is_retaddr: 611533459, backdecode: 443968947, unreadable: 0
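Those counters work out to roughly 207,739,067 / 278,016,823 ≈ 75% of walks hitting the callstack cache, while about 443,968,947 / 611,533,459 ≈ 73% of is_retaddr queries still fall through to backdecode.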

Again, fpra_cache_check shows up: can we speed that up?

** TODO correctness: seems to have some problems? => bailing

The performance numbers show that while the callstack cache helps on its own, it is not as effective as the issue #1186 find_next_fp single-frame cache, so we would want both. When combined with the single-frame cache, however, it actually causes a slight slowdown on cfrac and only a slight speedup on perlbench diffmail.

fpra_cache_check() shows up in profiling. For every frame on a new callstack walk, we have to iterate the whole cache to look for a tail match, as we can't just blindly use the whole tail. Any ideas on how to improve this?
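For context, here is a minimal sketch of the per-frame check being described, reusing the hypothetical cstack_cache_t / cstack_cache_lookup types from the sketch above (again, not the real code, which is in the codereview link): because every new frame re-scans the whole cache, the worst case is on the order of new_frames * cached_frames comparisons per walk.

```c
typedef struct { app_pc fp; app_pc retaddr; } frame_t;

/* Simplified stand-in for the real frame-advancing logic (the actual
 * code uses safe_read, is_retaddr checks, and find_next_fp scans). */
static bool
next_frame(frame_t *f)
{
    app_pc *slots = (app_pc *) f->fp;   /* no safety checks in this sketch */
    f->retaddr = slots[1];
    f->fp = slots[0];
    return f->fp != NULL;
}

/* Walk a new callstack, consulting the cache at every frame.  The
 * cache lookup is a linear scan, so it is easy for this check to show
 * up in profiles the way fpra_cache_check does above. */
static uint
walk_with_cache(const cstack_cache_t *cache, frame_t f)
{
    uint frames_walked = 0;
    while (f.fp != NULL && frames_walked < CSTACK_CACHE_MAX_FRAMES) {
        uint idx;
        if (cstack_cache_lookup(cache, f.fp, f.retaddr, &idx)) {
            /* Tail match: reuse cache->frames[idx+1 ..] and stop. */
            return frames_walked + (cache->num_frames - idx - 1);
        }
        if (!next_frame(&f))   /* miss: decode/scan the slow way */
            break;
        frames_walked++;
    }
    return frames_walked;
}
```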

Running drheapstat on cfrac w/ and w/o this feature and comparing callstack.log gives 195 unique callstacks w/o vs 216 w/, so there seems to be a correctness problem. Unless you can see something I'm missing here, I may abandon this, as it will take a bunch more work to ensure correctness and there's still no guarantee of a perf win (though in theory it seems like we should be able to get one).

perf summary:

cfrac full mode:
  base (w/o issue #1186 find_next_fp cache or issue #1187 cstack cache):
    51.26user 0.04system 0:51.42elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
  w/ issue #1186 find_next_fp cache:
    38.75user 0.04system 0:38.88elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k
  w/ issue #1187 cstack cache:
    45.29user 0.04system 0:45.43elapsed 99%CPU (0avgtext+0avgdata 11840maxresident)k
  w/ both:
    39.23user 0.04system 0:39.36elapsed 99%CPU (0avgtext+0avgdata 11108maxresident)k

perlbench diffmail:
  base (w/o either):
    3359.52user 5.47system 56:14.44elapsed 99%CPU (0avgtext+0avgdata 425808maxresident)k
  w/ issue #1186 find_next_fp cache:
    1833.59user 5.72system 30:45.28elapsed 99%CPU (0avgtext+0avgdata 429768maxresident)k
  w/ both:
    1719.92user 5.65system 28:51.26elapsed 99%CPU (0avgtext+0avgdata 429740maxresident)k

I implemented it quickly, so it's possible it's worthwhile if done right, and it makes sense for repeated mallocs with a similar base callstack. But I can't seem to get much of a win.

Bailing for now. Leaving this issue open for either trying something like this cache again, or some other idea to improve callstack walking.