derekbruening commented 10 years ago

From bruen...@google.com on January 12, 2012 15:18:14

What is the problem to solve? Why is it important? Provide some context for those unfamiliar with the details of the system.

We would like a faster "light mode" that truly focuses on only detecting unaddressable errors.

What are the possible approaches to solving the problem?

Currently we're using shadow memory. We plan to use a pattern-based approach instead. Xref the 1992 Purify paper and other references to using patterns rather than shadow memory. Patterns are difficult to use for detecting uninitialized reads where we want to delay reporting, but for unaddressable errors they're a natural fit as we can put the patterns into the redzones and report immediately.

Which approach is being taken and why?

There are tradeoffs in the pattern size and detection of unaligned references, where larger patterns have fewer false matches but require either giving up unaligned detection or using a "medium path". Initially we will likely abandon detection of certain unaligned references. This is, after all, a "light mode", and finding extreme corner cases is not worth a performance impact on finding common cases for "light mode".

Any interesting details or challenges of the implementation?

By performing a direct compare, we only need to preserve aflags and do not need to spill any other registers than eax for aflags. By using a fault on the slowpath, we have no table or non-local jump or call, making all our instrumentation PIC and thus making it easy to persist.

We may try to adaptively change the actual pattern used based on frequencies observed in the application.
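To make the fast-path/slow-path split concrete, here is a minimal Python sketch of a pattern-filled redzone check. It is an illustration only, not Dr. Memory's implementation: the `Heap` class, the pattern bytes, and the redzone size are all hypothetical, and only 2-byte aligned accesses are modeled (unaligned references are exactly the case the writeup says may be skipped at first).

```python
PATTERN = b"\xf1\xfd"   # hypothetical 2-byte pattern value
REDZONE = 16            # illustrative redzone size in bytes


class Heap:
    """Toy heap: every block is surrounded by pattern-filled redzones."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.blocks = {}  # start -> length; only consulted on the slow path
        self.top = 0

    def _fill(self, lo, hi):
        # Tie the pattern phase to the address so aligned reads match it.
        for addr in range(lo, hi):
            self.mem[addr] = PATTERN[addr % 2]

    def malloc(self, n):
        start = self.top + REDZONE
        end = start + n
        self._fill(self.top, start)      # leading redzone
        self._fill(end, end + REDZONE)   # trailing redzone
        self.blocks[start] = n
        self.top = end + REDZONE
        return start

    def check(self, addr):
        """Check a 2-byte aligned access at addr."""
        # Fast path: one direct compare, no table lookup.
        if bytes(self.mem[addr:addr + 2]) != PATTERN:
            return "ok"
        # Slow path (a fault in the design above): confirm against block bounds.
        for start, n in self.blocks.items():
            if start <= addr < start + n:
                return "ok (false match)"
        return "unaddressable"


heap = Heap(256)
p = heap.malloc(8)
print(heap.check(p), heap.check(p + 8))
```

The slow path runs only when the compared value equals the pattern, which is why a pattern that rarely occurs in application data matters for performance.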

Original issue: http://code.google.com/p/drmemory/issues/detail?id=750

derekbruening commented 10 years ago

From bruen...@google.com on January 17, 2012 08:26:54

see issue #151 comment 2 and comment 4 for some numbers to try and beat on unit_tests

derekbruening commented 10 years ago

From zhao...@google.com on January 17, 2012 11:50:30

Preliminary results:

|----------------+--------+------+------------+-------+-----------+------+-----------|
| Benchmarks     | Native |   DR |      DR/Na | bbcnt |    bbc/Na |  DrM |    DrM/Na |
|----------------+--------+------+------------+-------+-----------+------+-----------|
| 400.perlbench  |    404 |  609 |  1.5074257 |  1230 | 3.0445545 |      |           |
| 401.bzip2      |    711 |  751 |  1.0562588 |  1131 | 1.5907173 | 2047 | 2.8790436 |
| 403.gcc        |    339 |  459 |  1.3539823 |   816 | 2.4070796 |  950 | 2.8023599 |
| 429.mcf        |    283 |  286 |  1.0106007 |   322 | 1.1378092 |  371 | 1.3109541 |
| 445.gobmk      |    520 |  740 |  1.4230769 |   962 |      1.85 | 1241 | 2.3865385 |
| 456.hmmer      |    979 |  966 | 0.98672114 |  1301 | 1.3289070 | 2204 | 2.2512768 |
| 458.sjeng      |    606 |  801 |  1.3217822 |  1442 | 2.3795380 | 1383 | 2.2821782 |
| 462.libquantum |    801 |  806 |  1.0062422 |   997 | 1.2446941 | 1178 | 1.4706617 |
| 464.h264ref    |    869 | 1063 |  1.2232451 |  1482 | 1.7054085 | 2694 | 3.1001151 |
| 471.omnetpp    |    320 |  373 |   1.165625 |   704 |       2.2 | 1795 |  5.609375 |
| 473.astar      |    533 |  541 |  1.0150094 |   670 | 1.2570356 | 1141 | 2.1407129 |
| 483.xalancbmk  |    262 |  325 |  1.2404580 |   781 | 2.9809160 | 1373 | 5.2404580 |
|----------------+--------+------+------------+-------+-----------+------+-----------|
| 410.bwaves     |    597 |  599 |  1.0033501 |   655 | 1.0971524 |  792 | 1.3266332 |
| 416.gamess     |   1144 | 1176 |  1.0279720 |  1747 | 1.5270979 | 3128 | 2.7342657 |
| 433.milc       |    500 |  516 |      1.032 |   607 |     1.214 | 1005 |      2.01 |
| 434.zeusmp     |    636 |  638 |  1.0031447 |  1193 | 1.8757862 | 1321 | 2.0770440 |
| 435.gromacs    |    957 |  966 |  1.0094044 |  1034 | 1.0804598 | 1768 | 1.8474399 |
| 436.cactusADM  |   1186 | 1194 |  1.0067454 |  1218 | 1.0269815 | 2570 | 2.1669477 |
| 437.leslie3d   |   1039 | 1043 |  1.0038499 |  1102 | 1.0606352 | 1285 | 1.2367661 |
| 444.namd       |    620 |  620 |          1 |   697 | 1.1241935 | 1431 | 2.3080645 |
| 447.dealII     |    502 |  507 |  1.0099602 |  1040 | 2.0717131 | 1755 | 3.4960159 |
| 450.soplex     |    309 |  324 |  1.0485437 |   639 | 2.0679612 |  836 | 2.7055016 |
| 453.povray     |    291 |  334 |  1.1477663 |   623 | 2.1408935 | 1040 | 3.5738832 |
| 454.calculix   |   1244 | 1238 | 0.99517685 |  1549 | 1.2451768 |      |           |
| 459.GemsFDTD   |   1054 | 1072 |  1.0170778 |  1114 | 1.0569260 |      |           |
| 465.tonto      |    729 |  794 |  1.0891632 |  1182 | 1.6213992 |      |           |
| 470.lbm        |    438 |  443 |  1.0114155 |   471 | 1.0753425 | 2337 | 5.3356164 |
| 481.wrf        |   1045 | 1101 |  1.0535885 |  1485 | 1.4210526 | 3102 | 2.9684211 |
| 482.sphinx3    |    604 |  602 | 0.99668874 |   872 | 1.4437086 | 1174 | 1.9437086 |
|----------------+--------+------+------------+-------+-----------+------+-----------|
| Average        |        |      |  1.0953888 |       | 1.6302462 |      | 2.6881593 |
|----------------+--------+------+------------+-------+-----------+------+-----------|

derekbruening commented 10 years ago

From zhao...@google.com on January 18, 2012 07:18:36

Performance of original shadow memory based approach (nouninit,noleak):

|----------------+--------+------+------------+------+-----------|
| Benchmarks     | Native |   DR |      DR/Na |  DrM |    DrM/Na |
|----------------+--------+------+------------+------+-----------|
| 400.perlbench  |    404 |  609 |  1.5074257 | 3473 | 8.5965347 |
| 401.bzip2      |    711 |  751 |  1.0562588 | 1774 | 2.4950774 |
| 403.gcc        |    339 |  459 |  1.3539823 | 1485 | 4.3805310 |
| 429.mcf        |    283 |  286 |  1.0106007 |  493 | 1.7420495 |
| 445.gobmk      |    520 |  740 |  1.4230769 | 1930 | 3.7115385 |
| 456.hmmer      |    979 |  966 | 0.98672114 | 3421 | 3.4943820 |
| 458.sjeng      |    606 |  801 |  1.3217822 | 2417 | 3.9884488 |
| 462.libquantum |    801 |  806 |  1.0062422 | 2106 | 2.6292135 |
| 464.h264ref    |    869 | 1063 |  1.2232451 | 8317 | 9.5707710 |
| 471.omnetpp    |    320 |  373 |   1.165625 | 2743 |  8.571875 |
| 473.astar      |    533 |  541 |  1.0150094 | 1214 | 2.2776735 |
| 483.xalancbmk  |    262 |  325 |  1.2404580 | 2014 | 7.6870229 |
|----------------+--------+------+------------+------+-----------|
| 410.bwaves     |    597 |  599 |  1.0033501 | 1169 | 1.9581240 |
| 416.gamess     |   1144 | 1176 |  1.0279720 | 3505 | 3.0638112 |
| 433.milc       |    500 |  516 |      1.032 | 1030 |      2.06 |
| 434.zeusmp     |    636 |  638 |  1.0031447 | 1662 | 2.6132075 |
| 435.gromacs    |    957 |  966 |  1.0094044 | 1320 | 1.3793103 |
| 436.cactusADM  |   1186 | 1194 |  1.0067454 | 1540 | 1.2984823 |
| 437.leslie3d   |   1039 | 1043 |  1.0038499 | 2555 | 2.4590953 |
| 444.namd       |    620 |  620 |          1 | 1201 | 1.9370968 |
| 447.dealII     |    502 |  507 |  1.0099602 | 2259 |       4.5 |
| 450.soplex     |    309 |  324 |  1.0485437 | 1008 | 3.2621359 |
| 453.povray     |    291 |  334 |  1.1477663 | 1237 | 4.2508591 |
| 454.calculix   |   1244 | 1238 | 0.99517685 | 3042 | 2.4453376 |
| 459.GemsFDTD   |   1054 | 1072 |  1.0170778 | 2368 | 2.2466793 |
| 465.tonto      |    729 |  794 |  1.0891632 | 3335 | 4.5747599 |
| 470.lbm        |    438 |  443 |  1.0114155 |  767 | 1.7511416 |
| 481.wrf        |   1045 | 1101 |  1.0535885 | 3864 | 3.6976077 |
| 482.sphinx3    |    604 |  602 | 0.99668874 | 1834 | 3.0364238 |
|----------------+--------+------+------------+------+-----------|
| Average        |        |      |  1.0953888 |      | 3.6441100 |
|----------------+--------+------+------------+------+-----------|

derekbruening commented 10 years ago

From bruen...@google.com on January 18, 2012 07:31:45

compare to issue #394 comment 7

btw, can you eliminate all extra digits so the #s are easier to read in these columns

derekbruening commented 10 years ago

From bradc...@google.com on January 25, 2012 09:58:18

The writeup in the first entry of this bug is very helpful; about the right level of detail and such. Thanks!

Would it be possible to use more explicit labels for the columns? It's a little bit difficult for me to figure out what's being compared.

derekbruening commented 10 years ago

From zhao...@google.com on January 30, 2012 08:22:19

Relative slowdown without leak detection or redzone:

11:19|zhaoqin@zhaoqin:~/Benchmarks/spec2k6/SPEC_CPU2006v1.2

spec2k6cmp ./result/CINT2006.024.ref.txt ./result/CINT2006.ia32.native.ref.txt
400.perlbench 3.97 ( 1603 / 404)
401.bzip2 2.72 ( 1936 / 711)
403.gcc 3.55 ( 1202 / 339)
429.mcf 1.34 ( 378 / 283)
445.gobmk 2.70 ( 1405 / 520)
456.hmmer 2.30 ( 2252 / 979)
458.sjeng 2.87 ( 1737 / 606)
462.libquantum 1.53 ( 1224 / 801)
464.h264ref 6.69 ( 5812 / 869)
471.omnetpp 3.13 ( 1003 / 320)
473.astar 2.32 ( 1236 / 533)
483.xalancbmk 4.74 ( 1243 / 262)
average 3.16

11:19|zhaoqin@zhaoqin:~/Benchmarks/spec2k6/SPEC_CPU2006v1.2

spec2k6cmp ./result/CFP2006.024.ref.txt ./result/CFP2006.ia32.native.ref.txt
410.bwaves 1.36 ( 809 / 597)
416.gamess 2.83 ( 3242 / 1144)
433.milc 2.07 ( 1034 / 500)
434.zeusmp 2.11 ( 1345 / 636)
435.gromacs 1.87 ( 1792 / 957)
436.cactusADM 2.17 ( 2569 / 1186)
437.leslie3d 1.23 ( 1282 / 1039)
444.namd 2.32 ( 1440 / 620)
447.dealII 3.44 ( 1726 / 502)
450.soplex 2.83 ( 876 / 309)
453.povray 4.30 ( 1251 / 291)
454.calculix 1.64 ( 2036 / 1244)
459.GemsFDTD 1.25 ( 1313 / 1054)
465.tonto 1.75 ( 1278 / 729)
470.lbm 5.33 ( 2336 / 438)
481.wrf 1.59 ( 1660 / 1045)
482.sphinx3 1.98 ( 1196 / 604)
average 2.36

Overall average: 2.69
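One way to reproduce the overall figure from the two suite averages is a mean weighted by benchmark count (12 CINT and 17 CFP benchmarks). This is an assumption about how the number was derived, not something the comment states; the variable names are illustrative.

```python
# Suite averages and benchmark counts taken from the two listings above.
cint_avg, cint_n = 3.16, 12
cfp_avg, cfp_n = 2.36, 17

# Weight each suite average by its number of benchmarks.
overall = (cint_avg * cint_n + cfp_avg * cfp_n) / (cint_n + cfp_n)
print(f"{overall:.2f}")  # prints 2.69
```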

derekbruening commented 10 years ago

From zhao...@google.com on February 02, 2012 19:02:49

There are many sub-tasks for implementing pattern mode, including:

adding the pattern option
code instrumentation: insert pattern check code for each memory reference
system call parameter checks: in addition to application instructions, we should also check system call parameters that may access memory
redzone management: add redzones to memory allocations, maintain redzones for easy tracking, enlarge redzones on free, etc.
fault handling
adding pattern mode tests
code refactoring: many utility functions, such as those for TLS slots, are implemented assuming shadow memory is in use; we need to change them
performance improvement

Maybe we should split the issue into several issues.

derekbruening commented 10 years ago

From bruen...@google.com on February 02, 2012 19:14:44

most of the items on your list are simply the necessary parts of getting the feature to work at all. that's what this issue covers: the end-to-end implementation of the feature.

optional augmentation such as additional optimizations beyond the original impl, or perhaps some larger cleanup or refactoring that's delayed, or other postponed or separable work that ends up not being part of the original implementation, should be filed separately so we don't forget about it.

derekbruening commented 10 years ago

From zhao...@google.com on February 14, 2012 12:35:07

Some performance evaluation: for 400.perlbench with test input from spec2k6

|----------------------------------------------------------------------+------|
| native                                                               |  3.7 |
|----------------------------------------------------------------------+------|
| DynamoRIO (with trace)                                               | 14.8 |
|----------------------------------------------------------------------+------|
| Dr.Memory (shadow mode)                                              |      |
|----------------------------------------------------------------------+------|
| full                                                                 |  190 |
| -no_count_leaks                                                      |  162 |
| -no_check_uninitialized                                              |  107 |
| -no_check_uninitialized -no_count_leaks                              | 78.0 |
| -leaks_only -no_count_leaks -no_zero_stack                           | 22.6 |
| -perturb_only                                                        | SEGV |
| -leaks_only -no_count_leaks -no_zero_stack -no_track_allocs          | 13.1 |
|----------------------------------------------------------------------+------|
| Dr.Memory (pattern mode) (redzone 16)                                |      |
|----------------------------------------------------------------------+------|
| -pattern 0xb12f                                                      | 59.6 |
| -pattern 0xb12f -no_count_leaks                                      | 34.3 |
| -pattern 0xb12f -no_count_leaks (no malloc rb-tree)                  | 33.1 |
| -pattern 0xb12f -no_count_leaks -no_track_allocs -no_replace_realloc | 16.7 |
|----------------------------------------------------------------------+------|
| no instrumentation (pattern_opnd_need_check return false)            |      |
|----------------------------------------------------------------------+------|
| -pattern 0xb12f -no_count_leaks                                      | 27.3 |
| -pattern 0xb12f -no_count_leaks -no_track_allocs -no_replace_realloc | 13.0 |
|----------------------------------------------------------------------+------|
| tune redzone (no instrumentation, no malloc rb-tree)                 |      |
| -pattern 0xb12f -no_count_leaks                                      |      |
|----------------------------------------------------------------------+------|
| 0x8                                                                  | 27.6 |
| 0x10                                                                 | 27.4 |
| 0x20                                                                 | 27.5 |
| 0x40                                                                 | 27.6 |
|----------------------------------------------------------------------+------|

It looks like the malloc tracking is the major overhead. For malloc overhead, also see issue #460 .

derekbruening commented 10 years ago

From zhao...@google.com on February 14, 2012 15:26:16

Two more lines of data:

|----------------------------------------------------------------------+------|
| DR -disable_traces -bb_single_restore_prefix -max_bb_instrs 256      | 9.73 |
|----------------------------------------------------------------------+------|
| Dr.Memory (pattern mode) (redzone 16)                                |      |
|----------------------------------------------------------------------+------|
| -pattern 0xb12f -no_count_leaks -no_replace_realloc (no rb-tree)     | 23.1 |
|----------------------------------------------------------------------+------|

From the above, we know that for 400.perlbench with test input:

Native: 3.7
DR's overhead: +6.03 (= 9.73 - 3.7)
DrM's own overhead: +3.27 (= 13.0 - 9.73)
Pattern Instrumentation overhead: +3.7 (= 16.7 - 13.0)
alloc track overhead: +17.6 (= 34.3 - 16.7) or +14.3 (= 27.3 - 13.0)
replace overhead: +6.4 (= 23.1 - 16.7)

alloc tracking is the major overhead for perlbench.
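The subtraction chain above can be recomputed directly from the measured times. The sketch below uses only the numbers reported in the two tables; the dictionary keys are hypothetical labels chosen for this example, not option names.

```python
# Measured times for 400.perlbench (test input), copied from the tables above.
t = {
    "native": 3.7,
    "dr_no_traces": 9.73,                 # DR -disable_traces ...
    "no_instr_no_track": 13.0,            # no instrumentation, -no_track_allocs
    "pattern_no_track": 16.7,             # instrumentation, -no_track_allocs
    "pattern_no_count_leaks": 34.3,       # instrumentation, alloc tracking on
    "no_instr_no_count_leaks": 27.3,      # no instrumentation, alloc tracking on
    "pattern_no_replace_realloc": 23.1,   # -no_replace_realloc (no rb-tree)
}

breakdown = {
    "DR": t["dr_no_traces"] - t["native"],
    "DrM own": t["no_instr_no_track"] - t["dr_no_traces"],
    "pattern instrumentation": t["pattern_no_track"] - t["no_instr_no_track"],
    "alloc tracking (with instr)": t["pattern_no_count_leaks"] - t["pattern_no_track"],
    "alloc tracking (no instr)": t["no_instr_no_count_leaks"] - t["no_instr_no_track"],
    "replace": t["pattern_no_replace_realloc"] - t["pattern_no_track"],
}

for name, delta in breakdown.items():
    print(f"{name}: +{delta:.2f}")
```

Whichever pair of runs is used, the allocation tracking delta is the largest term, which matches the conclusion drawn in the comment.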

derekbruening commented 10 years ago

From zhao...@google.com on April 10, 2012 10:51:01

Current performance:

pattern vs unaddr:
400.perlbench 0.99 ( 3442 / 3460)
401.bzip2 1.51 ( 2669 / 1771)
403.gcc 0.74 ( 1102 / 1495)
429.mcf 0.78 ( 380 / 488)
445.gobmk 0.89 ( 1697 / 1917)
456.hmmer 0.67 ( 2292 / 3434)
458.sjeng 0.75 ( 1803 / 2408)
462.libquantum 0.61 ( 1277 / 2106)
464.h264ref 0.70 ( 5809 / 8326)
471.omnetpp 1.17 ( 3139 / 2690)
473.astar 1.11 ( 1328 / 1200)
483.xalancbmk 1.05 ( 2114 / 2005)
average 0.91

410.bwaves 0.70 ( 817 / 1171)
416.gamess 0.93 ( 3291 / 3520)
433.milc 1.00 ( 1030 / 1030)
434.zeusmp 0.82 ( 1361 / 1667)
435.gromacs 1.36 ( 1799 / 1320)
436.cactusADM 1.68 ( 2597 / 1545)
437.leslie3d 0.50 ( 1284 / 2553)
444.namd 1.21 ( 1452 / 1201)
447.dealII 0.97 ( 2262 / 2340)
450.soplex 0.93 ( 941 / 1008)
453.povray 1.06 ( 1324 / 1247)
454.calculix 0.67 ( 1994 / 2965)
459.GemsFDTD 0.56 ( 1325 / 2385)
465.tonto 1.05 ( 3597 / 3431)
470.lbm 3.02 ( 2296 / 761)
481.wrf 0.86 ( 3451 / 4026)
482.sphinx3 0.69 ( 1263 / 1830)
average 1.06

Pattern vs Native:

400.perlbench 8.52 ( 3442 / 404)
401.bzip2 3.75 ( 2669 / 711)
403.gcc 3.25 ( 1102 / 339)
429.mcf 1.34 ( 380 / 283)
445.gobmk 3.26 ( 1697 / 520)
456.hmmer 2.34 ( 2292 / 979)
458.sjeng 2.98 ( 1803 / 606)
462.libquantum 1.59 ( 1277 / 801)
464.h264ref 6.68 ( 5809 / 869)
471.omnetpp 9.81 ( 3139 / 320)
473.astar 2.49 ( 1328 / 533)
483.xalancbmk 8.07 ( 2114 / 262)
average 4.51

410.bwaves 1.37 ( 817 / 597)
416.gamess 2.88 ( 3291 / 1144)
433.milc 2.06 ( 1030 / 500)
434.zeusmp 2.14 ( 1361 / 636)
435.gromacs 1.88 ( 1799 / 957)
436.cactusADM 2.19 ( 2597 / 1186)
437.leslie3d 1.24 ( 1284 / 1039)
444.namd 2.34 ( 1452 / 620)
447.dealII 4.51 ( 2262 / 502)
450.soplex 3.05 ( 941 / 309)
453.povray 4.55 ( 1324 / 291)
454.calculix 1.60 ( 1994 / 1244)
459.GemsFDTD 1.26 ( 1325 / 1054)
465.tonto 4.93 ( 3597 / 729)
470.lbm 5.24 ( 2296 / 438)
481.wrf 3.30 ( 3451 / 1045)
482.sphinx3 2.09 ( 1263 / 604)
average 2.74

derekbruening commented 10 years ago

From zhao...@google.com on April 10, 2012 11:14:41

performance test from chromium unit_tests:

Native: [==========] 3885 tests from 632 test cases ran. (236891 ms total) [ PASSED ] 3885 tests.

unaddr: [==========] 3878 tests from 632 test cases ran. (922217 ms total) [ PASSED ] 3878 tests.

pattern: [==========] 3878 tests from 632 test cases ran. (911384 ms total) [ PASSED ] 3878 tests.

derekbruening commented 10 years ago

From zhao...@google.com on April 12, 2012 08:14:28

After aflags context switch optimization:

Pattern V.S. Shadow Light

spec2k6cmp CINT2006.ia32.drm.pattern.ref.txt CINT2006.ia32.drm.light.ref.txt
400.perlbench 0.98 ( 3381 / 3460)
401.bzip2 1.56 ( 2757 / 1771)
403.gcc 0.71 ( 1066 / 1495)
429.mcf 0.75 ( 366 / 488)
445.gobmk 0.88 ( 1683 / 1917)
456.hmmer 0.66 ( 2278 / 3434)
458.sjeng 0.70 ( 1694 / 2408)
462.libquantum 0.59 ( 1241 / 2106)
464.h264ref 0.68 ( 5639 / 8326)
471.omnetpp 1.12 ( 3013 / 2690)
473.astar 1.06 ( 1277 / 1200)
483.xalancbmk 1.03 ( 2074 / 2005)
average 0.89

11:11|zhaoqin@zhaoqin:~/Benchmarks/spec2k6/SPEC_CPU2006v1.2/result

spec2k6cmp CFP2006.ia32.drm.pattern.ref.txt CFP2006.ia32.drm.light.ref.txt
410.bwaves 0.69 ( 810 / 1171)
416.gamess 0.68 ( 2388 / 3520)
433.milc 0.73 ( 747 / 1030)
434.zeusmp 0.59 ( 986 / 1667)
435.gromacs 0.86 ( 1140 / 1320)
436.cactusADM 0.97 ( 1503 / 1545)
437.leslie3d 0.50 ( 1282 / 2553)
444.namd 0.67 ( 810 / 1201)
447.dealII 0.89 ( 2090 / 2340)
450.soplex 0.71 ( 719 / 1008)
453.povray 0.75 ( 934 / 1247)
454.calculix 0.67 ( 2001 / 2965)
459.GemsFDTD 0.55 ( 1310 / 2385)
465.tonto 1.04 ( 3553 / 3431)
470.lbm 0.95 ( 721 / 761)
481.wrf 0.83 ( 3350 / 4026)
482.sphinx3 0.67 ( 1218 / 1830)
average 0.75

Pattern V.S. Native

spec2k6cmp CINT2006.ia32.drm.pattern.ref.txt CINT2006.ia32.native.ref.txt
400.perlbench 8.37 ( 3381 / 404)
401.bzip2 3.88 ( 2757 / 711)
403.gcc 3.14 ( 1066 / 339)
429.mcf 1.29 ( 366 / 283)
445.gobmk 3.24 ( 1683 / 520)
456.hmmer 2.33 ( 2278 / 979)
458.sjeng 2.80 ( 1694 / 606)
462.libquantum 1.55 ( 1241 / 801)
464.h264ref 6.49 ( 5639 / 869)
471.omnetpp 9.42 ( 3013 / 320)
473.astar 2.40 ( 1277 / 533)
483.xalancbmk 7.92 ( 2074 / 262)
average 4.40

11:13|zhaoqin@zhaoqin:~/Benchmarks/spec2k6/SPEC_CPU2006v1.2/result

spec2k6cmp CFP2006.ia32.drm.pattern.ref.txt CFP2006.ia32.native.ref.txt
410.bwaves 1.36 ( 810 / 597)
416.gamess 2.09 ( 2388 / 1144)
433.milc 1.49 ( 747 / 500)
434.zeusmp 1.55 ( 986 / 636)
435.gromacs 1.19 ( 1140 / 957)
436.cactusADM 1.27 ( 1503 / 1186)
437.leslie3d 1.23 ( 1282 / 1039)
444.namd 1.31 ( 810 / 620)
447.dealII 4.16 ( 2090 / 502)
450.soplex 2.33 ( 719 / 309)
453.povray 3.21 ( 934 / 291)
454.calculix 1.61 ( 2001 / 1244)
459.GemsFDTD 1.24 ( 1310 / 1054)
465.tonto 4.87 ( 3553 / 729)
470.lbm 1.65 ( 721 / 438)
481.wrf 3.21 ( 3350 / 1045)
482.sphinx3 2.02 ( 1218 / 604)
average 2.11

derekbruening commented 10 years ago

From zhao...@google.com on April 12, 2012 09:18:15

Summary of the pattern mode performance status:

Compared to native: pattern mode shows about a 4.40x slowdown on SPECINT and a 2.11x slowdown on SPECFP.
Compared to shadow light mode: pattern mode takes 0.89x the run time on SPECINT and 0.75x on SPECFP, i.e. it is faster on both.

Known performance issues:
400.perlbench: malloc wrapping and management appear to be the major overhead; instrumentation has little impact.
401.bzip2: the compression algorithm makes it hard to pick a good 2-byte pattern value. It also performs many single-byte accesses, for which pattern mode inserts two checks (for the normal and the reversed pattern value), causing a significant slowdown.

Benchmarks to be investigated:

436.cactusADM, 465.tonto, 470.lbm, 471.omnetpp, 473.astar, 483.xalancbmk: why do they show little or no improvement over shadow light mode?
464.h264ref, 471.omnetpp, 483.xalancbmk, 447.dealII, 465.tonto: where does the high slowdown compared to native come from? 447.dealII seems very sensitive to code cache size.

Possible optimizations:

More aggressive aflags save/restore merging
Aggressive check merging (for 470.lbm)
Moving checks around to avoid aflags save/restore
If there are too many faults, flush the cache and remove the reverse pattern check (for 401.bzip2) ( issue #860 )
If there are too many faults, flush the cache and do the 4-byte checking only (for 401.bzip2) ( issue #860 )
Malloc interception/wrapping/management optimization ( issue #794 )

We might need to revisit the instrumentation scheme, since some of these optimizations are hard to perform with the current instrumentation approach.

derekbruening commented 10 years ago

From bruen...@google.com on April 12, 2012 09:23:11

malloc interception performance improvement is issue #460

derekbruening commented 10 years ago

From rnk@google.com on April 12, 2012 09:26:14

For 1-byte accesses, maybe the slowdown is coming from unaligned 2-byte accesses. Perhaps instead we should back-align the address and see if that makes it faster.

derekbruening commented 10 years ago

From zhao...@google.com on April 12, 2012 09:31:21

Back-aligning requires stealing more registers and adding more instrumentation; I am not sure the benefit of alignment would offset the extra overhead. In the bzip2 case, it is clear that the fault path executes many more times when the reversed pattern value check is enabled for single-byte accesses, which causes 2.9x in C2 but 3.9x in C13.

derekbruening commented 10 years ago

From zhao...@google.com on April 20, 2012 10:53:45

Malloc Intensive Benchmarks:

**** 400.perlbench

Native Time: 404
app mallocs: 22899902, frees: 22708055, large mallocs: 3950921
app mallocs: 278729256, frees: 278630136, large mallocs: 10026
app mallocs: 59030874, frees: 57367721, large mallocs: 420422

**** 447.dealII

Native Time: 502
app mallocs: 151332320, frees: 151332318, large mallocs: 6889

**** 465.tonto

Native Time: 729
app mallocs: 1214789684, frees: 1214789663, large mallocs: 7604207

**** 471.omnetpp

Native Time: 320
app mallocs: 267064936, frees: 266998684, large mallocs: 830

**** 483.xalancbmk

Native Time: 262
app mallocs: 135155474, frees: 135155474, large mallocs: 11354

derekbruening commented 10 years ago

From zhao...@google.com on April 20, 2012 11:06:05

471.omnetpp: callstack is_retaddr: 216825270, backdecode: 216823047, unreadable: 0

471.omnetpp performs a lot of callstack walks, which come from new/delete mismatch bugs; xref issue #862

derekbruening commented 10 years ago

From zhao...@google.com on April 20, 2012 16:41:40

for unit_test on Windows:

Native: [----------] Global test environment tear-down [==========] 3885 tests from 632 test cases ran. (236891 ms total) [ PASSED ] 3885 tests.

YOU HAVE 73 DISABLED TESTS

Pattern: [----------] Global test environment tear-down [==========] 3878 tests from 632 test cases ran. (836474 ms total) [ PASSED ] 3878 tests.

YOU HAVE 73 DISABLED TESTS

Shadow light: [----------] Global test environment tear-down [==========] 3878 tests from 632 test cases ran. (922217 ms total) [ PASSED ] 3878 tests.

YOU HAVE 73 DISABLED TESTS

Pattern mode takes 3.5x the native run time. The pattern mode used here does not include the aflags optimization, only the removal of the rb-tree. It should be able to achieve ~3x.

derekbruening commented 10 years ago

From zhao...@google.com on April 23, 2012 09:08:59

There is a performance problem with using the reversed pattern value on single-byte accesses. Assume the pattern value is 0x4321, which is stored as the bytes 0x21, 0x43, so the reversed value is 0x43, 0x21. An app allocates a 5-byte block whose last byte happens to be 0x43, and the following bytes in the redzone are 0x21, 0x43, 0x21, .... The reverse check on an access to that last byte will trigger the ud2a. Even worse, the expensive walk will happen.
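The scenario can be replayed on a small byte buffer. This is a hedged sketch, not the real instrumentation: `byte_access_faults` is a hypothetical stand-in for the two inserted compares (normal and reversed pattern) on a single-byte access, and the memory layout simply follows the description above.

```python
PATTERN = (0x4321).to_bytes(2, "little")   # stored as 0x21, 0x43
REVERSED = PATTERN[::-1]                   # 0x43, 0x21

# A 5-byte allocation whose last byte happens to be 0x43,
# followed by a redzone filled with 0x21, 0x43, 0x21, ...
block = bytes([0x00, 0x00, 0x00, 0x00, 0x43])
redzone = PATTERN * 4
mem = block + redzone


def byte_access_faults(mem, addr):
    """Model the 2-byte compare done for a 1-byte access at addr."""
    word = mem[addr:addr + 2]
    return word == PATTERN or word == REVERSED


last = len(block) - 1
print(byte_access_faults(mem, 0))      # False: ordinary data
print(byte_access_faults(mem, last))   # True: valid byte, but reversed match
```

The access to the last byte is legal, yet the compare sees 0x43 followed by the first redzone byte 0x21 and takes the fault path, which is the false match described above.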

derekbruening commented 10 years ago

From zhao...@google.com on April 25, 2012 07:16:37

For comment 21, it seems that there is no way to tell it apart from a correct access. For example: char *p1 = malloc(3); char *p2 = malloc(4); ... You cannot tell that (p1 + 3) is an unaddressable error but (p2 + 3) is valid without looking up the malloc block.
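A minimal sketch of the block lookup that this implies, assuming a sorted table of allocation start addresses. `BlockTable` and the addresses are hypothetical; Dr. Memory's own allocation tracking is not shown here.

```python
import bisect


class BlockTable:
    """Sorted table of live allocations: start address -> size."""

    def __init__(self):
        self.starts = []
        self.sizes = {}

    def add(self, start, size):
        bisect.insort(self.starts, start)
        self.sizes[start] = size

    def is_addressable(self, addr):
        # Find the last block that starts at or before addr.
        i = bisect.bisect_right(self.starts, addr) - 1
        if i < 0:
            return False
        start = self.starts[i]
        return addr < start + self.sizes[start]


table = BlockTable()
p1, p2 = 0x1000, 0x2000   # made-up addresses for malloc(3) and malloc(4)
table.add(p1, 3)
table.add(p2, 4)

print(table.is_addressable(p1 + 3))   # False: one past the 3-byte block
print(table.is_addressable(p2 + 3))   # True: last byte of the 4-byte block
```

Only the recorded block sizes distinguish the two accesses, which is why the slow path cannot avoid the lookup in this case.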

derekbruening commented 10 years ago

From zhao...@google.com on May 09, 2012 13:47:18

On my laptop unit_tests performance after integration:

Native: [----------] Global test environment tear-down [==========] 4043 tests from 656 test cases ran. (197117 ms total) [ PASSED ] 4043 tests.

Shadow light: [----------] Global test environment tear-down [==========] 4036 tests from 656 test cases ran. (1013632 ms total) [ PASSED ] 4036 tests.

Pattern: [----------] Global test environment tear-down [==========] 4036 tests from 656 test cases ran. (785914 ms total) [ PASSED ] 4036 tests.

derekbruening commented 10 years ago

From zhao...@google.com on May 15, 2012 19:52:57

h264ref's overhead comes from reps instruction execution:
ITIMER distribution (595340):
0.0% of time in INTERPRETER (99)
0.0% of time in DISPATCH (2)
6.6% of time in INDIRECT BRANCH LOOKUP (39138)
93.4% of time in FRAGMENT CACHE (556043)
0.0% of time in UNKNOWN (58)

pc=0x4babff57 #=6120 in fragment @0x080aad36 w/ offs 0x00000007
pc=0x4babff5d #=618 in fragment @0x080aad36 w/ offs 0x0000000d
pc=0x4babff61 #=28126 in fragment @0x080aad36 w/ offs 0x00000011
pc=0x4babff67 #=54332 in fragment @0x080aad36 w/ offs 0x00000017
pc=0x4babff6d #=4157 in fragment @0x080aad36 w/ offs 0x0000001d
pc=0x4babff7b #=6211 in fragment @0x080aad36 w/ offs 0x0000002b
pc=0x4babff82 #=202 in fragment @0x080aad36 w/ offs 0x00000032
pc=0x4babff86 #=6028 in fragment @0x080aad36 w/ offs 0x00000036
pc=0x4babff91 #=6184 in fragment @0x080aad36 w/ offs 0x00000041
pc=0x4babff92 #=6040 in fragment @0x080aad36 w/ offs 0x00000042
pc=0x4babff98 #=8262 in fragment @0x080aad36 w/ offs 0x00000048
pc=0x4babff9e #=4233 in fragment @0x080aad36 w/ offs 0x0000004e
pc=0x4babffa0 #=4108 in fragment @0x080aad36 w/ offs 0x00000050
pc=0x4babffa1 #=4133 in fragment @0x080aad36 w/ offs 0x00000051
pc=0x4babffa7 #=7063 in fragment @0x080aad36 w/ offs 0x00000057
pc=0x4babffab #=12224 in fragment @0x080aad36 w/ offs 0x0000005b

0x080aad36 f3 a5 rep movs %ds:(%esi) %esi %edi %ecx -> %es:(%edi) %esi %edi %ecx

TAG 0x080aad36
+0 m4 @0x4f3b1e90 64 a3 6c 00 00 00 mov %eax -> %fs:0x0000006c
+6 m4 @0x4f3ab894 9f lahf -> %ah
+7 m4 @0x4f3ad34c 0f 90 c0 seto -> %al
+10 m4 @0x4f3b1fdc 64 a3 64 00 00 00 mov %eax -> %fs:0x00000064
+16 m4 @0x4f3b15f0 64 a1 6c 00 00 00 mov %fs:0x0000006c -> %eax
+22 m4 @0x4f3b1398 e3 fe jecxz @0x4f3af75c %ecx
+24 m4 @0x4f3ad8c8 eb fe jmp @0x4f3af9f4
+26 L4 @0x4f3af75c b9 01 00 00 00 mov $0x00000001 -> %ecx
+31 m4 @0x4f3b0cc4 e9 fb ff ff ff jmp @0x4f3aedf0
+36 m4 @0x4f3af9f4
+36 m4 @0x4f3afd64 3e 81 3e 21 43 21 43 cmp %ds:(%esi) $0x43214321
+43 m4 @0x4f3ab3e8 75 fe jnz @0x4f3b1bac
+45 m4 @0x4f3a74d0 0f 0b ud2a
+47 m4 @0x4f3b1bac
+47 m4 @0x4f3b1b2c 26 81 3f 21 43 21 43 cmp %es:(%edi) $0x43214321
+54 m4 @0x4f3ac4a4 75 fe jnz @0x4f3b19e0
+56 m4 @0x4f3ad4d8 0f 0b ud2a
+58 m4 @0x4f3b19e0
+58 L4 @0x4f3ab518 a5 movs %ds:(%esi) %esi %edi -> %es:(%edi) %esi %edi
+59 m4 @0x4f3aedf0
+59 m4 @0x4f3af504 64 a3 6c 00 00 00 mov %eax -> %fs:0x0000006c
+65 m4 @0x4f3adae0 64 a1 64 00 00 00 mov %fs:0x00000064 -> %eax
+71 m4 @0x4f3b1914 04 7f add $0x7f %al -> %al
+73 m4 @0x4f3ac6fc 9e sahf %ah
+74 m4 @0x4f3ac014 64 a1 6c 00 00 00 mov %fs:0x0000006c -> %eax
+80 L4 @0x4f3b128c e2 dc loop $0x080aad36 %ecx -> %ecx
END 0x080aad36

derekbruening commented 10 years ago

From zhao...@google.com on May 15, 2012 20:48:01

From the sampling, we can see that the code sequence
+10 m4 @0x4f3b1fdc 64 a3 64 00 00 00 mov %eax -> %fs:0x00000064
+16 m4 @0x4f3b15f0 64 a1 6c 00 00 00 mov %fs:0x0000006c -> %eax
is very slow:
pc=0x4babff61 #=28126 in fragment @0x080aad36 w/ offs 0x00000011
pc=0x4babff67 #=54332 in fragment @0x080aad36 w/ offs 0x00000017

We should avoid it as much as possible, but doing so would make the restore_state event handling more complex.

derekbruening commented 10 years ago

From zhao...@google.com on May 15, 2012 20:56:37

For 471.omnetpp, we should stop the expensive stack walking if we see too many similar error reports. For example, set a threshold for each type of error; if the number of such errors exceeds the threshold, do not collect a callstack and report only the current location.
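A sketch of that threshold idea, with hypothetical names throughout: `ErrorReporter`, the error-type strings, and the `walk_stack` callback are invented for illustration and do not correspond to Dr. Memory's reporting code.

```python
from collections import Counter


class ErrorReporter:
    """Stop walking the callstack once an error type has been seen too often."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()   # error type -> number of reports so far
        self.full_walks = 0

    def report(self, kind, pc, walk_stack):
        self.counts[kind] += 1
        if self.counts[kind] <= self.threshold:
            self.full_walks += 1
            return walk_stack()   # expensive: full callstack
        return [pc]               # cheap: current location only


reporter = ErrorReporter(threshold=2)
frames = [reporter.report("invalid heap argument", 0x1234,
                          lambda: [0x1234, 0x5678])
          for _ in range(3)]
print(frames[-1], reporter.full_walks)
```

With a per-type counter, a benchmark that raises the same kind of error hundreds of millions of times pays for the full walk only on the first few reports.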

derekbruening commented 10 years ago

From zhao...@google.com on May 16, 2012 22:32:45

From c#25 we can see that saving and restoring the app's eax around the aflags save/restore is very expensive. One simple optimization is to check whether there is any eax usage in the bb. If not, do not restore the app's eax value. By doing so, we can easily restore aflags and the app's eax value in the restore state event.
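The eax usage test could look roughly like the sketch below. The basic-block representation (a list of mnemonic and register-list pairs) is an assumption made for this example and is not the DynamoRIO instruction API; a real check would also have to account for implicit eax operands.

```python
# Registers that overlap eax; touching any of them counts as eax usage.
EAX_ALIASES = {"eax", "ax", "ah", "al"}


def bb_uses_eax(bb):
    """bb is a list of (mnemonic, registers_touched) pairs (hypothetical form)."""
    return any(EAX_ALIASES & set(regs) for _mnemonic, regs in bb)


# The rep movs loop body from the earlier dump touches esi, edi and ecx only.
copy_loop = [("movs", ["esi", "edi"]), ("loop", ["ecx"])]
# A made-up block that reads or writes eax through a sub-register.
uses_al = [("mov", ["ebx", "ecx"]), ("add", ["al"])]

print(bb_uses_eax(copy_loop))   # False: app eax can stay spilled for the whole bb
print(bb_uses_eax(uses_al))     # True: app eax must be restored before its use
```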

derekbruening commented 10 years ago

From zhao...@google.com on May 29, 2012 19:49:14

for 471.omnetpp ref input: 471.omnetpp 8.08 ( 2585 / 320), 8x slowdown to native:

Error #1: UNADDRESSABLE ACCESS: reading 0x084c8d60-0x084c8d70 16 byte(s) within 0x084c8d60-0x084c8d70

0 libc.so.6!_mm_load_si128 [/usr/lib/gcc/x86_64-linux-gnu/4.4.3/include/emmintrin.h:679]

1 omnetpp_base.gcc43-32bit!cPar::setFromText(char const, char)

2 omnetpp_base.gcc43-32bit!cPar::read()

3 omnetpp_base.gcc43-32bit!largeNet::setupNetwork()

4 omnetpp_base.gcc43-32bit!cSimulation::setupNetwork(cNetworkType, int)

5 omnetpp_base.gcc43-32bit!TCmdenvApp::run()

6 omnetpp_base.gcc43-32bit!main

7 libc.so.6!__libc_start_main [/build/buildd/eglibc-2.11.1/csu/libc-start.c:226]

8 omnetpp_base.gcc43-32bit!_start

Note: elapsed time = 0:00:00.152 in thread 4955
Note: instruction: movdqa (%eax) -> %xmm0

ERRORS FOUND:
114 unique, 17113 total unaddressable access(es)
940 unique, 216820653 total invalid heap argument(s)
0 unique, 0 total warning(s)
ERRORS IGNORED:

see issue #901 , we might want to replace strspn with a simple implementation of strspn.

derekbruening commented 10 years ago

From zhao...@google.com on June 11, 2012 08:52:44

update on performance:

pattern vs light:

spec2k6cmp CINT2006.ia32.drm.pattern-opt-nomismatch-0x20.ref.txt CINT2006.ia32.drm.light-no-mismatch.ref.txt
400.perlbench 0.88 ( 3088 / 3504)
401.bzip2 1.21 ( 2148 / 1773)
403.gcc 0.68 ( 1015 / 1486)
429.mcf 0.70 ( 347 / 495)
445.gobmk 0.84 ( 1599 / 1914)
456.hmmer 0.67 ( 2281 / 3427)
458.sjeng 0.70 ( 1669 / 2390)
462.libquantum 0.58 ( 1219 / 2104)
464.h264ref 0.47 ( 3898 / 8311)
471.omnetpp 0.95 ( 2031 / 2145)
473.astar 1.01 ( 1219 / 1211)
483.xalancbmk 0.85 ( 1735 / 2048)
average 0.79

spec2k6cmp CFP2006.ia32.drm.pattern-opt-nomismatch-0x20.ref.txt CFP2006.ia32.drm.light-no-mismatch.ref.txt
410.bwaves 0.68 ( 798 / 1167)
416.gamess 0.63 ( 2224 / 3520)
433.milc 0.68 ( 701 / 1026)
434.zeusmp 0.56 ( 930 / 1663)
435.gromacs 0.86 ( 1137 / 1323)
436.cactusADM 0.95 ( 1474 / 1549)
437.leslie3d 0.50 ( 1278 / 2555)
444.namd 0.67 ( 814 / 1222)
447.dealII 0.78 ( 1790 / 2289)
450.soplex 0.69 ( 689 / 1004)
453.povray 0.68 ( 849 / 1250)
454.calculix 0.65 ( 1935 / 3000)
459.GemsFDTD 0.53 ( 1282 / 2408)
465.tonto 0.84 ( 2936 / 3476)
470.lbm 0.73 ( 554 / 764)
481.wrf 0.76 ( 2972 / 3885)
482.sphinx3 0.66 ( 1200 / 1831)
average 0.70

Pattern vs Native

spec2k6cmp CINT2006.ia32.drm.pattern-opt-nomismatch-0x20.ref.txt CINT2006.ia32.native.ref.txt
400.perlbench 7.64 ( 3088 / 404)
401.bzip2 3.02 ( 2148 / 711)
403.gcc 2.99 ( 1015 / 339)
429.mcf 1.23 ( 347 / 283)
445.gobmk 3.08 ( 1599 / 520)
456.hmmer 2.33 ( 2281 / 979)
458.sjeng 2.75 ( 1669 / 606)
462.libquantum 1.52 ( 1219 / 801)
464.h264ref 4.49 ( 3898 / 869)
471.omnetpp 6.35 ( 2031 / 320)
473.astar 2.29 ( 1219 / 533)
483.xalancbmk 6.62 ( 1735 / 262)
average 3.69

11:43|zhaoqin@zhaoqin:~/Benchmarks/spec2k6/SPEC_CPU2006v1.2/result

spec2k6cmp CFP2006.ia32.drm.pattern-opt-nomismatch-0x20.ref.txt CFP2006.ia32.native.ref.txt
410.bwaves 1.34 ( 798 / 597)
416.gamess 1.94 ( 2224 / 1144)
433.milc 1.40 ( 701 / 500)
434.zeusmp 1.46 ( 930 / 636)
435.gromacs 1.19 ( 1137 / 957)
436.cactusADM 1.24 ( 1474 / 1186)
437.leslie3d 1.23 ( 1278 / 1039)
444.namd 1.31 ( 814 / 620)
447.dealII 3.57 ( 1790 / 502)
450.soplex 2.23 ( 689 / 309)
453.povray 2.92 ( 849 / 291)
454.calculix 1.56 ( 1935 / 1244)
459.GemsFDTD 1.22 ( 1282 / 1054)
465.tonto 4.03 ( 2936 / 729)
470.lbm 1.26 ( 554 / 438)
481.wrf 2.84 ( 2972 / 1045)
482.sphinx3 1.99 ( 1200 / 604)
average 1.93

the slow ones are 400.perlbench, 471.omnetpp, 483.xalancbmk, 447.dealII, 465.tonto, which are all memory allocation intensive ones.

derekbruening commented 10 years ago

From bruen...@google.com on June 11, 2012 09:00:04

what are the #s with -replace_malloc?

derekbruening commented 10 years ago

From zhao...@google.com on June 11, 2012 09:01:57

no, it is just wrapping malloc.

derekbruening commented 10 years ago

From zhao...@google.com on June 12, 2012 08:39:52

The performance improvement on Chrome is small:

[----------] Global test environment tear-down [==========] 4036 tests from 656 test cases ran. (747455 ms total) [ PASSED ] 4035 tests. [ FAILED ] 1 test, listed below: [ FAILED ] HistoryQuickProviderTest.VisitCountMatches

derekbruening commented 10 years ago

From zhao...@google.com on June 12, 2012 17:47:23

On my Windows desktop:

shadow light:
[----------] Global test environment tear-down
[==========] 4306 tests from 685 test cases ran. (1210092 ms total)
[ PASSED ] 4306 tests.

pattern:
[----------] Global test environment tear-down
[==========] 4306 tests from 685 test cases ran. (915119 ms total)
[ PASSED ] 4303 tests.

native:
[----------] Global test environment tear-down
[==========] 4313 tests from 685 test cases ran. (248993 ms total)
[ PASSED ] 4313 tests.
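To put the three totals side by side, a small Python sketch of the ratios follows. The comparison is only approximate, since the native run executed 4313 tests while the two Dr. Memory runs executed 4306; the variable names are illustrative.

```python
# Total wall-clock times in milliseconds, copied from the gtest summaries above.
total_ms = {
    "shadow light": 1210092,
    "pattern": 915119,
    "native": 248993,
}

def ratio(numerator: str, denominator: str) -> float:
    """Return the time of one configuration relative to another."""
    return total_ms[numerator] / total_ms[denominator]

pattern_vs_light = ratio("pattern", "shadow light")
pattern_vs_native = ratio("pattern", "native")
light_vs_native = ratio("shadow light", "native")

print(f"pattern / shadow light: {pattern_vs_light:.2f}")
print(f"pattern / native:       {pattern_vs_native:.2f}")
print(f"shadow light / native:  {light_vs_native:.2f}")
```

On these numbers pattern mode takes roughly three quarters of the shadow-light time, while both remain several times slower than native.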

derekbruening commented 10 years ago

From bruen...@google.com on May 14, 2013 17:09:24

pattern mode with -replace_malloc vs pattern wrap, native, and light replace:

spec2k6cmpave namedres/x86.drmem.pattern_replace namedres/x86.drmem.pattern
400.perlbench   0.68 ( 2113 / 3105)
401.bzip2       1.01 ( 2299 / 2286)
403.gcc         1.05 ( 1049 / 1002)
429.mcf         1.01 (  356 /  354)
445.gobmk       0.96 ( 1636 / 1706)
456.hmmer       1.00 ( 1387 / 1391)
458.sjeng       0.96 ( 1663 / 1728)
462.libquantum  1.01 (  938 /  931)
464.h264ref     0.84 ( 4815 / 5736)
471.omnetpp     0.64 ( 1168 / 1812)
473.astar       0.98 ( 1093 / 1114)
483.xalancbmk   0.77 ( 1340 / 1750)
410.bwaves      1.00 (  767 /  765)
416.gamess      0.99 ( 2092 / 2108)
433.milc        1.02 (  767 /  750)
434.zeusmp      1.00 (  817 /  815)
435.gromacs     1.00 ( 1107 / 1109)
436.cactusADM   1.02 ( 1390 / 1368)
437.leslie3d    1.00 (  753 /  752)
444.namd        1.00 (  732 /  733)
447.dealII      0.85 ( 1542 / 1807)
450.soplex      0.98 (  570 /  584)
453.povray      0.97 (  772 /  792)
454.calculix    1.02 ( 1844 / 1812)
459.GemsFDTD    1.00 (  757 /  758)
465.tonto       0.81 ( 2315 / 2855)
470.lbm         0.99 (  551 /  555)
481.wrf         0.86 ( 1846 / 2137)
482.sphinx3     1.03 ( 1107 / 1079)

spec2k6cmpave namedres/x86.drmem.pattern_replace namedres/x86.native/
400.perlbench   5.47 ( 2113 /  386)
401.bzip2       3.44 ( 2299 /  668)
403.gcc         3.04 ( 1049 /  345)
429.mcf         1.30 (  356 /  274)
445.gobmk       3.23 ( 1636 /  506)
456.hmmer       2.40 ( 1387 /  579)
458.sjeng       2.78 ( 1663 /  598)
462.libquantum  1.37 (  938 /  686)
464.h264ref     6.14 ( 4815 /  784)
471.omnetpp     3.91 ( 1168 /  299)
473.astar       2.11 ( 1093 /  517)
483.xalancbmk   5.19 ( 1340 /  258)
410.bwaves      1.37 (  767 /  558)
416.gamess      2.01 ( 2092 / 1041)
433.milc        1.53 (  767 /  502)
434.zeusmp      1.40 (  817 /  583)
435.gromacs     1.16 ( 1107 /  952)
436.cactusADM   1.24 ( 1390 / 1117)
437.leslie3d    1.37 (  753 /  550)
444.namd        1.36 (  732 /  540)
447.dealII      3.05 ( 1542 /  506)
450.soplex      1.99 (  570 /  286)
453.povray      2.86 (  772 /  270)
454.calculix    1.75 ( 1844 / 1056)
459.GemsFDTD    1.47 (  757 /  515)
465.tonto       3.40 ( 2315 /  681)
470.lbm         1.52 (  551 /  362)
481.wrf         2.03 ( 1846 /  909)
482.sphinx3     2.09 ( 1107 /  529)

spec2k6cmpave namedres/x86.drmem.pattern_replace namedres/x86.drmem.light_replace
400.perlbench   0.86 ( 2113 / 2467)
401.bzip2       1.20 ( 2299 / 1923)
403.gcc         0.80 ( 1049 / 1306)
429.mcf         0.77 (  356 /  463)
445.gobmk       0.88 ( 1636 / 1863)
456.hmmer       0.59 ( 1387 / 2355)
458.sjeng       0.69 ( 1663 / 2401)
462.libquantum  0.59 (  938 / 1580)
464.h264ref     0.58 ( 4815 / 8337)
471.omnetpp     0.58 ( 1168 / 2013)
473.astar       0.93 ( 1093 / 1181)
483.xalancbmk   0.84 ( 1340 / 1600)
410.bwaves      0.65 (  767 / 1189)
416.gamess      0.65 ( 2092 / 3243)
433.milc        0.76 (  767 / 1013)
434.zeusmp      0.64 (  817 / 1279)
435.gromacs     0.88 ( 1107 / 1258)
436.cactusADM   0.96 ( 1390 / 1448)
437.leslie3d    0.65 (  753 / 1162)
444.namd        0.66 (  732 / 1115)
447.dealII      0.69 ( 1542 / 2251)
450.soplex      0.66 (  570 /  866)
453.povray      0.71 (  772 / 1090)
454.calculix    0.65 ( 1844 / 2830)
459.GemsFDTD    0.68 (  757 / 1114)
465.tonto       0.97 ( 2315 / 2378)
470.lbm         0.79 (  551 /  700)
481.wrf         0.74 ( 1846 / 2484)
482.sphinx3     0.63 ( 1107 / 1760)


pattern-based unaddressable-only mode #750

0 libc.so.6!_mm_load_si128 [/usr/lib/gcc/x86_64-linux-gnu/4.4.3/include/emmintrin.h:679]

1 omnetpp_base.gcc43-32bit!cPar::setFromText(char const, char)

2 omnetpp_base.gcc43-32bit!cPar::read()

3 omnetpp_base.gcc43-32bit!largeNet::setupNetwork()

4 omnetpp_base.gcc43-32bit!cSimulation::setupNetwork(cNetworkType, int)

5 omnetpp_base.gcc43-32bit!TCmdenvApp::run()

6 omnetpp_base.gcc43-32bit!main

7 libc.so.6!__libc_start_main [/build/buildd/eglibc-2.11.1/csu/libc-start.c:226]

8 omnetpp_base.gcc43-32bit!_start