Open derekbruening opened 10 years ago
From bruen...@google.com on January 17, 2012 08:26:54
see issue #151 comment 2 and comment 4 for some numbers to try and beat on unit_tests
From zhao...@google.com on January 17, 2012 11:50:30
Preliminary results:
|----------------+--------+------+------------+-------+-----------+------+-----------|
| Benchmarks | Native | DR | DR/Na | bbcnt | bbc/Na | DrM | DrM/Na |
|----------------+--------+------+------------+-------+-----------+------+-----------|
| 400.perlbench | 404 | 609 | 1.5074257 | 1230 | 3.0445545 | | |
| 401.bzip2 | 711 | 751 | 1.0562588 | 1131 | 1.5907173 | 2047 | 2.8790436 |
| 403.gcc | 339 | 459 | 1.3539823 | 816 | 2.4070796 | 950 | 2.8023599 |
| 429.mcf | 283 | 286 | 1.0106007 | 322 | 1.1378092 | 371 | 1.3109541 |
| 445.gobmk | 520 | 740 | 1.4230769 | 962 | 1.85 | 1241 | 2.3865385 |
| 456.hmmer | 979 | 966 | 0.98672114 | 1301 | 1.3289070 | 2204 | 2.2512768 |
| 458.sjeng | 606 | 801 | 1.3217822 | 1442 | 2.3795380 | 1383 | 2.2821782 |
| 462.libquantum | 801 | 806 | 1.0062422 | 997 | 1.2446941 | 1178 | 1.4706617 |
| 464.h264ref | 869 | 1063 | 1.2232451 | 1482 | 1.7054085 | 2694 | 3.1001151 |
| 471.omnetpp | 320 | 373 | 1.165625 | 704 | 2.2 | 1795 | 5.609375 |
| 473.astar | 533 | 541 | 1.0150094 | 670 | 1.2570356 | 1141 | 2.1407129 |
| 483.xalancbmk | 262 | 325 | 1.2404580 | 781 | 2.9809160 | 1373 | 5.2404580 |
|----------------+--------+------+------------+-------+-----------+------+-----------|
| 410.bwaves | 597 | 599 | 1.0033501 | 655 | 1.0971524 | 792 | 1.3266332 |
| 416.gamess | 1144 | 1176 | 1.0279720 | 1747 | 1.5270979 | 3128 | 2.7342657 |
| 433.milc | 500 | 516 | 1.032 | 607 | 1.214 | 1005 | 2.01 |
| 434.zeusmp | 636 | 638 | 1.0031447 | 1193 | 1.8757862 | 1321 | 2.0770440 |
| 435.gromacs | 957 | 966 | 1.0094044 | 1034 | 1.0804598 | 1768 | 1.8474399 |
| 436.cactusADM | 1186 | 1194 | 1.0067454 | 1218 | 1.0269815 | 2570 | 2.1669477 |
| 437.leslie3d | 1039 | 1043 | 1.0038499 | 1102 | 1.0606352 | 1285 | 1.2367661 |
| 444.namd | 620 | 620 | 1 | 697 | 1.1241935 | 1431 | 2.3080645 |
| 447.dealII | 502 | 507 | 1.0099602 | 1040 | 2.0717131 | 1755 | 3.4960159 |
| 450.soplex | 309 | 324 | 1.0485437 | 639 | 2.0679612 | 836 | 2.7055016 |
| 453.povray | 291 | 334 | 1.1477663 | 623 | 2.1408935 | 1040 | 3.5738832 |
| 454.calculix | 1244 | 1238 | 0.99517685 | 1549 | 1.2451768 | | |
| 459.GemsFDTD | 1054 | 1072 | 1.0170778 | 1114 | 1.0569260 | | |
| 465.tonto | 729 | 794 | 1.0891632 | 1182 | 1.6213992 | | |
| 470.lbm | 438 | 443 | 1.0114155 | 471 | 1.0753425 | 2337 | 5.3356164 |
| 481.wrf | 1045 | 1101 | 1.0535885 | 1485 | 1.4210526 | 3102 | 2.9684211 |
| 482.sphinx3 | 604 | 602 | 0.99668874 | 872 | 1.4437086 | 1174 | 1.9437086 |
|----------------+--------+------+------------+-------+-----------+------+-----------|
| Average | | | 1.0953888 | | 1.6302462 | | 2.6881593 |
|----------------+--------+------+------------+-------+-----------+------+-----------|
From zhao...@google.com on January 18, 2012 07:18:36
Performance of original shadow memory based approach (nouninit,noleak):
|----------------+--------+------+------------+------+-----------|
| Benchmarks | Native | DR | DR/Na | DrM | DrM/Na |
|----------------+--------+------+------------+------+-----------|
| 400.perlbench | 404 | 609 | 1.5074257 | 3473 | 8.5965347 |
| 401.bzip2 | 711 | 751 | 1.0562588 | 1774 | 2.4950774 |
| 403.gcc | 339 | 459 | 1.3539823 | 1485 | 4.3805310 |
| 429.mcf | 283 | 286 | 1.0106007 | 493 | 1.7420495 |
| 445.gobmk | 520 | 740 | 1.4230769 | 1930 | 3.7115385 |
| 456.hmmer | 979 | 966 | 0.98672114 | 3421 | 3.4943820 |
| 458.sjeng | 606 | 801 | 1.3217822 | 2417 | 3.9884488 |
| 462.libquantum | 801 | 806 | 1.0062422 | 2106 | 2.6292135 |
| 464.h264ref | 869 | 1063 | 1.2232451 | 8317 | 9.5707710 |
| 471.omnetpp | 320 | 373 | 1.165625 | 2743 | 8.571875 |
| 473.astar | 533 | 541 | 1.0150094 | 1214 | 2.2776735 |
| 483.xalancbmk | 262 | 325 | 1.2404580 | 2014 | 7.6870229 |
|----------------+--------+------+------------+------+-----------|
| 410.bwaves | 597 | 599 | 1.0033501 | 1169 | 1.9581240 |
| 416.gamess | 1144 | 1176 | 1.0279720 | 3505 | 3.0638112 |
| 433.milc | 500 | 516 | 1.032 | 1030 | 2.06 |
| 434.zeusmp | 636 | 638 | 1.0031447 | 1662 | 2.6132075 |
| 435.gromacs | 957 | 966 | 1.0094044 | 1320 | 1.3793103 |
| 436.cactusADM | 1186 | 1194 | 1.0067454 | 1540 | 1.2984823 |
| 437.leslie3d | 1039 | 1043 | 1.0038499 | 2555 | 2.4590953 |
| 444.namd | 620 | 620 | 1 | 1201 | 1.9370968 |
| 447.dealII | 502 | 507 | 1.0099602 | 2259 | 4.5 |
| 450.soplex | 309 | 324 | 1.0485437 | 1008 | 3.2621359 |
| 453.povray | 291 | 334 | 1.1477663 | 1237 | 4.2508591 |
| 454.calculix | 1244 | 1238 | 0.99517685 | 3042 | 2.4453376 |
| 459.GemsFDTD | 1054 | 1072 | 1.0170778 | 2368 | 2.2466793 |
| 465.tonto | 729 | 794 | 1.0891632 | 3335 | 4.5747599 |
| 470.lbm | 438 | 443 | 1.0114155 | 767 | 1.7511416 |
| 481.wrf | 1045 | 1101 | 1.0535885 | 3864 | 3.6976077 |
| 482.sphinx3 | 604 | 602 | 0.99668874 | 1834 | 3.0364238 |
|----------------+--------+------+------------+------+-----------|
| Average | | | 1.0953888 | | 3.6441100 |
|----------------+--------+------+------------+------+-----------|
From bruen...@google.com on January 18, 2012 07:31:45
compare to issue #394 comment 7
btw, can you eliminate all extra digits so the #s are easier to read in these columns
From bradc...@google.com on January 25, 2012 09:58:18
The writeup in the first entry of this bug is very helpful; about the right level of detail and such. Thanks!
Would it be possible to use more explicit labels for the columns? It's a little bit difficult for me to figure out what's being compared.
From zhao...@google.com on January 30, 2012 08:22:19
Relative slowdown without leak detection or redzone:
11:19|zhaoqin@zhaoqin:~/Benchmarks/spec2k6/SPEC_CPU2006v1.2
spec2k6cmp ./result/CINT2006.024.ref.txt ./result/CINT2006.ia32.native.ref.txt
400.perlbench 3.97 ( 1603 / 404)
401.bzip2 2.72 ( 1936 / 711)
403.gcc 3.55 ( 1202 / 339)
429.mcf 1.34 ( 378 / 283)
445.gobmk 2.70 ( 1405 / 520)
456.hmmer 2.30 ( 2252 / 979)
458.sjeng 2.87 ( 1737 / 606)
462.libquantum 1.53 ( 1224 / 801)
464.h264ref 6.69 ( 5812 / 869)
471.omnetpp 3.13 ( 1003 / 320)
473.astar 2.32 ( 1236 / 533)
483.xalancbmk 4.74 ( 1243 / 262)
average 3.16
11:19|zhaoqin@zhaoqin:~/Benchmarks/spec2k6/SPEC_CPU2006v1.2
spec2k6cmp ./result/CFP2006.024.ref.txt ./result/CFP2006.ia32.native.ref.txt
410.bwaves 1.36 ( 809 / 597)
416.gamess 2.83 ( 3242 / 1144)
433.milc 2.07 ( 1034 / 500)
434.zeusmp 2.11 ( 1345 / 636)
435.gromacs 1.87 ( 1792 / 957)
436.cactusADM 2.17 ( 2569 / 1186)
437.leslie3d 1.23 ( 1282 / 1039)
444.namd 2.32 ( 1440 / 620)
447.dealII 3.44 ( 1726 / 502)
450.soplex 2.83 ( 876 / 309)
453.povray 4.30 ( 1251 / 291)
454.calculix 1.64 ( 2036 / 1244)
459.GemsFDTD 1.25 ( 1313 / 1054)
465.tonto 1.75 ( 1278 / 729)
470.lbm 5.33 ( 2336 / 438)
481.wrf 1.59 ( 1660 / 1045)
482.sphinx3 1.98 ( 1196 / 604)
average 2.36
Overall average: 2.69
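The overall figure appears consistent with a benchmark-count-weighted mean of the two suite averages (12 CINT and 17 CFP benchmarks). A small sketch of that arithmetic, assuming this is how the number was derived; the dictionary layout is only for illustration:

```python
# Benchmark counts and suite averages copied from the two spec2k6cmp runs above.
# Assumption: the overall average weights each suite by its number of benchmarks.
suites = {"CINT2006": (12, 3.16), "CFP2006": (17, 2.36)}

total_benchmarks = sum(count for count, _ in suites.values())
overall = sum(count * avg for count, avg in suites.values()) / total_benchmarks

print(f"overall average over {total_benchmarks} benchmarks: {overall:.2f}")
```

A plain mean of the two suite averages would give 2.76 instead, so the weighted form is the one that matches the reported value.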
From zhao...@google.com on February 02, 2012 19:02:49
There are many sub-tasks for implementing pattern mode, including
Maybe we should split the issue into several issues.
From bruen...@google.com on February 02, 2012 19:14:44
most of the items on your list are simply the necessary parts of getting the feature to work at all. that's what this issue covers: the end-to-end implementation of the feature.
optional augmentation such as additional optimizations beyond the original impl, or perhaps some larger cleanup or refactoring that's delayed, or other postponed or separable work that ends up not being part of the original implementation, should be filed separately so we don't forget about it.
From zhao...@google.com on February 14, 2012 12:35:07
Some performance evaluation: for 400.perlbench with test input from spec2k6
|----------------------------------------------------------------------+------|
| native | 3.7 |
|----------------------------------------------------------------------+------|
| DynamoRIO (with trace) | 14.8 |
|----------------------------------------------------------------------+------|
| Dr.Memory (shadow mode) | |
|----------------------------------------------------------------------+------|
| full | 190 |
| -no_count_leaks | 162 |
| -no_check_uninitialized | 107 |
| -no_check_uninitialized -no_count_leaks | 78.0 |
| -leaks_only -no_count_leaks -no_zero_stack | 22.6 |
| -perturb_only | SEGV |
| -leaks_only -no_count_leaks -no_zero_stack -no_track_allocs | 13.1 |
|----------------------------------------------------------------------+------|
| Dr.Memory (pattern mode) (redzone 16) | |
|----------------------------------------------------------------------+------|
| -pattern 0xb12f | 59.6 |
| -pattern 0xb12f -no_count_leaks | 34.3 |
| -pattern 0xb12f -no_count_leaks (no malloc rb-tree) | 33.1 |
| -pattern 0xb12f -no_count_leaks -no_track_allocs -no_replace_realloc | 16.7 |
|----------------------------------------------------------------------+------|
| no instrumentation (pattern_opnd_need_check return false) | |
|----------------------------------------------------------------------+------|
| -pattern 0xb12f -no_count_leaks | 27.3 |
| -pattern 0xb12f -no_count_leaks -no_track_allocs -no_replace_realloc | 13.0 |
|----------------------------------------------------------------------+------|
| tune redzone (no instrumentation, no malloc rb-tree) | |
| -pattern 0xb12f -no_count_leaks | |
|----------------------------------------------------------------------+------|
| 0x8 | 27.6 |
| 0x10 | 27.4 |
| 0x20 | 27.5 |
| 0x40 | 27.6 |
|----------------------------------------------------------------------+------|
It looks like the malloc tracking is the major overhead. For malloc overhead, also see issue #460.
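To make the conclusion concrete, the pattern-mode rows of the table can be differenced to get a rough cost attribution. This is only a back-of-the-envelope sketch: it assumes the costs are additive, and note that the "no alloc tracking" row disables both -no_track_allocs and -no_replace_realloc, so the two cannot be separated here.

```python
# Times copied from the pattern-mode rows of the table above (units as reported there).
with_everything = 34.3      # -pattern 0xb12f -no_count_leaks
no_instrumentation = 27.3   # same options, pattern_opnd_need_check returns false
no_alloc_tracking = 16.7    # same options plus -no_track_allocs -no_replace_realloc

# Assumption: overheads are roughly additive, so each difference isolates one cost.
instrumentation_cost = with_everything - no_instrumentation
alloc_tracking_cost = with_everything - no_alloc_tracking

print(f"instrumentation: ~{instrumentation_cost:.1f}")
print(f"alloc tracking:  ~{alloc_tracking_cost:.1f}")
```

Under that assumption, allocation tracking costs about 2.5 times as much as the inserted checks for this benchmark.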
From zhao...@google.com on February 14, 2012 15:26:16
Two more lines of data:
|----------------------------------------------------------------------+------|
| DR -disable_traces -bb_single_restore_prefix -max_bb_instrs 256 | 9.73 |
|----------------------------------------------------------------------+------|
| Dr.Memory (pattern mode) (redzone 16) | |
|----------------------------------------------------------------------+------|
| -pattern 0xb12f -no_count_leaks -no_replace_realloc (no rb-tree) | 23.1 |
|----------------------------------------------------------------------+------|
From the above, we know that for 400.perlbench with test input
alloc tracking is the major overhead for perlbench.
From zhao...@google.com on April 10, 2012 10:51:01
Current performance:
pattern vs unaddr:
400.perlbench 0.99 ( 3442 / 3460)
401.bzip2 1.51 ( 2669 / 1771)
403.gcc 0.74 ( 1102 / 1495)
429.mcf 0.78 ( 380 / 488)
445.gobmk 0.89 ( 1697 / 1917)
456.hmmer 0.67 ( 2292 / 3434)
458.sjeng 0.75 ( 1803 / 2408)
462.libquantum 0.61 ( 1277 / 2106)
464.h264ref 0.70 ( 5809 / 8326)
471.omnetpp 1.17 ( 3139 / 2690)
473.astar 1.11 ( 1328 / 1200)
483.xalancbmk 1.05 ( 2114 / 2005)
average 0.91

410.bwaves 0.70 ( 817 / 1171)
416.gamess 0.93 ( 3291 / 3520)
433.milc 1.00 ( 1030 / 1030)
434.zeusmp 0.82 ( 1361 / 1667)
435.gromacs 1.36 ( 1799 / 1320)
436.cactusADM 1.68 ( 2597 / 1545)
437.leslie3d 0.50 ( 1284 / 2553)
444.namd 1.21 ( 1452 / 1201)
447.dealII 0.97 ( 2262 / 2340)
450.soplex 0.93 ( 941 / 1008)
453.povray 1.06 ( 1324 / 1247)
454.calculix 0.67 ( 1994 / 2965)
459.GemsFDTD 0.56 ( 1325 / 2385)
465.tonto 1.05 ( 3597 / 3431)
470.lbm 3.02 ( 2296 / 761)
481.wrf 0.86 ( 3451 / 4026)
482.sphinx3 0.69 ( 1263 / 1830)
average 1.06
Pattern vs Native:
400.perlbench 8.52 ( 3442 / 404)
401.bzip2 3.75 ( 2669 / 711)
403.gcc 3.25 ( 1102 / 339)
429.mcf 1.34 ( 380 / 283)
445.gobmk 3.26 ( 1697 / 520)
456.hmmer 2.34 ( 2292 / 979)
458.sjeng 2.98 ( 1803 / 606)
462.libquantum 1.59 ( 1277 / 801)
464.h264ref 6.68 ( 5809 / 869)
471.omnetpp 9.81 ( 3139 / 320)
473.astar 2.49 ( 1328 / 533)
483.xalancbmk 8.07 ( 2114 / 262)
average 4.51

410.bwaves 1.37 ( 817 / 597)
416.gamess 2.88 ( 3291 / 1144)
433.milc 2.06 ( 1030 / 500)
434.zeusmp 2.14 ( 1361 / 636)
435.gromacs 1.88 ( 1799 / 957)
436.cactusADM 2.19 ( 2597 / 1186)
437.leslie3d 1.24 ( 1284 / 1039)
444.namd 2.34 ( 1452 / 620)
447.dealII 4.51 ( 2262 / 502)
450.soplex 3.05 ( 941 / 309)
453.povray 4.55 ( 1324 / 291)
454.calculix 1.60 ( 1994 / 1244)
459.GemsFDTD 1.26 ( 1325 / 1054)
465.tonto 4.93 ( 3597 / 729)
470.lbm 5.24 ( 2296 / 438)
481.wrf 3.30 ( 3451 / 1045)
482.sphinx3 2.09 ( 1263 / 604)
average 2.74
From zhao...@google.com on April 10, 2012 11:14:41
performance test from chromium unit_tests:
Native: [==========] 3885 tests from 632 test cases ran. (236891 ms total) [ PASSED ] 3885 tests.
unaddr: [==========] 3878 tests from 632 test cases ran. (922217 ms total) [ PASSED ] 3878 tests.
pattern: [==========] 3878 tests from 632 test cases ran. (911384 ms total) [ PASSED ] 3878 tests.
From zhao...@google.com on April 12, 2012 08:14:28
After aflags context switch optimization:
Pattern V.S. Shadow Light
spec2k6cmp CINT2006.ia32.drm.pattern.ref.txt CINT2006.ia32.drm.light.ref.txt
400.perlbench 0.98 ( 3381 / 3460)
401.bzip2 1.56 ( 2757 / 1771)
403.gcc 0.71 ( 1066 / 1495)
429.mcf 0.75 ( 366 / 488)
445.gobmk 0.88 ( 1683 / 1917)
456.hmmer 0.66 ( 2278 / 3434)
458.sjeng 0.70 ( 1694 / 2408)
462.libquantum 0.59 ( 1241 / 2106)
464.h264ref 0.68 ( 5639 / 8326)
471.omnetpp 1.12 ( 3013 / 2690)
473.astar 1.06 ( 1277 / 1200)
483.xalancbmk 1.03 ( 2074 / 2005)
average 0.89
11:11|zhaoqin@zhaoqin:~/Benchmarks/spec2k6/SPEC_CPU2006v1.2/result
spec2k6cmp CFP2006.ia32.drm.pattern.ref.txt CFP2006.ia32.drm.light.ref.txt
410.bwaves 0.69 ( 810 / 1171)
416.gamess 0.68 ( 2388 / 3520)
433.milc 0.73 ( 747 / 1030)
434.zeusmp 0.59 ( 986 / 1667)
435.gromacs 0.86 ( 1140 / 1320)
436.cactusADM 0.97 ( 1503 / 1545)
437.leslie3d 0.50 ( 1282 / 2553)
444.namd 0.67 ( 810 / 1201)
447.dealII 0.89 ( 2090 / 2340)
450.soplex 0.71 ( 719 / 1008)
453.povray 0.75 ( 934 / 1247)
454.calculix 0.67 ( 2001 / 2965)
459.GemsFDTD 0.55 ( 1310 / 2385)
465.tonto 1.04 ( 3553 / 3431)
470.lbm 0.95 ( 721 / 761)
481.wrf 0.83 ( 3350 / 4026)
482.sphinx3 0.67 ( 1218 / 1830)
average 0.75
Pattern V.S. Native
spec2k6cmp CINT2006.ia32.drm.pattern.ref.txt CINT2006.ia32.native.ref.txt
400.perlbench 8.37 ( 3381 / 404)
401.bzip2 3.88 ( 2757 / 711)
403.gcc 3.14 ( 1066 / 339)
429.mcf 1.29 ( 366 / 283)
445.gobmk 3.24 ( 1683 / 520)
456.hmmer 2.33 ( 2278 / 979)
458.sjeng 2.80 ( 1694 / 606)
462.libquantum 1.55 ( 1241 / 801)
464.h264ref 6.49 ( 5639 / 869)
471.omnetpp 9.42 ( 3013 / 320)
473.astar 2.40 ( 1277 / 533)
483.xalancbmk 7.92 ( 2074 / 262)
average 4.40
11:13|zhaoqin@zhaoqin:~/Benchmarks/spec2k6/SPEC_CPU2006v1.2/result
spec2k6cmp CFP2006.ia32.drm.pattern.ref.txt CFP2006.ia32.native.ref.txt
410.bwaves 1.36 ( 810 / 597)
416.gamess 2.09 ( 2388 / 1144)
433.milc 1.49 ( 747 / 500)
434.zeusmp 1.55 ( 986 / 636)
435.gromacs 1.19 ( 1140 / 957)
436.cactusADM 1.27 ( 1503 / 1186)
437.leslie3d 1.23 ( 1282 / 1039)
444.namd 1.31 ( 810 / 620)
447.dealII 4.16 ( 2090 / 502)
450.soplex 2.33 ( 719 / 309)
453.povray 3.21 ( 934 / 291)
454.calculix 1.61 ( 2001 / 1244)
459.GemsFDTD 1.24 ( 1310 / 1054)
465.tonto 4.87 ( 3553 / 729)
470.lbm 1.65 ( 721 / 438)
481.wrf 3.21 ( 3350 / 1045)
482.sphinx3 2.02 ( 1218 / 604)
average 2.11
From zhao...@google.com on April 12, 2012 09:18:15
Summary of the pattern mode performance status:
Compared to native: pattern mode has about a 4.40x slowdown on SPECINT and a 2.11x slowdown on SPECFP.
Compared to shadow light mode: pattern mode runs in 0.89x the time on SPECINT and 0.75x the time on SPECFP.
Known performance issues:
400.perlbench: malloc wrapping and management is the major overhead; instrumentation has little impact.
401.bzip2: the compression algorithm makes it hard to pick a good 2-byte pattern value. It also performs many single-byte accesses, for which pattern mode inserts two checks (one for the normal and one for the reversed pattern value), causing a significant slowdown.
Benchmarks to be investigated:
Possible optimization
From bruen...@google.com on April 12, 2012 09:23:11
malloc interception performance improvement is issue #460
From rnk@google.com on April 12, 2012 09:26:14
For 1-byte accesses, maybe the slowdown is coming from unaligned 2-byte accesses. Perhaps instead we should back-align the address and see if that makes it faster.
From zhao...@google.com on April 12, 2012 09:31:21
Back-aligning requires stealing more registers and adding more instrumentation; I am not sure the benefit of alignment would offset the extra overhead. In the bzip2 case, it is clear that the fault path is executed many more times when the reversed pattern value check is enabled on single-byte accesses, which is why the slowdown is 2.9x in C2 but 3.9x in C13.
From zhao...@google.com on April 20, 2012 10:53:45
Malloc Intensive Benchmarks:
400.perlbench
447.dealII
465.tonto
471.omnetpp
483.xalancbmk
From zhao...@google.com on April 20, 2012 11:06:05
471.omnetpp: callstack is_retaddr: 216825270, backdecode: 216823047, unreadable: 0
471.omnetpp performs a lot of callstack walks, which come from new/delete mismatch reports; xref issue #862
From zhao...@google.com on April 20, 2012 16:41:40
for unit_test on Windows:
Native: [----------] Global test environment tear-down [==========] 3885 tests from 632 test cases ran. (236891 ms total) [ PASSED ] 3885 tests.
YOU HAVE 73 DISABLED TESTS
Pattern: [----------] Global test environment tear-down [==========] 3878 tests from 632 test cases ran. (836474 ms total) [ PASSED ] 3878 tests.
YOU HAVE 73 DISABLED TESTS
Shadow light: [----------] Global test environment tear-down [==========] 3878 tests from 632 test cases ran. (922217 ms total) [ PASSED ] 3878 tests.
YOU HAVE 73 DISABLED TESTS
Pattern mode is 3.5x native. The pattern mode used here does not include the aflags optimization, only the removal of the rb-tree. We should be able to achieve ~3x.
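For reference, the slowdown factors follow directly from the gtest wall-clock totals reported in this comment, taking the 236891 ms run as the native baseline. A short sketch of the arithmetic (the variable names are just for illustration):

```python
# Wall-clock totals in milliseconds, copied from the unit_tests runs above.
native_ms = 236891
runs_ms = {"pattern": 836474, "shadow light": 922217}

# Slowdown relative to native is simply the ratio of total run times.
slowdown = {name: ms / native_ms for name, ms in runs_ms.items()}

for name, ratio in slowdown.items():
    print(f"{name}: {ratio:.2f}x native")
```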
From zhao...@google.com on April 23, 2012 09:08:59
There is a performance problem with using the reversed pattern value on single-byte accesses. Assume the pattern value is 0x4321, which is stored in memory as 0x21, 0x43, so the reversed value is 0x43, 0x21. Suppose an app allocates a 5-byte block whose last byte happens to be 0x43, and the following bytes in the redzone are 0x21, 0x43, 0x21, .... The reverse check will then trigger the ud2a. Even worse, the expensive walk will happen.
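The scenario can be reproduced with a small model. This is a simplified sketch, not Dr. Memory's actual instrumentation: `fast_path_hits` is a hypothetical stand-in for the inline 2-byte compare, and the redzone is assumed to be filled starting with 0x21 right after the block, as described above.

```python
import struct

PATTERN = 0x4321
PATTERN_BYTES = struct.pack("<H", PATTERN)   # in-memory order: 0x21, 0x43
REVERSED_BYTES = PATTERN_BYTES[::-1]         # in-memory order: 0x43, 0x21


def fast_path_hits(memory: bytes, addr: int) -> bool:
    """Model of the inline check for a 1-byte access: compare the two bytes
    at addr against the pattern in both byte orders."""
    word = memory[addr:addr + 2]
    return word == PATTERN_BYTES or word == REVERSED_BYTES


# A 5-byte app block whose last byte happens to be 0x43, followed by a redzone.
block = bytes([0x00, 0x01, 0x02, 0x03, 0x43])
redzone = PATTERN_BYTES * 8
memory = block + redzone

false_hit = fast_path_hits(memory, 4)   # valid access to the last app byte
true_hit = fast_path_hits(memory, 5)    # real access into the redzone
print(false_hit, true_hit)
```

Both accesses match on the fast path, so the valid access at offset 4 also pays for the fault and the slow-path lookup before it can be dismissed.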
From zhao...@google.com on April 25, 2012 07:16:37
For comment 21, it seems that there is no way to tell such an access apart from a valid one. For example: char *p1 = malloc(3); char *p2 = malloc(4); ... You cannot tell that (p1 + 3) is an unaddressable error but (p2 + 3) is valid without looking up the malloc block.
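A minimal sketch of why the lookup is unavoidable: the address arithmetic is identical for both pointers, and only the recorded block size distinguishes them. The heap layout, `make_heap`, and `is_unaddressable` below are hypothetical illustrations; a real implementation would use the tool's malloc table rather than a linear scan.

```python
REDZONE = 16  # assumed redzone size, matching the "redzone 16" runs in this thread


def make_heap(sizes):
    """Lay out blocks back to back, each followed by a redzone.
    Returns a list of (base, size) pairs standing in for the malloc table."""
    blocks = []
    cursor = REDZONE
    for size in sizes:
        blocks.append((cursor, size))
        cursor += size + REDZONE
    return blocks


def is_unaddressable(blocks, addr):
    """Slow-path lookup: an address is valid only if it falls inside a block."""
    return not any(base <= addr < base + size for base, size in blocks)


blocks = make_heap([3, 4])          # p1 = malloc(3); p2 = malloc(4)
(p1, _), (p2, _) = blocks

p1_plus_3_bad = is_unaddressable(blocks, p1 + 3)   # one past the end of p1
p2_plus_3_bad = is_unaddressable(blocks, p2 + 3)   # last valid byte of p2
print(p1_plus_3_bad, p2_plus_3_bad)
```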
From zhao...@google.com on May 09, 2012 13:47:18
On my laptop unit_tests performance after integration:
Native: [----------] Global test environment tear-down [==========] 4043 tests from 656 test cases ran. (197117 ms total) [ PASSED ] 4043 tests.
Shadow light: [----------] Global test environment tear-down [==========] 4036 tests from 656 test cases ran. (1013632 ms total) [ PASSED ] 4036 tests.
Pattern: [----------] Global test environment tear-down [==========] 4036 tests from 656 test cases ran. (785914 ms total) [ PASSED ] 4036 tests.
From zhao...@google.com on May 15, 2012 19:52:57
h264ref's overhead comes from reps instruction execution:
ITIMER distribution (595340):
0.0% of time in INTERPRETER (99)
0.0% of time in DISPATCH (2)
6.6% of time in INDIRECT BRANCH LOOKUP (39138)
93.4% of time in FRAGMENT CACHE (556043)
0.0% of time in UNKNOWN (58)

pc=0x4babff57 #=6120 in fragment @0x080aad36 w/ offs 0x00000007
pc=0x4babff5d #=618 in fragment @0x080aad36 w/ offs 0x0000000d
pc=0x4babff61 #=28126 in fragment @0x080aad36 w/ offs 0x00000011
pc=0x4babff67 #=54332 in fragment @0x080aad36 w/ offs 0x00000017
pc=0x4babff6d #=4157 in fragment @0x080aad36 w/ offs 0x0000001d
pc=0x4babff7b #=6211 in fragment @0x080aad36 w/ offs 0x0000002b
pc=0x4babff82 #=202 in fragment @0x080aad36 w/ offs 0x00000032
pc=0x4babff86 #=6028 in fragment @0x080aad36 w/ offs 0x00000036
pc=0x4babff91 #=6184 in fragment @0x080aad36 w/ offs 0x00000041
pc=0x4babff92 #=6040 in fragment @0x080aad36 w/ offs 0x00000042
pc=0x4babff98 #=8262 in fragment @0x080aad36 w/ offs 0x00000048
pc=0x4babff9e #=4233 in fragment @0x080aad36 w/ offs 0x0000004e
pc=0x4babffa0 #=4108 in fragment @0x080aad36 w/ offs 0x00000050
pc=0x4babffa1 #=4133 in fragment @0x080aad36 w/ offs 0x00000051
pc=0x4babffa7 #=7063 in fragment @0x080aad36 w/ offs 0x00000057
pc=0x4babffab #=12224 in fragment @0x080aad36 w/ offs 0x0000005b

0x080aad36 f3 a5 rep movs %ds:(%esi) %esi %edi %ecx -> %es:(%edi) %esi %edi %ecx

TAG 0x080aad36
+0 m4 @0x4f3b1e90 64 a3 6c 00 00 00 mov %eax -> %fs:0x0000006c
+6 m4 @0x4f3ab894 9f lahf -> %ah
+7 m4 @0x4f3ad34c 0f 90 c0 seto -> %al
+10 m4 @0x4f3b1fdc 64 a3 64 00 00 00 mov %eax -> %fs:0x00000064
+16 m4 @0x4f3b15f0 64 a1 6c 00 00 00 mov %fs:0x0000006c -> %eax
+22 m4 @0x4f3b1398 e3 fe jecxz @0x4f3af75c %ecx
+24 m4 @0x4f3ad8c8 eb fe jmp @0x4f3af9f4
+26 L4 @0x4f3af75c b9 01 00 00 00 mov $0x00000001 -> %ecx
+31 m4 @0x4f3b0cc4 e9 fb ff ff ff jmp @0x4f3aedf0
+36 m4 @0x4f3af9f4
From zhao...@google.com on May 15, 2012 20:48:01
From the sampling, we can see that the code sequence
+10 m4 @0x4f3b1fdc 64 a3 64 00 00 00 mov %eax -> %fs:0x00000064
+16 m4 @0x4f3b15f0 64 a1 6c 00 00 00 mov %fs:0x0000006c -> %eax
is very slow:
pc=0x4babff61 #=28126 in fragment @0x080aad36 w/ offs 0x00000011
pc=0x4babff67 #=54332 in fragment @0x080aad36 w/ offs 0x00000017
We should avoid it as much as possible, but it would make restore_state event complex.
From zhao...@google.com on May 15, 2012 20:56:37
For 471.omnetpp, we should stop the expensive stack walking if we see too many similar error reports. For example, set a threshold for each type of error; if the number of errors of that type exceeds the threshold, do not collect a callstack and report only the current location.
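The proposal could look roughly like the following. This is a hypothetical sketch of the idea only: `ErrorReporter`, its fields, and the threshold value are made up for illustration and are not Dr. Memory's reporting API.

```python
from collections import Counter


class ErrorReporter:
    """Walk the callstack only for the first `threshold` errors of each type;
    after that, report just the current location."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()

    def report(self, error_type, pc, walk_callstack):
        self.counts[error_type] += 1
        if self.counts[error_type] <= self.threshold:
            # Still under the per-type limit: pay for the full callstack walk.
            return {"type": error_type, "pc": pc, "callstack": walk_callstack()}
        # Over the limit: skip the walk and keep only the faulting location.
        return {"type": error_type, "pc": pc, "callstack": None}


walks = []


def fake_walk():
    """Stand-in for the expensive stack walk; records that it was called."""
    walks.append(1)
    return ["frame0", "frame1"]


reporter = ErrorReporter(threshold=2)
reports = [reporter.report("invalid heap argument", 0x1000 + i, fake_walk)
           for i in range(5)]
print(len(walks), reports[-1]["callstack"])
```

With a threshold of 2, five reports of the same type trigger only two walks; counts are kept per error type, so a different type still gets its own full callstacks.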
From zhao...@google.com on May 16, 2012 22:32:45
From c#25 we can see that saving and restoring the app's eax around the aflags save/restore is very expensive. One simple optimization for the aflags save/restore is to check whether eax is used anywhere in the bb. If not, do not restore the app's eax value. By doing so, we can easily restore aflags and the app's eax value in the restore state event.
From zhao...@google.com on May 29, 2012 19:49:14
for 471.omnetpp ref input: 471.omnetpp 8.08 ( 2585 / 320), 8x slowdown to native:
Error #1
: UNADDRESSABLE ACCESS: reading 0x084c8d60-0x084c8d70 16 byte(s) within 0x084c8d60-0x084c8d70
Note: elapsed time = 0:00:00.152 in thread 4955
Note: instruction: movdqa (%eax) -> %xmm0
ERRORS FOUND:
114 unique, 17113 total unaddressable access(es)
940 unique, 216820653 total invalid heap argument(s)
0 unique, 0 total warning(s)
ERRORS IGNORED:
see issue #901; we might want to replace strspn with a simple implementation of strspn.
From zhao...@google.com on June 11, 2012 08:52:44
update on performance:
pattern vs light:
spec2k6cmp CINT2006.ia32.drm.pattern-opt-nomismatch-0x20.ref.txt CINT2006.ia32.drm.light-no-mismatch.ref.txt
400.perlbench 0.88 ( 3088 / 3504)
401.bzip2 1.21 ( 2148 / 1773)
403.gcc 0.68 ( 1015 / 1486)
429.mcf 0.70 ( 347 / 495)
445.gobmk 0.84 ( 1599 / 1914)
456.hmmer 0.67 ( 2281 / 3427)
458.sjeng 0.70 ( 1669 / 2390)
462.libquantum 0.58 ( 1219 / 2104)
464.h264ref 0.47 ( 3898 / 8311)
471.omnetpp 0.95 ( 2031 / 2145)
473.astar 1.01 ( 1219 / 1211)
483.xalancbmk 0.85 ( 1735 / 2048)
average 0.79

spec2k6cmp CFP2006.ia32.drm.pattern-opt-nomismatch-0x20.ref.txt CFP2006.ia32.drm.light-no-mismatch.ref.txt
410.bwaves 0.68 ( 798 / 1167)
416.gamess 0.63 ( 2224 / 3520)
433.milc 0.68 ( 701 / 1026)
434.zeusmp 0.56 ( 930 / 1663)
435.gromacs 0.86 ( 1137 / 1323)
436.cactusADM 0.95 ( 1474 / 1549)
437.leslie3d 0.50 ( 1278 / 2555)
444.namd 0.67 ( 814 / 1222)
447.dealII 0.78 ( 1790 / 2289)
450.soplex 0.69 ( 689 / 1004)
453.povray 0.68 ( 849 / 1250)
454.calculix 0.65 ( 1935 / 3000)
459.GemsFDTD 0.53 ( 1282 / 2408)
465.tonto 0.84 ( 2936 / 3476)
470.lbm 0.73 ( 554 / 764)
481.wrf 0.76 ( 2972 / 3885)
482.sphinx3 0.66 ( 1200 / 1831)
average 0.70
Pattern vs Native
spec2k6cmp CINT2006.ia32.drm.pattern-opt-nomismatch-0x20.ref.txt CINT2006.ia32.native.ref.txt
400.perlbench 7.64 ( 3088 / 404)
401.bzip2 3.02 ( 2148 / 711)
403.gcc 2.99 ( 1015 / 339)
429.mcf 1.23 ( 347 / 283)
445.gobmk 3.08 ( 1599 / 520)
456.hmmer 2.33 ( 2281 / 979)
458.sjeng 2.75 ( 1669 / 606)
462.libquantum 1.52 ( 1219 / 801)
464.h264ref 4.49 ( 3898 / 869)
471.omnetpp 6.35 ( 2031 / 320)
473.astar 2.29 ( 1219 / 533)
483.xalancbmk 6.62 ( 1735 / 262)
average 3.69
11:43|zhaoqin@zhaoqin:~/Benchmarks/spec2k6/SPEC_CPU2006v1.2/result
spec2k6cmp CFP2006.ia32.drm.pattern-opt-nomismatch-0x20.ref.txt CFP2006.ia32.native.ref.txt
410.bwaves 1.34 ( 798 / 597)
416.gamess 1.94 ( 2224 / 1144)
433.milc 1.40 ( 701 / 500)
434.zeusmp 1.46 ( 930 / 636)
435.gromacs 1.19 ( 1137 / 957)
436.cactusADM 1.24 ( 1474 / 1186)
437.leslie3d 1.23 ( 1278 / 1039)
444.namd 1.31 ( 814 / 620)
447.dealII 3.57 ( 1790 / 502)
450.soplex 2.23 ( 689 / 309)
453.povray 2.92 ( 849 / 291)
454.calculix 1.56 ( 1935 / 1244)
459.GemsFDTD 1.22 ( 1282 / 1054)
465.tonto 4.03 ( 2936 / 729)
470.lbm 1.26 ( 554 / 438)
481.wrf 2.84 ( 2972 / 1045)
482.sphinx3 1.99 ( 1200 / 604)
average 1.93
The slow ones are 400.perlbench, 471.omnetpp, 483.xalancbmk, 447.dealII, and 465.tonto, which are all memory-allocation-intensive benchmarks.
From bruen...@google.com on June 11, 2012 09:00:04
what are the #s with -replace_malloc?
From zhao...@google.com on June 11, 2012 09:01:57
No, it is just wrapping malloc.
From zhao...@google.com on June 12, 2012 08:39:52
The performance improvement on Chrome is small:
[----------] Global test environment tear-down [==========] 4036 tests from 656 test cases ran. (747455 ms total) [ PASSED ] 4035 tests. [ FAILED ] 1 test, listed below: [ FAILED ] HistoryQuickProviderTest.VisitCountMatches
From zhao...@google.com on June 12, 2012 17:47:23
On my Windows desktop:
shadow light: [----------] Global test environment tear-down [==========] 4306 tests from 685 test cases ran. (1210092 ms total) [ PASSED ] 4306 tests.
pattern: [----------] Global test environment tear-down [==========] 4306 tests from 685 test cases ran. (915119 ms total) [ PASSED ] 4303 tests.
native: [----------] Global test environment tear-down [==========] 4313 tests from 685 test cases ran. (248993 ms total) [ PASSED ] 4313 tests.
From bruen...@google.com on May 14, 2013 17:09:24
pattern mode with -replace_malloc vs pattern wrap, native, and light replace:
spec2k6cmpave namedres/x86.drmem.pattern_replace namedres/x86.drmem.pattern
400.perlbench 0.68 ( 2113 / 3105)
401.bzip2 1.01 ( 2299 / 2286)
403.gcc 1.05 ( 1049 / 1002)
429.mcf 1.01 ( 356 / 354)
445.gobmk 0.96 ( 1636 / 1706)
456.hmmer 1.00 ( 1387 / 1391)
458.sjeng 0.96 ( 1663 / 1728)
462.libquantum 1.01 ( 938 / 931)
464.h264ref 0.84 ( 4815 / 5736)
471.omnetpp 0.64 ( 1168 / 1812)
473.astar 0.98 ( 1093 / 1114)
483.xalancbmk 0.77 ( 1340 / 1750)
410.bwaves 1.00 ( 767 / 765)
416.gamess 0.99 ( 2092 / 2108)
433.milc 1.02 ( 767 / 750)
434.zeusmp 1.00 ( 817 / 815)
435.gromacs 1.00 ( 1107 / 1109)
436.cactusADM 1.02 ( 1390 / 1368)
437.leslie3d 1.00 ( 753 / 752)
444.namd 1.00 ( 732 / 733)
447.dealII 0.85 ( 1542 / 1807)
450.soplex 0.98 ( 570 / 584)
453.povray 0.97 ( 772 / 792)
454.calculix 1.02 ( 1844 / 1812)
459.GemsFDTD 1.00 ( 757 / 758)
465.tonto 0.81 ( 2315 / 2855)
470.lbm 0.99 ( 551 / 555)
481.wrf 0.86 ( 1846 / 2137)
482.sphinx3 1.03 ( 1107 / 1079)

spec2k6cmpave namedres/x86.drmem.pattern_replace namedres/x86.native/
400.perlbench 5.47 ( 2113 / 386)
401.bzip2 3.44 ( 2299 / 668)
403.gcc 3.04 ( 1049 / 345)
429.mcf 1.30 ( 356 / 274)
445.gobmk 3.23 ( 1636 / 506)
456.hmmer 2.40 ( 1387 / 579)
458.sjeng 2.78 ( 1663 / 598)
462.libquantum 1.37 ( 938 / 686)
464.h264ref 6.14 ( 4815 / 784)
471.omnetpp 3.91 ( 1168 / 299)
473.astar 2.11 ( 1093 / 517)
483.xalancbmk 5.19 ( 1340 / 258)
410.bwaves 1.37 ( 767 / 558)
416.gamess 2.01 ( 2092 / 1041)
433.milc 1.53 ( 767 / 502)
434.zeusmp 1.40 ( 817 / 583)
435.gromacs 1.16 ( 1107 / 952)
436.cactusADM 1.24 ( 1390 / 1117)
437.leslie3d 1.37 ( 753 / 550)
444.namd 1.36 ( 732 / 540)
447.dealII 3.05 ( 1542 / 506)
450.soplex 1.99 ( 570 / 286)
453.povray 2.86 ( 772 / 270)
454.calculix 1.75 ( 1844 / 1056)
459.GemsFDTD 1.47 ( 757 / 515)
465.tonto 3.40 ( 2315 / 681)
470.lbm 1.52 ( 551 / 362)
481.wrf 2.03 ( 1846 / 909)
482.sphinx3 2.09 ( 1107 / 529)

spec2k6cmpave namedres/x86.drmem.pattern_replace namedres/x86.drmem.light_replace
400.perlbench 0.86 ( 2113 / 2467)
401.bzip2 1.20 ( 2299 / 1923)
403.gcc 0.80 ( 1049 / 1306)
429.mcf 0.77 ( 356 / 463)
445.gobmk 0.88 ( 1636 / 1863)
456.hmmer 0.59 ( 1387 / 2355)
458.sjeng 0.69 ( 1663 / 2401)
462.libquantum 0.59 ( 938 / 1580)
464.h264ref 0.58 ( 4815 / 8337)
471.omnetpp 0.58 ( 1168 / 2013)
473.astar 0.93 ( 1093 / 1181)
483.xalancbmk 0.84 ( 1340 / 1600)
410.bwaves 0.65 ( 767 / 1189)
416.gamess 0.65 ( 2092 / 3243)
433.milc 0.76 ( 767 / 1013)
434.zeusmp 0.64 ( 817 / 1279)
435.gromacs 0.88 ( 1107 / 1258)
436.cactusADM 0.96 ( 1390 / 1448)
437.leslie3d 0.65 ( 753 / 1162)
444.namd 0.66 ( 732 / 1115)
447.dealII 0.69 ( 1542 / 2251)
450.soplex 0.66 ( 570 / 866)
453.povray 0.71 ( 772 / 1090)
454.calculix 0.65 ( 1844 / 2830)
459.GemsFDTD 0.68 ( 757 / 1114)
465.tonto 0.97 ( 2315 / 2378)
470.lbm 0.79 ( 551 / 700)
481.wrf 0.74 ( 1846 / 2484)
482.sphinx3 0.63 ( 1107 / 1760)
From bruen...@google.com on January 12, 2012 15:18:14
What is the problem to solve? Why is it important? Provide some context for those unfamiliar with the details of the system.
We would like a faster "light mode" that truly focuses on only detecting unaddressable errors.

What are the possible approaches to solving the problem?
Currently we're using shadow memory. We plan to use a pattern-based approach instead. Xref the 1992 Purify paper and other references to using patterns rather than shadow memory. Patterns are difficult to use for detecting uninitialized reads where we want to delay reporting, but for unaddressable errors they're a natural fit as we can put the patterns into the redzones and report immediately.

Which approach is being taken and why?
There are tradeoffs in the pattern size and detection of unaligned references, where larger patterns have fewer false matches but require either giving up unaligned detection or using a "medium path". Initially we will likely abandon detection of certain unaligned references. This is, after all, a "light mode", and finding extreme corner cases is not worth a performance impact on finding common cases for "light mode".

Any interesting details or challenges of the implementation?
By performing a direct compare, we only need to preserve aflags and do not need to spill any other registers than eax for aflags. By using a fault on the slowpath, we have no table or non-local jump or call, making all our instrumentation PIC and thus making it easy to persist.
We may try to adaptively change the actual pattern used based on frequencies observed in the application.
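One way the adaptive choice could work is to count how often each candidate 2-byte value already occurs in sampled application data and pick the rarest one. The sketch below is purely illustrative: `pick_pattern`, the sampling scheme, and the candidate list are assumptions, not part of the planned implementation.

```python
from collections import Counter


def pick_pattern(sample: bytes, candidates):
    """Return the candidate 2-byte value that occurs least often in the sample.
    Every byte offset is counted (little-endian), so unaligned matches count too."""
    counts = Counter(
        int.from_bytes(sample[i:i + 2], "little") for i in range(len(sample) - 1)
    )
    # Counter yields 0 for values never seen; ties go to the earliest candidate.
    return min(candidates, key=lambda value: counts[value])


# Hypothetical sample in which 0xb12f and 0x0000 are both common in app data.
sample = bytes([0x2F, 0xB1] * 4 + [0x00] * 4)
chosen = pick_pattern(sample, [0xB12F, 0x0000, 0x4321])
print(hex(chosen))
```

Here 0x4321 never appears in the sample, so it is selected over the two values that would cause frequent false matches.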
Original issue: http://code.google.com/p/drmemory/issues/detail?id=750