Added a bash script and some data for benchmarking.
Used the benchmarks to see what worked; results are in benchmarks.tsv.
Summary, relative to gcc-8 -O3:
+5%, get rid of -funroll-loops
+0%, LTO
+10%, no -funroll-loops, LTO+FDO on base code
+15%, LTO+FDO using internal boost and eigen
-25%, clang-7
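For reference, a two-pass LTO+FDO build with GCC can be sketched roughly as below. This is a minimal sketch, not what run_benchmark.sh necessarily does: the compiler default, main.cc, app, and the argument-free training run are placeholders, and the run helper only prints commands unless DRY_RUN=0. The flags themselves (-flto, -fprofile-generate, -fprofile-use) are standard GCC options.

```shell
# Print each command; execute it only when DRY_RUN=0 (default: print only).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Two-pass LTO+FDO build. CXX, SRC and OUT are hypothetical placeholders.
fdo_build() {
    local cxx="${CXX:-g++-8}" src="${SRC:-main.cc}" out="${OUT:-app}"
    local obj="${src%.*}.o"
    # Pass 1: instrumented build; running it writes .gcda profile data.
    run "$cxx" -O3 -flto -fprofile-generate -c "$src" -o "$obj"
    run "$cxx" -O3 -flto -fprofile-generate "$obj" -o "$out"
    # Training run on a representative workload (placeholder: no arguments).
    run "./$out"
    # Pass 2: rebuild with the same object name so the profile is found.
    run "$cxx" -O3 -flto -fprofile-use -c "$src" -o "$obj"
    run "$cxx" -O3 -flto -fprofile-use "$obj" -o "$out"
}
```

Calling fdo_build prints the five steps; DRY_RUN=0 fdo_build would execute them, assuming the placeholder source file exists.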
AutoFDO did not work for me.
BOLT did not work for me.
valgrind does not work because of range errors in boost inv_erfc that need to be fixed first.
gperftools shows the top calls are:
 flat  flat%   sum%    cum   cum%  function
  460  32.6%  32.6%    543  38.4%  substitution::peel_internal_branch
  210  14.9%  47.4%    245  17.3%  DPmatrixConstrained::forward_cell
   38   2.7%  50.1%     47   3.3%  substitution::peel_internal_branch (inline)
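A profile like the one above is typically collected by running a binary linked against libprofiler with CPUPROFILE set, then summarizing with pprof --text. The sketch below only prints the two commands, since the binary and profile paths are placeholders and not taken from run_benchmark.sh; on some distributions pprof is installed as google-pprof.

```shell
# Print the two gperftools steps. BIN, PROF and PPROF are hypothetical
# placeholders; the binary is assumed to be linked with -lprofiler.
profile_cmds() {
    local bin="${BIN:-./app}" prof="${PROF:-/tmp/app.prof}"
    local pprof="${PPROF:-pprof}"
    # Step 1: run the program; the CPU profile is written to $prof.
    echo "CPUPROFILE=$prof $bin"
    # Step 2: text report, one line of sample counts per function.
    echo "$pprof --text $bin $prof"
}
```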
The run_benchmark.sh script lets one reproduce all of these results and more.