Yelp / MOE

A global, black box optimization engine for real world metric optimization.

Support multiple compilers #51

Open suntzu86 opened 10 years ago

suntzu86 commented 10 years ago

Users should be able to select which compiler to use at the manual-install level (I think this is already reasonably easy) and at the Docker level.

gcc, clang, icc are the main ones we'd care about, I think.

icc in particular is interesting because the resulting code is at least 2x faster than gcc's (even in long Monte Carlo loops).
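A minimal sketch of what manual-install compiler selection could look like, assuming MOE's C++ build honors the conventional CC/CXX environment variables (the configure invocation below is illustrative, not MOE's actual one):

```shell
# Illustrative only: pick a toolchain by exporting CC/CXX before configuring.
CC=gcc   CXX=g++     cmake /path/to/moe/cpp   # default GNU toolchain
CC=clang CXX=clang++ cmake /path/to/moe/cpp   # clang
CC=icc   CXX=icpc    cmake /path/to/moe/cpp   # Intel compiler
```

For Docker, the same variables could be passed through as build arguments so an image build picks the desired toolchain.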

suntzu86 commented 10 years ago

Some old-ish performance numbers from a Yelp ticket, just for reference:

5 initial starts, 5 restarts, 200 gradient descent steps, 100 MC iterations
dev28:
icc:
sampled point val = -4.719128032103397929E-01, best_so_far = -1.113240340698242736E+00

1.74user 0.01system 0:01.74elapsed 100%CPU (0avgtext+0avgdata 10144maxresident)k
0inputs+0outputs (0major+689minor)pagefaults 0swaps

1.72user 0.02system 0:01.73elapsed 100%CPU (0avgtext+0avgdata 10128maxresident)k
0inputs+0outputs (0major+688minor)pagefaults 0swaps

gcc:
sampled point val = -4.719128034266237837E-01, best_so_far = -1.113240338551733766E+00

2.93user 0.00system 0:02.92elapsed 100%CPU (0avgtext+0avgdata 7072maxresident)k
0inputs+0outputs (0major+488minor)pagefaults 0swaps

2.92user 0.00system 0:02.92elapsed 99%CPU (0avgtext+0avgdata 7056maxresident)k
0inputs+0outputs (0major+487minor)pagefaults 0swaps

eliu laptop:
sampled point val = -4.719128146996157125E-01, best_so_far = -1.113240577954911048E+00

./EPI 2>&1  1.68s user 0.00s system 99% cpu 1.682 total
tee logfileTEMP  0.00s user 0.00s system 0% cpu 1.685 total

./EPI 2>&1  1.68s user 0.00s system 99% cpu 1.692 total
tee logfileTEMP  0.00s user 0.00s system 0% cpu 1.693 total

5 initial starts, 5 restarts, 200 gradient descent steps, 10000 MC iterations
dev28:
icc:
sampled point val = -5.670208778433040164E-01, best_so_far = -1.237024956163517375E+00

187.41user 0.01system 3:07.72elapsed 99%CPU (0avgtext+0avgdata 10144maxresident)k
0inputs+0outputs (0major+689minor)pagefaults 0swaps

186.28user 0.00system 3:06.58elapsed 99%CPU (0avgtext+0avgdata 10144maxresident)k
0inputs+0outputs (0major+689minor)pagefaults 0swaps

gcc:
sampled point val = -5.670208778433040164E-01, best_so_far = -1.237024956163517375E+00

384.50user 0.00system 6:25.15elapsed 99%CPU (0avgtext+0avgdata 7072maxresident)k
496inputs+0outputs (2major+487minor)pagefaults 0swaps

381.38user 0.00system 6:22.00elapsed 99%CPU (0avgtext+0avgdata 7088maxresident)k
0inputs+0outputs (0major+489minor)pagefaults 0swaps

eliu laptop:
sampled point val = -5.670189259492350864E-01, best_so_far = -1.237024950383871280E+00

./EPI 2>&1  199.52s user 0.11s system 99% cpu 3:19.67 total
tee logfileTEMP  0.00s user 0.00s system 0% cpu 3:19.67 total

./EPI 2>&1  198.47s user 0.08s system 99% cpu 3:18.58 total
tee logfileTEMP  0.00s user 0.00s system 0% cpu 3:18.58 total
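For reference, the two timing formats above are consistent with two different measurement tools; a hedged reconstruction of the invocations (EPI is the benchmark binary shown in the logs):

```shell
# The dev28 lines ("1.74user 0.01system ... pagefaults ...") match the
# default output format of GNU time(1):
/usr/bin/time ./EPI 2>&1 | tee logfileTEMP

# The laptop lines report each pipeline stage separately, which matches
# zsh's built-in `time` applied to a pipeline:
time ( ./EPI 2>&1 | tee logfileTEMP )
```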

compiler options used:

CFLAGS = -std=gnu++11 -Wold-style-cast -Wnon-virtual-dtor -Wctor-dtor-privacy -Woverloaded-virtual -Wsign-promo -Wundef -Wshadow -Wcast-align -Wzero-as-null-pointer-constant -Wall -Wextra -g -O2 -fopenmp -march=native -fPIC -ffunction-sections -fdata-sections -fstrict-aliasing -falign-functions=16 -falign-jumps=16 -freorder-blocks -Wno-unused-variable -Wno-strict-aliasing -mpreferred-stack-boundary=4 -mfpmath=sse -ftree-vectorize -fomit-frame-pointer -fno-trapping-math -fno-signaling-nans -fno-math-errno
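These are GNU-style flags; a hypothetical compile line applying them (the source file name is invented for illustration, and the variable is abbreviated from the full CFLAGS above):

```shell
# icc accepts most of these GNU-style options unchanged, which is what
# made the comparison apples-to-apples across compilers.
CFLAGS="-std=gnu++11 -Wall -Wextra -g -O2 -fopenmp -march=native -fPIC"
g++ $CFLAGS -c gaussian_process.cpp -o gaussian_process.o
```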

software/hardware versions:

gcc 4.7.2
icc 13.0.1
boost 1.51, compiled with gcc
model name    : Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cache size    : 15360 KB

eliu laptop:

gcc: 4.7.2
boost: 1.53 (macports, compiled with gcc)
  Processor Name:    Intel Core i7
  Processor Speed:    2.3 GHz
  L2 Cache (per Core):    256 KB
  L3 Cache:    6 MB

Single-threaded performance on the MacBook should be markedly higher due to a newer CPU (Ivy Bridge vs. Sandy Bridge) and higher Turbo Boost clocks. This is reflected in the benchmark data.

For the small test case, icc appears to be about 40% faster than gcc. On the much larger test case (which is more characteristic of our use case), icc is a little over 50% faster. All runs used the same code and the same set of compiler options. This leaves headroom for icc to improve further, because it offers additional optimization options we have not yet enabled.

Also, the code does not currently link BLAS/LAPACK; when we do that, it will give a further edge to icc, since MKL is vastly superior to ATLAS or any other free BLAS.
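A hedged sketch of what the eventual link lines might look like (object and binary names, and the specific free BLAS, are assumptions for illustration, not the project's current setup):

```shell
# With icc, MKL can be pulled in via its convenience flag:
icpc main.o -o EPI -mkl
# With gcc, link a free BLAS/LAPACK (e.g. ATLAS or reference netlib) instead:
g++ main.o -o EPI -llapack -lblas
```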