Yelp / MOE

A global, black box optimization engine for real world metric optimization.

Support multiple compilers #51

Open suntzu86 opened 10 years ago

suntzu86 commented 10 years ago

Users should be able to select which compiler to use at the manual-install level (I think this is already reasonably easy) and at the Docker level.

gcc, clang, icc are the main ones we'd care about, I think.

icc in particular is interesting because the resulting code is at least 2x faster than gcc's (even in long Monte Carlo loops).
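A minimal sketch of what manual-install compiler selection could look like, assuming MOE's C++ build honors the conventional CC/CXX environment variables (the configure invocation below is illustrative, not MOE's actual one):

```shell
# Illustrative only: pick a toolchain by exporting CC/CXX before configuring.
CC=gcc   CXX=g++     cmake /path/to/moe/cpp   # default GNU toolchain
CC=clang CXX=clang++ cmake /path/to/moe/cpp   # clang
CC=icc   CXX=icpc    cmake /path/to/moe/cpp   # Intel compiler
```

For Docker, the same variables could be passed through as build arguments so an image build picks the desired toolchain.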

suntzu86 commented 10 years ago

Some old-ish performance numbers from a Yelp ticket, just for reference:

5 initial starts, 5 restarts, 200 gradient descent steps, 100 MC iterations
dev28:
icc:
sampled point val = -4.719128032103397929E-01, best_so_far = -1.113240340698242736E+00

1.74user 0.01system 0:01.74elapsed 100%CPU (0avgtext+0avgdata 10144maxresident)k
0inputs+0outputs (0major+689minor)pagefaults 0swaps

1.72user 0.02system 0:01.73elapsed 100%CPU (0avgtext+0avgdata 10128maxresident)k
0inputs+0outputs (0major+688minor)pagefaults 0swaps

gcc:
sampled point val = -4.719128034266237837E-01, best_so_far = -1.113240338551733766E+00

2.93user 0.00system 0:02.92elapsed 100%CPU (0avgtext+0avgdata 7072maxresident)k
0inputs+0outputs (0major+488minor)pagefaults 0swaps

2.92user 0.00system 0:02.92elapsed 99%CPU (0avgtext+0avgdata 7056maxresident)k
0inputs+0outputs (0major+487minor)pagefaults 0swaps

eliu laptop:
sampled point val = -4.719128146996157125E-01, best_so_far = -1.113240577954911048E+00

./EPI 2>&1  1.68s user 0.00s system 99% cpu 1.682 total
tee logfileTEMP  0.00s user 0.00s system 0% cpu 1.685 total

./EPI 2>&1  1.68s user 0.00s system 99% cpu 1.692 total
tee logfileTEMP  0.00s user 0.00s system 0% cpu 1.693 total

5 initial starts, 5 restarts, 200 gradient descent steps, 10000 MC iterations
dev28:
icc:
sampled point val = -5.670208778433040164E-01, best_so_far = -1.237024956163517375E+00

187.41user 0.01system 3:07.72elapsed 99%CPU (0avgtext+0avgdata 10144maxresident)k
0inputs+0outputs (0major+689minor)pagefaults 0swaps

186.28user 0.00system 3:06.58elapsed 99%CPU (0avgtext+0avgdata 10144maxresident)k
0inputs+0outputs (0major+689minor)pagefaults 0swaps

gcc:
sampled point val = -5.670208778433040164E-01, best_so_far = -1.237024956163517375E+00

384.50user 0.00system 6:25.15elapsed 99%CPU (0avgtext+0avgdata 7072maxresident)k
496inputs+0outputs (2major+487minor)pagefaults 0swaps

381.38user 0.00system 6:22.00elapsed 99%CPU (0avgtext+0avgdata 7088maxresident)k
0inputs+0outputs (0major+489minor)pagefaults 0swaps

eliu laptop:
sampled point val = -5.670189259492350864E-01, best_so_far = -1.237024950383871280E+00

./EPI 2>&1  199.52s user 0.11s system 99% cpu 3:19.67 total
tee logfileTEMP  0.00s user 0.00s system 0% cpu 3:19.67 total

./EPI 2>&1  198.47s user 0.08s system 99% cpu 3:18.58 total
tee logfileTEMP  0.00s user 0.00s system 0% cpu 3:18.58 total
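For reference, the two timing formats above are consistent with two different measurement tools; a hedged reconstruction of the invocations (EPI is the benchmark binary shown in the logs):

```shell
# The dev28 lines ("1.74user 0.01system ... pagefaults ...") match the
# default output format of GNU time(1):
/usr/bin/time ./EPI 2>&1 | tee logfileTEMP

# The laptop lines report each pipeline stage separately, which matches
# zsh's built-in `time` applied to a pipeline:
time ( ./EPI 2>&1 | tee logfileTEMP )
```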

compiler options used:

CFLAGS = -std=gnu++11 -Wold-style-cast -Wnon-virtual-dtor -Wctor-dtor-privacy -Woverloaded-virtual -Wsign-promo -Wundef -Wshadow -Wcast-align -Wzero-as-null-pointer-constant -Wall -Wextra -g -O2 -fopenmp -march=native -fPIC -ffunction-sections -fdata-sections -fstrict-aliasing -falign-functions=16 -falign-jumps=16 -freorder-blocks -Wno-unused-variable -Wno-strict-aliasing -mpreferred-stack-boundary=4 -mfpmath=sse -ftree-vectorize -fomit-frame-pointer -fno-trapping-math -fno-signaling-nans -fno-math-errno
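These are GNU-style flags; a hypothetical compile line applying them (the source file name is invented for illustration, and the variable is abbreviated from the full CFLAGS above):

```shell
# icc accepts most of these GNU-style options unchanged, which is what
# made the comparison apples-to-apples across compilers.
CFLAGS="-std=gnu++11 -Wall -Wextra -g -O2 -fopenmp -march=native -fPIC"
g++ $CFLAGS -c gaussian_process.cpp -o gaussian_process.o
```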

software/hardware versions:

gcc 4.7.2
icc 13.0.1
boost 1.51, compiled with gcc
model name    : Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cache size    : 15360 KB

eliu laptop:

gcc: 4.7.2
boost: 1.53 (macports, compiled with gcc)
  Processor Name:    Intel Core i7
  Processor Speed:    2.3 GHz
  L2 Cache (per Core):    256 KB
  L3 Cache:    6 MB

Single-threaded performance on the MacBook should be markedly higher due to a newer CPU (Ivy Bridge vs. Sandy Bridge) and higher Turbo Boost clocks. This is reflected in the benchmark data.

For the small test case, icc appears to be about 40% faster than gcc. On the much larger test case (which is more characteristic of our use case), icc is a little over 50% faster. All runs used the same code and the same set of compiler options. This leaves headroom for icc to improve further, because it offers additional optimization options we have not yet enabled.

Also, the code does not currently link BLAS/LAPACK; when we do that, it will give a further edge to icc, since MKL is vastly superior to ATLAS or any other free BLAS.
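A hedged sketch of what the eventual link lines might look like (object and binary names, and the specific free BLAS, are assumptions for illustration, not the project's current setup):

```shell
# With icc, MKL can be pulled in via its convenience flag:
icpc main.o -o EPI -mkl
# With gcc, link a free BLAS/LAPACK (e.g. ATLAS or reference netlib) instead:
g++ main.o -o EPI -llapack -lblas
```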