OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Julia + OpenBLAS vs. MATLAB + MKL - Matrix Operations Benchmark #1090

Open RoyiAvital opened 7 years ago

RoyiAvital commented 7 years ago

Hi,

I did some tests with MATLAB and Julia:

Matlab & Julia Matrix Operations Benchmark

I think they (at least to some extent) reflect OpenBLAS vs. Intel MKL, so the results might be worth knowing for the developers.

See also here:

Benchmark MATLAB & Julia for Matrix Operations

Thank You.

martin-frbg commented 7 years ago

Thanks for the pointer. Unfortunately it is not quite clear what you are testing here if you are pitting two "teams" against each other: how much of the difference in efficiency comes from each component? At first glance, the divergence of the graphs after a matrix size of ~1000 primarily suggests that MATLAB+MKL is using threading to its advantage while Julia+OpenBLAS is not. Whether that is due to limitations in Julia or in OpenBLAS (how did you build either, and did you check if, and with how many, threads they run?) is at best unclear, though inferior performance compared to MKL in some functions or circumstances has been noted in the past, e.g. #530 and #532. I suspect it would be more instructive to run benchmarks on isolated BLAS/LAPACK functions with both MKL and OpenBLAS first; unfortunately, none of the current developers appears to have access to MKL.
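A minimal sketch of what such an isolated benchmark could look like (hypothetical code, not any benchmark discussed in this thread; it times only the matrix multiply, which both libraries route through dgemm):

```python
# Hypothetical sketch: time one BLAS call (matrix multiply -> dgemm) in
# isolation, so BLAS performance is not mixed with language overhead.
import time
import numpy as np

def time_matmul(n, repeats=5):
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    a @ b  # warm-up: exclude one-time setup from the measurement
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        times.append(time.perf_counter() - t0)
    best = min(times)
    # an n x n matmul performs roughly 2*n^3 floating-point operations
    return best, 2.0 * n ** 3 / best / 1e9

for n in (256, 512, 1024):
    secs, gflops = time_matmul(n)
    print(f"n={n:5d}: {secs:.4f} s  ({gflops:.1f} GFLOP/s)")
```

Running the same script once against a NumPy linked to OpenBLAS and once against one linked to MKL would compare the two libraries on equal footing.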

RoyiAvital commented 7 years ago

@martin-frbg, no pitting at all.

I just thought to share the data in case it helps the developers: seeing the numbers might show where to invest effort.

You raise an interesting point about multithreading. I rechecked, and it seems all six of my cores are being utilized, so it is not that multithreading is disabled under Julia.

From what I can tell, the eigen, Cholesky, and SVD decompositions are a weak point of OpenBLAS compared to MKL. Are you aware of that?

Thank You.
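One way to double-check the threading question is to pin the thread count and rerun; a hedged sketch (OPENBLAS_NUM_THREADS is OpenBLAS's documented environment variable, and it must be set before the library is loaded):

```python
# Sketch: pin OpenBLAS to one thread before the BLAS library is loaded,
# then compare the timing against a default (multithreaded) run made in
# a separate process.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # must precede BLAS initialization

import time
import numpy as np

a = np.random.default_rng(1).standard_normal((512, 512))
a @ a  # warm-up
t0 = time.perf_counter()
c = a @ a
print(f"pinned to 1 thread: 512x512 matmul took {time.perf_counter() - t0:.4f} s")
```

If the pinned and default runs take about the same time past n ~ 1000, threading is not actually kicking in, regardless of what the core monitor shows.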

martin-frbg commented 7 years ago

I am certainly aware of #1077 (SVD; I suspect we will need a reduced test case for that one to investigate further), and I think we also have issues involving *syrk (used in Cholesky) on at least some platforms. There is certainly room for improvement...
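A hypothetical way to isolate the Cholesky path (LAPACK *potrf, which leans on *syrk internally) with a NumPy build linked against the BLAS under test:

```python
# Sketch: time np.linalg.cholesky on a symmetric positive-definite matrix;
# the work is done by LAPACK *potrf, which calls *syrk internally.
import time
import numpy as np

n = 800
rng = np.random.default_rng(0)
m = rng.standard_normal((n, n))
spd = m @ m.T + n * np.eye(n)   # guaranteed symmetric positive definite
np.linalg.cholesky(spd)         # warm-up
t0 = time.perf_counter()
l = np.linalg.cholesky(spd)
print(f"cholesky {n}x{n}: {time.perf_counter() - t0:.4f} s")
assert np.allclose(l @ l.T, spd)  # sanity check on the factorization
```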

brada4 commented 7 years ago

Can you produce graphs of Julia+MKL vs. Julia+OpenBLAS? They look so proportional that the difference could boil down to microtiming peculiarities in each.

RoyiAvital commented 7 years ago

@brada4, I wish I could. I don't have access to Julia + MKL (I'm on Windows, and I'm not going to hack my way to a Julia + MKL build).

brada4 commented 7 years ago

What about Windows Octave with OpenBLAS vs. MATLAB?

RoyiAvital commented 7 years ago

Here is a comparison of Julia + OpenBLAS vs. Julia + MKL on the same tests:

http://imgur.com/a/rBOo8

Those were made by:

https://github.com/JuliaLang/julia/issues/18374#issuecomment-278683562

Thank You.

brada4 commented 7 years ago

There is #843, merged post-0.2.19, which adds optimizing Fortran flags to LAPACK and should align the graphs better.

RoyiAvital commented 7 years ago

Another place to look (Julia + MKL vs. Julia + OpenBLAS):

https://discourse.julialang.org/t/benchmark-matlab-julia-for-matrix-operations/2000/92

https://github.com/barche/julia-blas-benchmarks/blob/master/BenchmarkResults.ipynb

Thank You.

brada4 commented 7 years ago

"Matrix generation" does not involve any BLAS; it just measures your libc RNG speed and malloc behavior at various times. Probably the fastest measurement is simply whichever ran first on the same system. "Reductions" shows a wrong threading threshold in OpenBLAS.

RoyiAvital commented 7 years ago

I found another test:

https://www.numbercrunch.de/blog/2016/03/boosting-numpy-with-mkl/

brada4 commented 7 years ago

It lacks any anchor to an OpenBLAS version. Obviously it cannot include the #843 fix, which was not in place at that time.

martin-frbg commented 7 years ago

Early March 2016 would mean 0.2.15, or at best 0.2.16rc1, but I guess the point is the availability of the benchmark code and the MKL result. (Not that much changed performance-wise for that one function on Haswell, I think. It might be interesting to see how much restoring the compiler optimization level for the LAPACK functions as per #843 actually buys us here, but I doubt it is enough to close the gap. I do not have an ultrabook Haswell as used for the test, however.) With regard to the "matrix generation" test mentioned above, there is no harm in having those numbers as well: at the very least they show that OpenBLAS is not doing something fundamentally wrong in the way it stores and handles matrices.

brada4 commented 7 years ago

From the numpy eigh documentation:

The eigenvalues/eigenvectors are computed using LAPACK routines _syevd, _heevd
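So the relevant LAPACK path is easy to time directly; a sketch (np.linalg.eigh is documented to dispatch to _syevd/_heevd, as quoted above):

```python
# Sketch: exercise the _syevd path via np.linalg.eigh on a random
# symmetric matrix, per the numpy documentation quoted above.
import time
import numpy as np

n = 500
rng = np.random.default_rng(0)
m = rng.standard_normal((n, n))
s = (m + m.T) / 2          # symmetrize: eigh assumes a symmetric input
np.linalg.eigh(s)          # warm-up
t0 = time.perf_counter()
w, v = np.linalg.eigh(s)
print(f"eigh {n}x{n}: {time.perf_counter() - t0:.4f} s")
assert np.allclose(v @ np.diag(w) @ v.T, s)  # reconstruct s as a sanity check
```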

martin-frbg commented 7 years ago

Well, that one is rather obvious. But no matter how silly that LAPACK handbrake looks in retrospect, I am not so optimistic as to assume that just bringing LAPACK back to its normal speed would let "our" optimized BLAS calls show enough gain to actually match the MKL data.

aminya commented 5 years ago

I updated this repository, adding benchmarks for Julia+OpenBLAS and Julia+Intel MKL: https://github.com/aminya/MatlabJuliaMatrixOperationsBenchmark

Julia+Intel MKL is faster than OpenBLAS64 most of the time.

brada4 commented 5 years ago

Not sure if the 100 Hz clock of MATLAB counts as bad performance; could you time more iterations to get past that?

aminya commented 5 years ago

Not sure if the 100 Hz clock of MATLAB counts as bad performance; could you time more iterations to get past that?

There is not much difference in most of the functions when running 50 iterations instead of 4 around timeit, though there are differences in some cases. I used a number of iterations around timeit, but timeit itself calls the function multiple times and returns the median; I then calculate the average of those returns.

For example, inside timeit for matrix inversion, it runs 11*100 iterations and the median is returned, and my 4 surrounding iterations average every 1100 iterations' median. If I raise the number of surrounding iterations to 50, that becomes something like 55,000 iterations, while I explicitly set Julia's sample count to 700 in the code.

In a real-world situation, nobody runs a function 200 times, so repeatability and stability of performance are also important.
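The scheme described above (an inner timer that returns the median of many runs, wrapped in an outer loop that averages those medians) can be sketched in Python; this is hypothetical code illustrating the structure, not the MATLAB/Julia benchmark itself:

```python
# Hypothetical sketch of the averaged-median scheme described above:
# the inner timer returns the median of several runs (like timeit),
# and the outer loop averages those medians.
import statistics
import time

def median_time(func, inner=11):
    runs = []
    for _ in range(inner):
        t0 = time.perf_counter()
        func()
        runs.append(time.perf_counter() - t0)
    return statistics.median(runs)

def averaged_median_time(func, outer=4, inner=11):
    func()  # warm start before any measurement
    return statistics.mean(median_time(func, inner) for _ in range(outer))

work = lambda: sum(i * i for i in range(10_000))
print(f"averaged median: {averaged_median_time(work):.6f} s")
```

The median inside damps outliers (GC pauses, OS scheduling), while the outer average reflects run-to-run stability.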

RoyiAvital commented 5 years ago

I think that, since MATLAB's function handles aren't as efficient as they should be, you shouldn't use the timeit() function.

This is the reason I didn't use it: it adds overhead. I think it is better to use the approach I used in the original test.

Update

Looking at the code of timeit(), they seem to try to calculate the overhead and remove it.

For more information:

For more accurate timing in MATLAB - High Accuracy Timer.

aminya commented 5 years ago

Update

Looking at the code of timeit(), they seem to try to calculate the overhead and remove it.

Yes, timeit() is much more accurate than a simple tic and toc. timeit() uses tic and toc internally, but it is smarter about producing a good benchmark, which is why it is the function MathWorks recommends for benchmarking. The situation for both languages is the same: we pass a function handle to a benchmarking tool, and it calculates the time spent running that function.

RoyiAvital commented 5 years ago

I'd still prefer direct use of tic() and toc(). The function timeit() is recommended because of the warm start and the use of the median.

I wouldn't use timeit() in MATLAB. I'd just do a warm start and measure either each iteration or a few iterations combined.

aminya commented 5 years ago

I'd still prefer direct use of tic() and toc(). The function timeit() is recommended because of the warm start and the use of the median.

I wouldn't use timeit() in MATLAB. I'd just do a warm start and measure either each iteration or a few iterations combined.

I don't understand the reason for this. What happens inside timeit() is exactly what I would do if I wanted to measure the timing myself, except that they have thought more than I have about its accuracy and the various aspects of their commercial code. https://www.mathworks.com/help/matlab/matlab_prog/measure-performance-of-your-program.html If you feel the result is biased, you can send me another matlabBench file so I can run the test with your code.