Closed by nsapy 9 years ago
I am assuming it has something to do with the number of threads and how they get set up. Could you do OPENBLAS_NUM_THREADS=4
or something in the environment and then try the Fortran code again?
Why is the numpy code doing a dot product instead of a matrix multiply?
dot(a, b, out=None)
Dot product of two arrays.
For 2-D arrays it is equivalent to matrix multiplication, and for 1-D
arrays to inner product of vectors (without complex conjugation).
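As the docstring quoted above says, for 2-D arrays `dot` is already matrix multiplication, so the benchmark is comparing like with like. A quick sanity check (the shapes and seed here are arbitrary, chosen just for the demonstration):

```python
import numpy as np

# Two small random square matrices.
rng = np.random.default_rng(0)
A = rng.random((4, 4))
B = rng.random((4, 4))

# For 2-D arrays, dot() is true matrix multiplication,
# identical to the @ operator -- not an elementwise product.
assert np.allclose(np.dot(A, B), A @ B)
```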
I have set the environment variable with
set OPENBLAS_NUM_THREADS=4
and then added a new line in my Fortran source code
Call openblas_set_num_threads(4)
but no noticeable difference occurs (I ran it twice):
Time for Multiplication: 0.296s Time for Multiplication: 0.312s
Are you linking to the same openblas that is distributed by julia?
See also: http://stackoverflow.com/questions/28450367/why-fortran-is-slow-in-the-julia-benchmark-rand-mat-mul. It seems like there may be some issue with using cpu_time
here.
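The cpu_time pitfall described in that Stack Overflow question is not specific to Fortran: a CPU-time clock sums time across all threads, so with a multithreaded BLAS it can report several times the wall-clock duration. A minimal Python sketch of the distinction between the two clocks (the sleep stands in for any period where wall time passes but little CPU is used):

```python
import time

start_wall = time.perf_counter()   # wall-clock time
start_cpu = time.process_time()    # CPU time, summed over all threads

time.sleep(0.2)  # idle: wall time advances, CPU time barely does

wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu

# Wall time covers the sleep; CPU time stays near zero.
# With a threaded BLAS call the effect is reversed: CPU time
# can be several times larger than wall time.
print(f"wall: {wall:.2f}s, cpu: {cpu:.2f}s")
```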
This does not appear in your example, but for the micro benchmark, note that gfortran uses George Marsaglia's KISS PRNG, which may be slower, http://stackoverflow.com/questions/11590086/does-gfortran-generate-random-numbers-more-slowly-than-matlab and http://jblevins.org/mirror/amiller/rnorm.f90, which is slower.
I have changed the timing function to system_clock() and the results turn out to be (I ran it five times in one program)
Time for Multiplication: 92ms Time for Multiplication: 92ms Time for Multiplication: 89ms Time for Multiplication: 85ms Time for Multiplication: 94ms
It is approximately the same as Numpy, but still about 20% slower than Julia.
PS: The openblas I used is from its official website, since the "libopenblas.dll" shipped with Julia cannot be recognized by gfortran on my Win7.
Well, that's embarrassing. I feel like we've unfairly maligned Fortran's performance here for a long time. Of course, I didn't write the Fortran code, but still, we should fix this as soon as we can.
I compared gfortran with the default BLAS package on Debian (apt-get install libblas-dev) and the Intel Fortran compiler ifort with MKL. Then I tried gfortran with Intel MKL.
This is what I get (single-threaded):
$ gfortran -O3 test.f90 -lblas ; ./a.out
Time for Multiplication: 0.432s
$ ifort -O3 test.f90 -mkl=sequential ; ./a.out
Time for Multiplication: 0.044s
$ gfortran -O3 test.f90 -L/opt/intel/composerxe/mkl/lib/intel64 -lmkl_sequential -lmkl_core -lpthread -lmkl_rt
$ ./a.out
Time for Multiplication: 0.044s
With IPython, I get:
In [1]: from numpy import *
In [2]: n = 1000; A = random.rand(n,n); B = random.rand(n,n);
In [3]: %time C = dot(A,B);
CPU times: user 0.42 s, sys: 0.00 s, total: 0.42 s
Wall time: 0.42 s
which is consistent with the Fortran results using the default BLAS library on my system.
It seems the problem is not related to gfortran itself but to which BLAS library each language is linked against. Could it be that some optimized BLAS is linked statically with some distributions of numpy? That would explain the differences between Fortran and Python.
I hope this helps...
Could it be that some optimized BLAS is linked statically with some distributions of numpy?
Not always statically linked, but yes depending on which OS and where you get numpy from, you may end up using reference netlib, or ATLAS, or MKL (or Apple's Accelerate which I believe is derived from MKL), maybe even ACML on AMD processors. I don't think any of the major numpy distributions are using openblas yet, but if you have a package manager that symlinks libblas.so to openblas via alternatives then that can work too. It's kind of a zoo.
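Given that zoo of possible BLAS backends, numpy can report its own build configuration directly, which avoids hunting for _dotblas with ldd or dumpbin. `numpy.show_config()` prints the BLAS/LAPACK libraries the installation was built against (the exact fields printed vary across numpy versions):

```python
import numpy as np

# Prints the BLAS/LAPACK configuration this numpy build was
# compiled against (e.g. OpenBLAS, MKL, ATLAS, or reference netlib).
np.show_config()
```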
@nsapy Could you submit a PR to fix our Fortran microbenchmark?
@ViralBShah I am sorry, but I do not know what you mean by "submit a PR" since I am a newbie here ...
@scemama @tkelman I use the python from the anaconda distribution (http://www.continuum.io) but I am not sure whether optimized BLAS is used.
@nsapy You could probably use ldd to check which dynamic library will be used. On my system, there is a _dotblas.so file in the numpy directory. ldd tells me which BLAS is used (2nd line of output):
$ ldd /usr/lib/pymodules/python2.7/numpy/core/_dotblas.so
linux-vdso.so.1 => (0x00007fffd9798000)
libblas.so.3 => /usr/lib/libblas.so.3 (0x00002b99f1f3f000)
libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00002b99f21e0000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00002b99f24f6000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00002b99f2778000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00002b99f298f000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00002b99f2bc4000)
/lib64/ld-linux-x86-64.so.2 (0x00002b99f1aff000)
@scemama I am using Win7, so I use the command
dumpbin /DEPENDENTS _dotblas.pyd
and get
Dump of file _dotblas.pyd
File Type: DLL
Image has the following dependencies:
    libiomp5md.dll
    python27.dll
    MSVCR90.dll
    KERNEL32.dll
It seems it uses OpenMP for parallel computing (libiomp5md.dll is Intel's OpenMP runtime, which suggests an MKL-backed build).
As this is closed, should Fortran's numbers at julialang.org be updated (with or without a new version of compiler)? I see #10175 was merged and backported to 0.3 (not sure what that means as it is Fortran90 code..). Maybe it's time to update Julia's numbers to a newer version also (I'm not sure you would want to do that and not other languages)?
I have updated the benchmarks, but Fortran's performance on the random matrix multiply benchmark did not drop to 1.0. According to the microbenchmark makefile, gfortran cannot handle OpenBLAS when it is linked with 64-bit integers (USE_BLAS64=1, our default), so -fexternal-blas is not passed to Fortran in this case. I have verified that the Fortran programs built on our test machine, which has gfortran 4.8.2 on 64-bit Ubuntu 14.04, do indeed segfault.
Hi, the benchmark results on the Julia home page (http://julialang.org/) show that Fortran is about 4x slower than Julia/Numpy in the "rand_mat_mul" benchmark.
I cannot understand why Fortran is slower when all of them call the same Fortran library (BLAS)?
I have also performed a simple matrix-multiplication test involving Fortran, Julia, and numpy and got similar results: