JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.93k stars 5.49k forks

Why is Fortran slow in the "rand_mat_mul" benchmark? #10157

Closed · nsapy closed this 9 years ago

nsapy commented 9 years ago

Hi, the benchmark results on the Julia home page (http://julialang.org/) show that Fortran is about 4x slower than Julia/NumPy in the "rand_mat_mul" benchmark.

I cannot understand why Fortran is slower when it is calling the same Fortran library (BLAS).

I have also performed a simple matrix-multiplication test involving Fortran, Julia, and NumPy, and got similar results:

Julia:

>> n = 1000; A = rand(n,n); B = rand(n,n);
>> @time C = A*B;
>> elapsed time: 0.069577896 seconds (7 MB allocated)

Python (NumPy):

>> from numpy import *
>> n = 1000; A = random.rand(n,n); B = random.rand(n,n);
>> %time C = dot(A,B);
>> Wall time: 98 ms
PROGRAM TEST

    IMPLICIT NONE
    INTEGER, PARAMETER :: N = 1000
    INTEGER :: I,J
    REAL*8 :: T0,T1

    REAL*8 :: A(N,N), B(N,N), C(N,N)

    CALL RANDOM_SEED()
    DO I = 1, N, 1
        DO J = 1, N, 1
            CALL RANDOM_NUMBER(A(I,J))
            CALL RANDOM_NUMBER(B(I,J))
        END DO
    END DO

    call cpu_time(t0)
    CALL DGEMM ( "N", "N", N, N, N, 1.D0, A, N, B, N, 0.D0, C, N )
    call cpu_time(t1)

    write(unit=*, fmt="(a24,f10.3,a1)") "Time for Multiplication:",t1-t0,"s"

END PROGRAM TEST
>> gfortran test_blas.f90 libopenblas.dll -O3 & a.exe
>> Time for Multiplication:     0.296s
ViralBShah commented 9 years ago

I am assuming it has something to do with the number of threads and how they get set up. Could you do OPENBLAS_NUM_THREADS=4 or something in the environment and then try the Fortran code again?

ViralBShah commented 9 years ago

Why is the numpy code doing a dot product instead of matrix multiply?

nsapy commented 9 years ago
dot(a, b, out=None)

Dot product of two arrays.

For 2-D arrays it is equivalent to matrix multiplication, and for 1-D
arrays to inner product of vectors (without complex conjugation).
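The equivalence quoted from the docstring is easy to check directly; a minimal sketch (the fixed seed is arbitrary, just for reproducibility):

```python
import numpy as np

# Two random 2-D arrays; the seed is only there for reproducibility.
rng = np.random.default_rng(0)
A = rng.random((3, 3))
B = rng.random((3, 3))

# For 2-D arrays, np.dot performs matrix multiplication,
# identical to the @ operator (and numpy.matmul).
same = np.allclose(np.dot(A, B), A @ B)
print(same)  # True
```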
nsapy commented 9 years ago

I have set the environment variable with

set OPENBLAS_NUM_THREADS=4

and then add a new line in my fortran source code

Call openblas_set_num_threads(4)

but no noticeable difference occurs (I ran it twice):

Time for Multiplication: 0.296s
Time for Multiplication: 0.312s

ViralBShah commented 9 years ago

Are you linking to the same openblas that is distributed by julia?

StefanKarpinski commented 9 years ago

See also: http://stackoverflow.com/questions/28450367/why-fortran-is-slow-in-the-julia-benchmark-rand-mat-mul. It seems like there may be some issue with using cpu_time here.
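For context on the cpu_time pitfall: Fortran's cpu_time reports CPU time summed over all threads, while system_clock reports wall-clock time, so a multithreaded DGEMM looks several times slower under cpu_time. Python's clocks make the same distinction; a minimal sketch (using sleep rather than BLAS so the effect is deterministic):

```python
import time

# perf_counter() is a wall clock (like Fortran's system_clock);
# process_time() is CPU time consumed by the process (like cpu_time).
wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(0.2)  # no CPU work: wall time advances, CPU time barely does

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(wall > 0.15)  # True: about 0.2 s of wall time elapsed
print(cpu < 0.15)   # True: almost no CPU time was consumed
```

With a multithreaded BLAS call the effect runs the other way: cpu_time can report roughly N times the wall time on N threads.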

mschauer commented 9 years ago

This does not appear in your example, but for the micro benchmarks, note that gfortran uses George Marsaglia's KISS PRNG, which may be slower than other generators; see http://stackoverflow.com/questions/11590086/does-gfortran-generate-random-numbers-more-slowly-than-matlab and http://jblevins.org/mirror/amiller/rnorm.f90.

nsapy commented 9 years ago

I have changed the timing function to system_clock() and the results turn out to be (I ran it five times in one program):

Time for Multiplication: 92ms
Time for Multiplication: 92ms
Time for Multiplication: 89ms
Time for Multiplication: 85ms
Time for Multiplication: 94ms

That is about the same as NumPy, but still about 20% slower than Julia.

PS: The OpenBLAS I used is from its official website, since the "libopenblas.dll" shipped with Julia cannot be recognized by gfortran on my Windows 7 machine.

StefanKarpinski commented 9 years ago

Well, that's embarrassing. I feel like we've unfairly maligned Fortran's performance here for a long time. Of course, I didn't write the Fortran code, but still, we should fix this as soon as we can.

scemama commented 9 years ago

I compared gfortran with the default BLAS package on Debian (apt-get install libblas-dev) against the Intel Fortran compiler ifort with MKL. Then I tried gfortran with Intel MKL. This is what I get (single-threaded):

$ gfortran -O3 test.f90 -lblas ; ./a.out
Time for Multiplication:     0.432s
$ ifort -O3 test.f90 -mkl=sequential ; ./a.out
Time for Multiplication:     0.044s
$ gfortran -O3 test.f90 -L/opt/intel/composerxe/mkl/lib/intel64 -lmkl_sequential -lmkl_core -lpthread -lmkl_rt
$ ./a.out 
Time for Multiplication:     0.044s

With IPython, I get:

In [1]: from numpy import *

In [2]: n = 1000; A = random.rand(n,n); B = random.rand(n,n);

In [3]: %time C = dot(A,B);
CPU times: user 0.42 s, sys: 0.00 s, total: 0.42 s
Wall time: 0.42 s

which is consistent with the Fortran result using the default BLAS library on my system.

It seems the problem is not related to gfortran itself but to which BLAS library gets linked in each language. Could it be that some optimized BLAS is statically linked into some distributions of numpy? That would explain the differences between Fortran and Python.

I hope this helps...

tkelman commented 9 years ago

Could it be that some optimized BLAS is linked statically with some distributions of numpy?

Not always statically linked, but yes depending on which OS and where you get numpy from, you may end up using reference netlib, or ATLAS, or MKL (or Apple's Accelerate which I believe is derived from MKL), maybe even ACML on AMD processors. I don't think any of the major numpy distributions are using openblas yet, but if you have a package manager that symlinks libblas.so to openblas via alternatives then that can work too. It's kind of a zoo.
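One way to see which BLAS a particular numpy build was linked against is its build-configuration report (a minimal sketch; the exact output format varies across numpy versions, and on Windows a tool like dumpbin can inspect the DLLs directly):

```python
import numpy as np

# Prints the BLAS/LAPACK libraries numpy was built against
# (e.g. MKL, OpenBLAS, ATLAS, or the netlib reference BLAS).
np.show_config()
```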

ViralBShah commented 9 years ago

@nsapy Could you submit a PR to fix our Fortran microbenchmark?

nsapy commented 9 years ago

@ViralBShah I am sorry, but I do not know what you mean by "submit a PR", since I am a newbie here ...

nsapy commented 9 years ago

@scemama @tkelman I use the Python from the Anaconda distribution (http://www.continuum.io), but I am not sure whether an optimized BLAS is used.

scemama commented 9 years ago

@nsapy You could probably use ldd to check which dynamic library will be used. On my system, there is a _dotblas.so file in the numpy directory. ldd tells me which BLAS is used (second line of the output):

$ ldd /usr/lib/pymodules/python2.7/numpy/core/_dotblas.so 
    linux-vdso.so.1 =>  (0x00007fffd9798000)
    libblas.so.3 => /usr/lib/libblas.so.3 (0x00002b99f1f3f000)
    libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00002b99f21e0000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00002b99f24f6000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00002b99f2778000)
    libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00002b99f298f000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00002b99f2bc4000)
    /lib64/ld-linux-x86-64.so.2 (0x00002b99f1aff000)
nsapy commented 9 years ago

@scemama I am using Windows 7, so I used the command dumpbin /DEPENDENTS _dotblas.pyd and got:

> Dump of file _dotblas.pyd
>
> File Type: DLL
>
>   Image has the following dependencies:
>
>     libiomp5md.dll
>     python27.dll
>     MSVCR90.dll
>     KERNEL32.dll

It seems it uses OpenMP for parallel computing (libiomp5md.dll is the Intel OpenMP runtime, which ships with MKL).

PallHaraldsson commented 9 years ago

As this is closed, should Fortran's numbers at julialang.org be updated (with or without a new compiler version)? I see #10175 was merged and backported to 0.3 (not sure what that means, as it is Fortran 90 code...). Maybe it's also time to update Julia's numbers to a newer version (though I'm not sure you would want to do that without updating the other languages)?

jiahao commented 9 years ago

I have updated the benchmarks, but Fortran's performance on the random matrix multiply benchmark did not drop to 1.0. According to the microbenchmark makefile, gfortran cannot handle OpenBLAS when it is linked with 64-bit integers (USE_BLAS64=1, our default), so -fexternal-blas is not passed to gfortran in this case. I have verified that the Fortran programs built on our test machine, which has gfortran 4.8.2 on 64-bit Ubuntu 14.04, do indeed segfault.