franko / gsl-shell

GSL library shell based on LuaJIT2
http://franko.github.io/gsl-shell/
GNU General Public License v3.0

LuaJIT 2.1? #25

Open · pygy opened 10 years ago

pygy commented 10 years ago

Do you plan to migrate to the 2.1 branch? It is faster than v2.0.x, and AFAIK stable enough to be used in production at CloudFlare.

franko commented 10 years ago

Hi,

Actually, I was just waiting for the first stable release of the 2.1 branch but, as you suggest, it is probably fine to migrate with the next release of GSL Shell. I actually have a lot of minor changes to include, and a new release would be a good thing.

Otherwise, what about a new GSL Shell branch on GitHub to integrate LuaJIT 2.1?

pygy commented 10 years ago

> Otherwise, what about a new GSL Shell branch on GitHub to integrate LuaJIT 2.1?

Why not.

AFAICT, the parse.c and Makefile modifications work as-is in 2.1.

I'll also have to send you a patch for compiling on OS X 10.8.

franko commented 10 years ago

Now there is a v2.1 branch in GSL Shell's repository:

https://github.com/franko/gsl-shell/tree/master-lj2.1

The merge was very easy thanks to the power of git :-) and everything seems to work just fine.

Francesco

pygy commented 10 years ago

Cool :-)

The Julia guys are about to add the LuaJIT/GSL Shell benchmarks you wrote to their home page. I'll point them to the LJ 2.1 branch.

LuaJIT v2.1 is 10 times faster than v2.0 for parseint, but a bit slower for mandel (but, in both cases, it still beats the hell out of C :-).

The pure JavaScript (V8) implementation of rand_mat_stat is faster than its GSL Shell counterpart, which relies on BLAS, as do the C, Julia, and Fortran benchmarks. The latter three are also faster than LuaJIT/GSL Shell. Maybe you're not using the same BLAS?

LuaJIT is ~10 times slower than C for rand_mat_mul, but faster than JS.

Check here for the results on my machine: https://github.com/JuliaLang/julia/commit/9a57b996c91383527404f1adbdc0b29af8e6f798#commitcomment-5996981

Edit: note also that quicksort can be made faster by switching to an FFI array.
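
For illustration, a minimal sketch of what "switching to an FFI array" could mean, independent of the actual benchmark code (the pivot choice and the array size here are only for illustration):

local ffi = require("ffi")

-- In-place quicksort over a 0-based FFI double array.
local function qsort(a, lo, hi)
    if lo >= hi then return end
    local pivot = a[math.floor((lo + hi) / 2)]
    local i, j = lo, hi
    while i <= j do
        while a[i] < pivot do i = i + 1 end
        while a[j] > pivot do j = j - 1 end
        if i <= j then
            -- Swap out-of-place elements and move both cursors inward.
            a[i], a[j] = a[j], a[i]
            i, j = i + 1, j - 1
        end
    end
    qsort(a, lo, j)
    qsort(a, i, hi)
end

-- Allocate the data as an FFI array instead of a Lua table: raw doubles
-- stored contiguously, with no table overhead.
local n = 5000
local v = ffi.new("double[?]", n)
for i = 0, n - 1 do v[i] = math.random() end
qsort(v, 0, n - 1)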

franko commented 10 years ago

The benchmark results look good to me.

I agree that there are some odd things. I already noticed in the past that Julia was faster in rand_mat_mul, but I cannot tell the reason. The only thing I can suggest is to make sure that OpenBLAS is actually used for GSL Shell. For me, the overall speed should be determined by the underlying BLAS implementation.

Otherwise, I would not be too picky about these benchmark results, and I'm afraid I don't have enough time to further investigate the problem.

In any case, I will be glad if they include lua/gsl-shell in their benchmark page. Thank you for your help with that.

pygy commented 10 years ago

How can I set the BLAS version?

On my machine, the GSL-based rand_mat_mul is ~10% faster than a straight port of the JavaScript code to Lua:

local ffi = require("ffi")

local darray = ffi.typeof("double[?]")

-- Fill an FFI double array with n uniform random numbers
-- (rng is GSL Shell's random number generator module).
local function randd(n)
    local v, r
    v = darray(n)
    r = rng.new('rand')

    for i = 0, n-1 do
        v[i] = r:get()
    end

    return v
end

-- Transpose mxn matrix.
local function mattransp(A, m, n)
    local T = darray(m * n)

    for i = 0, m - 1 do
        for j = 0, n-1 do
            T[j * m + i] = A[i * n + j]
        end
    end
    return T
end

local function matmul(A,B,m,l,n)
    local C, total
    C = darray(m*n)
    -- Transpose B to take advantage of memory locality.
    B = mattransp(B,l,n)

    for i = 0, m - 1 do
        for j = 0, n - 1 do
            total = 0

            for k = 0, l - 1 do
                total = total + A[i*l+k]*B[j*l+k]
            end

            C[i*n+j] = total
        end
    end

    return C
end

local function randmatmulLJ(n)
    local A, B
    A = randd(n*n)
    B = randd(n*n)

    return matmul(A, B, n, n, n)
end

-- randmatmul (defined in the benchmark script) is the GSL/BLAS-based
-- version; randmatmulLJ is the plain Lua port above.
timeit(|| randmatmul(1000), "rand_mat_mul")      --> 1129.19
timeit(|| randmatmulLJ(1000), "rand_mat_mul_LJ") --> 1255.42

BTW:

$ node perf.js
...
javascript,rand_mat_mul,2933

:-)

franko commented 10 years ago

To check the BLAS library, you have to run "ldd" on the executable and then see which file libblas.so points to using "ls -l".

I'm now wondering if Julia is faster because it transposes the matrix before the multiplication, just like the JS code does. In principle I should do some tests with dgemm, with and without transpose, like in the JS code, but unfortunately I don't have time to work on that.
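
A rough sketch of such a test, calling cblas_dgemm directly through the LuaJIT FFI (the library name given to ffi.load and the problem size are assumptions, not GSL Shell code), could look like:

local ffi = require("ffi")

ffi.cdef[[
void cblas_dgemm(int Order, int TransA, int TransB,
                 int M, int N, int K,
                 double alpha, const double *A, int lda,
                 const double *B, int ldb,
                 double beta, double *C, int ldc);
]]

-- Load whichever CBLAS is installed ("gslcblas", "openblas", ...).
local cblas = ffi.load("gslcblas")

-- Standard CBLAS enum values: CblasRowMajor, CblasNoTrans, CblasTrans.
local RowMajor, NoTrans, Trans = 101, 111, 112

local function bench(name, transB, n)
    local A = ffi.new("double[?]", n * n)
    local B = ffi.new("double[?]", n * n)
    local C = ffi.new("double[?]", n * n)
    for i = 0, n * n - 1 do
        A[i], B[i] = math.random(), math.random()
    end

    local t0 = os.clock()
    -- C = A*B when transB == NoTrans, C = A*B^T when transB == Trans.
    cblas.cblas_dgemm(RowMajor, NoTrans, transB, n, n, n,
                      1.0, A, n, B, n, 0.0, C, n)
    print(name, os.clock() - t0)
end

bench("dgemm, B not transposed", NoTrans, 1000)
bench("dgemm, B transposed",     Trans,   1000)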

pygy commented 10 years ago

There's no ldd on OS X, but otool -L does the trick.

$ otool -L gsl-shell | grep blas
    /usr/local/lib/libgslcblas.0.dylib (compatibility version 1.0.0, current version 1.0.0)
$ ls -l /usr/local/lib/libgslcblas.0.dylib
lrwxr-xr-x  1 pygy  staff  42 Apr 12 23:57 /usr/local/lib/libgslcblas.0.dylib -> ../Cellar/gsl/1.16/lib/libgslcblas.0.dylib

The GSL, as installed by brew, relies on the default libgslcblas. I've tried to redirect the symlink to a freshly compiled OpenBLAS, but it complains about version issues (1.0.0 required, 0.0.0 found). The same goes for the Julia BLAS.

I'm also trying to build the GSL by hand, but I don't know how to tell it to use another BLAS.

pygy commented 10 years ago

I got it to compile with OpenBLAS (by adding the proper paths and options in the GSL Shell Makefile).

rand_mat_mul is now as fast as C/Julia :-)

It may be nice to add the possibility to customize LIBS and LDFLAGS in makeconfig.

franko commented 10 years ago

Good :-)

Actually, the libraries are supposed to be configurable using the file "makepackages", but maybe this is not very intuitive.

On Linux, "makepackages" links with whatever "blas" library the system provides (via GSL_LIBS), so OpenBLAS is not required. It is possible to modify the default makefile to link explicitly to OpenBLAS, but I'm not sure this is a good idea.

Maybe a warning could be shown at compile time if the gslcblas library is used, since the latter is really slow.

Suggestions & patches are welcome.

pygy commented 10 years ago

makepackages is probably fine... I tend to explore code rather than read the docs (too often, there are none), and I thought that makeconfig was where users were supposed to tweak things.

OS X also provides a fast BLAS; I'll look into how to link to it.

pygy commented 10 years ago

I found the system BLAS, which is even faster than OpenBLAS, but I don't know if it is found at the same path on all OS X versions.

Edit: actually, adding -lBLAS to the GSL_LIBS does the trick, without adding any path to the linker (which actually makes sense).