allow the use of openblas

mikestillman commented 8 years ago

Word on the street is that openblas is far better than the default blas, on Ubuntu at least, and probably on other Linux distributions too. After compiling it, I think the new library just needs to be added to the list of libraries we link against. It was recommended to me that we compile it from source, to make better use of the hardware on each target machine. However, we should probably also allow the Ubuntu openblas package to be used when building distributions.

It would be nice to allow the use of openblas and, if it is actually far superior, make it the default. On Macs, though, we currently use the Accelerate framework, which seems to be very good. Even there, it might be good to compare the two.

There are several reasons for this request, but my main interest right at the moment is to improve the speed of rank computations in ffpack (which is used in the fast non-minimal free resolution code). Currently, if I compare across machines, I find that Ubuntu is perhaps 5-10 times slower at such computations than my Mac laptop, which is a year or two old.

I will add benchmarks to check this, so we can see what the actual improvement is.

DanGrayson commented 8 years ago

It might be good to include a benchmark written in Fortran that could be run immediately after compiling openblas or another blas. Give it to me and I'll put it somewhere appropriate.
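
As a rough illustration of such a post-install check (in Python rather than Fortran, and not part of the M2 tree), the sketch below times one double-precision matrix product through the CBLAS interface. It assumes that a shared library exporting `cblas_dgemm` with 32-bit integer arguments can be found under the name `openblas`, `cblas`, or `blas`; `load_cblas` and `time_dgemm` are hypothetical helper names, and `time_dgemm` returns `None` when no such library is present.

```python
import ctypes
import ctypes.util
import time

CBLAS_ROW_MAJOR = 101   # CblasRowMajor in cblas.h
CBLAS_NO_TRANS = 111    # CblasNoTrans in cblas.h


def load_cblas(names=("openblas", "cblas", "blas")):
    """Return the first library that exports cblas_dgemm, or None."""
    for name in names:
        path = ctypes.util.find_library(name)
        if path is None:
            continue
        try:
            lib = ctypes.CDLL(path)
            lib.cblas_dgemm  # raises AttributeError if the symbol is missing
        except (OSError, AttributeError):
            continue
        return lib
    return None


def time_dgemm(n, lib=None):
    """Time C = A * B for n x n matrices; seconds, or None without a CBLAS."""
    if lib is None:
        lib = load_cblas()
    if lib is None:
        return None
    dptr = ctypes.POINTER(ctypes.c_double)
    lib.cblas_dgemm.restype = None
    lib.cblas_dgemm.argtypes = [
        ctypes.c_int, ctypes.c_int, ctypes.c_int,   # layout, transA, transB
        ctypes.c_int, ctypes.c_int, ctypes.c_int,   # M, N, K
        ctypes.c_double, dptr, ctypes.c_int,        # alpha, A, lda
        dptr, ctypes.c_int,                         # B, ldb
        ctypes.c_double, dptr, ctypes.c_int,        # beta, C, ldc
    ]
    Array = ctypes.c_double * (n * n)
    a = Array(*[float(i % 7) for i in range(n * n)])
    b = Array(*[float(i % 5) for i in range(n * n)])
    c = Array()  # zero-initialized output buffer
    start = time.perf_counter()
    lib.cblas_dgemm(CBLAS_ROW_MAJOR, CBLAS_NO_TRANS, CBLAS_NO_TRANS,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n)
    return time.perf_counter() - start
```

Calling `time_dgemm(1000)` against each candidate library gives only a crude comparison; a serious benchmark would repeat the call several times and discard the first run.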

DanGrayson commented 8 years ago

Here's some info on a possibly useful debian/ubuntu package for testing blas:

Package: libblas-test
Priority: optional
Section: universe/libs
Installed-Size: 1882
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Debian Science Team <debian-science-maintainers@lists.alioth.debian.org>
Architecture: amd64
Source: lapack
Version: 3.6.0-2ubuntu2
Depends: libblas3 | libblas.so.3, libc6 (>= 2.4), libgfortran3 (>= 4.6)
Filename: pool/universe/l/lapack/libblas-test_3.6.0-2ubuntu2_amd64.deb
Size: 303704
MD5sum: 45894116ac90759bd2c8fbb965aeaa31
SHA1: a58cbfca37a5885ba8a519d3ad65a114ff2c59f2
SHA256: 805804aa6844249da5acbc508273d52a43f25db55fcb0cfec2e6a5c027351a8e
Description-en: Basic Linear Algebra Subroutines 3, testing programs
 BLAS (Basic Linear Algebra Subroutines) is a set of efficient
 routines for most of the basic vector and matrix operations.
 They are widely used as the basis for other high quality linear
 algebra software, for example lapack and linpack.  This
 implementation is the Fortran 77 reference implementation found
 at netlib.
 .
 This package contains a set of programs which test the integrity of an
 installed blas-compatible shared library. These programs may therefore be used
 to test the libraries provided by the blas package as well as those provided
 by the libatlas3-base and libopenblas-base packages. The programs are
 dynamically linked -- one can explicitly select a library to test by setting
 the libblas.so.3 alternative, or by using the LD_LIBRARY_PATH or LD_PRELOAD
 environment variables. Likewise, one can display the library selected using
 the ldd program in an identical environment.
Description-md5: 7e697a3bd80892afd85df0f1b0596433
Homepage: http://www.netlib.org/lapack/
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Origin: Ubuntu
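
The library-selection mechanism described in the package text can be sketched as follows. This assumes a Linux system with `ldd` on the PATH; the helper names are hypothetical, and the paths in the usage note are placeholders that vary by distribution.

```python
import os
import subprocess


def env_with_preload(library_path, base_env=None):
    """Return a copy of the environment with LD_PRELOAD pointing at library_path."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    # LD_PRELOAD accepts a list of libraries; put ours first.
    env["LD_PRELOAD"] = library_path if not existing else library_path + ":" + existing
    return env


def linked_libraries(program, env=None):
    """Run ldd on program in the given environment and return its output lines."""
    result = subprocess.run(["ldd", program], env=env, capture_output=True,
                            text=True, check=True)
    return result.stdout.splitlines()
```

For example, `linked_libraries("/path/to/blas-test-program", env_with_preload("/path/to/libopenblas.so.0"))` should list the preloaded library among the results, mirroring the `ldd` check mentioned in the description.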

dimpase commented 8 years ago

Sage uses Atlas, which is a pain to install (automatic tuning is very slow), but is reasonably fast on Linux and OSX. It would be interesting to see how it compares with openblas, once M2 on Sage is working...

mikestillman commented 8 years ago

Here is an example, using M2, that hopefully indicates that a better blas would have a significant effect on these computations. (Note: for one free resolution example, a similar rank computation on SL took 4 days, so improving this by a constant factor would be excellent!)

restart
debug Core
kk = ZZp(32003, Strategy=>"Ffpack")
kk1 = ZZp(32003, Strategy=>"Flint")
elapsedTime M = random(ZZ^4000, ZZ^4000, Height=>32000, Density=>.2);
time M0 = mutableMatrix promote(M,kk);
time M1 = mutableMatrix promote(M,kk1);
time rank M0  -- this line uses the blas heavily
time rank M1  -- this line doesn't use the blas as far as I know.

elapsedTime M = random(ZZ^6000, ZZ^6000, Height=>32000, Density=>.2);
time M2 = mutableMatrix promote(M,kk);
time M3 = mutableMatrix promote(M,kk1);
time rank M2  -- this line uses the blas heavily
time rank M3  -- this line doesn't use the blas as far as I know.

-- the times for the 4 rank commands
-- MacBookPro, running 10.10.5, 16 GB ram, Mid 2014 Retina MacBookPro.
time rank M0 -- 2.27 sec
time rank M1 -- 7.82 sec
time rank M2 -- 7.01 sec
time rank M3 -- 40.99 sec

-- On an SL machine, which seems to be about the same speed as (perhaps a bit faster than)
-- my mac:
time rank M0 -- 16.72 sec
time rank M1 -- 7.85 sec
time rank M2 -- 52.32 sec
time rank M3 -- 23.9 sec

-- the blas code appears to be running somewhat more than 7 times slower on
-- SL than on the mac.  I think ubuntu is similar to SL in speed here.
-- perhaps openblas can improve this?
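
The slowdown estimate above follows directly from the reported timings. A small Python check, using only the numbers quoted in this comment (the dictionary and its labels are just for illustration):

```python
# Timings (seconds) copied from the comment above: (mac, SL) for each rank call.
timings = {
    "M0 (Ffpack, 4000x4000)": (2.27, 16.72),
    "M1 (Flint, 4000x4000)": (7.82, 7.85),
    "M2 (Ffpack, 6000x6000)": (7.01, 52.32),
    "M3 (Flint, 6000x6000)": (40.99, 23.9),
}

# Ratio > 1 means SL is slower than the mac for that call.
ratios = {name: sl / mac for name, (mac, sl) in timings.items()}
for name, ratio in ratios.items():
    print(f"{name}: SL/mac = {ratio:.2f}")
```

Both Ffpack runs come out a little above 7, while the Flint runs are at parity or faster on SL, which is consistent with the blas being the bottleneck rather than the machine.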

mikestillman commented 8 years ago

By the way, about my code in the previous post: sorry, I chose a time-inefficient way to create these matrices.

mikestillman commented 7 years ago

@DanGrayson this one is important :)

dimpase commented 7 years ago

By the way, Sage has switched to openblas.

mahrud commented 4 years ago

Generic lapack and blas don't take advantage of multiple CPU cores or of CPU vectorization (e.g. SSE2, which is ubiquitous now). Here's some information about benchmarking blas implementations with numpy: https://markus-beuckelmann.de/blog/boosting-numpy-blas.html On that note, Eigen also seems to be a great contender, although its API is different, so we would have to change our code: http://eigen.tuxfamily.org/index.php?title=Benchmark

dimpase commented 4 years ago

By the way, Sage switched to openblas years ago.

mahrud commented 4 years ago

With the CMake build, we have, too! Hopefully the autotools build is next.

mahrud commented 4 years ago

Here's a quick benchmark. First I had to comment out everything after line 320 of quarantine/lapack.m2 since an engine routine is failing for matrices with zero rows or columns.

Using OpenBLAS:

[mahrud@noether build]$ ctest -R lapack --repeat-until-fail 10
Test project /home/mahrud/Projects/M2/M2/M2/BUILD/build
    Start 3243: quarantine/lapack.m2
    Test #3243: quarantine/lapack.m2 .............   Passed    1.72 sec
    Start 3243: quarantine/lapack.m2
    Test #3243: quarantine/lapack.m2 .............   Passed    1.78 sec
    Start 3243: quarantine/lapack.m2
    Test #3243: quarantine/lapack.m2 .............   Passed    1.77 sec
    Start 3243: quarantine/lapack.m2
    Test #3243: quarantine/lapack.m2 .............   Passed    1.89 sec
    Start 3243: quarantine/lapack.m2
    Test #3243: quarantine/lapack.m2 .............   Passed    1.84 sec
    Start 3243: quarantine/lapack.m2
    Test #3243: quarantine/lapack.m2 .............   Passed    1.93 sec
    Start 3243: quarantine/lapack.m2
    Test #3243: quarantine/lapack.m2 .............   Passed    1.92 sec
    Start 3243: quarantine/lapack.m2
    Test #3243: quarantine/lapack.m2 .............   Passed    2.01 sec
    Start 3243: quarantine/lapack.m2
    Test #3243: quarantine/lapack.m2 .............   Passed    1.99 sec
    Start 3243: quarantine/lapack.m2
1/1 Test #3243: quarantine/lapack.m2 .............   Passed    2.09 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =  19.18 sec

Compared with LAPACK/BLAS:

[mahrud@noether blas]$ ctest -R lapack --repeat-until-fail 10
Test project /home/mahrud/Projects/M2/M2/M2/BUILD/blas
    Start 522: quarantine/lapack.m2
    Test #522: quarantine/lapack.m2 .............   Passed    3.04 sec
    Start 522: quarantine/lapack.m2
    Test #522: quarantine/lapack.m2 .............   Passed    3.12 sec
    Start 522: quarantine/lapack.m2
    Test #522: quarantine/lapack.m2 .............   Passed    3.17 sec
    Start 522: quarantine/lapack.m2
    Test #522: quarantine/lapack.m2 .............   Passed    3.20 sec
    Start 522: quarantine/lapack.m2
    Test #522: quarantine/lapack.m2 .............   Passed    3.16 sec
    Start 522: quarantine/lapack.m2
    Test #522: quarantine/lapack.m2 .............   Passed    3.13 sec
    Start 522: quarantine/lapack.m2
    Test #522: quarantine/lapack.m2 .............   Passed    2.93 sec
    Start 522: quarantine/lapack.m2
    Test #522: quarantine/lapack.m2 .............   Passed    2.96 sec
    Start 522: quarantine/lapack.m2
    Test #522: quarantine/lapack.m2 .............   Passed    2.97 sec
    Start 522: quarantine/lapack.m2
1/1 Test #522: quarantine/lapack.m2 .............   Passed    2.94 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =  30.72 sec

That's a 37% improvement.
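
For reference, the percentage is the reduction in total wall time relative to the reference LAPACK/BLAS run. A one-function sketch (the function name is made up for illustration):

```python
def percent_reduction(baseline_seconds, new_seconds):
    """Reduction in run time relative to the baseline, as a percentage."""
    return 100.0 * (baseline_seconds - new_seconds) / baseline_seconds


# Totals reported by ctest above: 30.72 s (LAPACK/BLAS) vs 19.18 s (OpenBLAS).
lapack_reduction = percent_reduction(30.72, 19.18)
print(f"{lapack_reduction:.1f}% less time, {30.72 / 19.18:.2f}x faster")
```

Expressed as a speedup rather than a reduction, the same totals give roughly 1.6x.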

mahrud commented 4 years ago

The effect on the gb tests is even more significant: a 58% improvement.

OpenBLAS:

[mahrud@noether build]$ ctest -R normal/gb
Test project /home/mahrud/Projects/M2/M2/M2/BUILD/build
      Start 2974: normal/gb-matrix-lift.m2
 1/14 Test #2974: normal/gb-matrix-lift.m2 .........   Passed    0.51 sec
      Start 2975: normal/gb-skew-ZZ.m2
 2/14 Test #2975: normal/gb-skew-ZZ.m2 .............   Passed    0.53 sec
      Start 2976: normal/gb-snapp-bug.m2
 3/14 Test #2976: normal/gb-snapp-bug.m2 ...........   Passed    0.50 sec
      Start 2977: normal/gb2.m2
 4/14 Test #2977: normal/gb2.m2 ....................   Passed    0.51 sec
      Start 2978: normal/gbQQbug.m2
 5/14 Test #2978: normal/gbQQbug.m2 ................   Passed    0.57 sec
      Start 2979: normal/gbZZ-2.m2
 6/14 Test #2979: normal/gbZZ-2.m2 .................   Passed    0.52 sec
      Start 2980: normal/gbZZ-mingens.m2
 7/14 Test #2980: normal/gbZZ-mingens.m2 ...........   Passed    0.56 sec
      Start 2981: normal/gbZZ13.m2
 8/14 Test #2981: normal/gbZZ13.m2 .................   Passed    0.78 sec
      Start 2982: normal/gbZZautoreduction.m2
 9/14 Test #2982: normal/gbZZautoreduction.m2 ......   Passed    0.52 sec
      Start 2983: normal/gbZZbug.m2
10/14 Test #2983: normal/gbZZbug.m2 ................   Passed    1.33 sec
      Start 2984: normal/gbZZbug2-a.m2
11/14 Test #2984: normal/gbZZbug2-a.m2 .............   Passed    0.61 sec
      Start 2985: normal/gbZZbug2.m2
12/14 Test #2985: normal/gbZZbug2.m2 ...............   Passed    0.62 sec
      Start 2986: normal/gbinhom.m2
13/14 Test #2986: normal/gbinhom.m2 ................   Passed    0.52 sec
      Start 2987: normal/gblimits.m2
14/14 Test #2987: normal/gblimits.m2 ...............   Passed    0.54 sec

100% tests passed, 0 tests failed out of 14

Total Test time (real) =   8.84 sec

BLAS/LAPACK:

[mahrud@noether blas]$ ctest -R normal/gb
Test project /home/mahrud/Projects/M2/M2/M2/BUILD/blas
      Start 3007: normal/gb-matrix-lift.m2
 1/14 Test #3007: normal/gb-matrix-lift.m2 .........   Passed    1.33 sec
      Start 3008: normal/gb-skew-ZZ.m2
 2/14 Test #3008: normal/gb-skew-ZZ.m2 .............   Passed    1.32 sec
      Start 3009: normal/gb-snapp-bug.m2
 3/14 Test #3009: normal/gb-snapp-bug.m2 ...........   Passed    1.35 sec
      Start 3010: normal/gb2.m2
 4/14 Test #3010: normal/gb2.m2 ....................   Passed    1.33 sec
      Start 3011: normal/gbQQbug.m2
 5/14 Test #3011: normal/gbQQbug.m2 ................   Passed    1.38 sec
      Start 3012: normal/gbZZ-2.m2
 6/14 Test #3012: normal/gbZZ-2.m2 .................   Passed    1.35 sec
      Start 3013: normal/gbZZ-mingens.m2
 7/14 Test #3013: normal/gbZZ-mingens.m2 ...........   Passed    1.42 sec
      Start 3014: normal/gbZZ13.m2
 8/14 Test #3014: normal/gbZZ13.m2 .................   Passed    1.73 sec
      Start 3015: normal/gbZZautoreduction.m2
 9/14 Test #3015: normal/gbZZautoreduction.m2 ......   Passed    1.41 sec
      Start 3016: normal/gbZZbug.m2
10/14 Test #3016: normal/gbZZbug.m2 ................   Passed    2.25 sec
      Start 3017: normal/gbZZbug2-a.m2
11/14 Test #3017: normal/gbZZbug2-a.m2 .............   Passed    1.59 sec
      Start 3018: normal/gbZZbug2.m2
12/14 Test #3018: normal/gbZZbug2.m2 ...............   Passed    1.55 sec
      Start 3019: normal/gbinhom.m2
13/14 Test #3019: normal/gbinhom.m2 ................   Passed    1.45 sec
      Start 3020: normal/gblimits.m2
14/14 Test #3020: normal/gblimits.m2 ...............   Passed    1.37 sec

100% tests passed, 0 tests failed out of 14

Total Test time (real) =  21.09 sec
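
To compare the two runs test by test rather than only by total, the per-test times can be pulled out of the ctest logs. The sketch below is a hypothetical helper, not part of the M2 build; it assumes the `Test #N: name ... Passed T sec` line format shown above and uses a few lines copied from both logs as sample input.

```python
import re

# Matches e.g. " 8/14 Test #2981: normal/gbZZ13.m2 ......   Passed    0.78 sec"
LINE = re.compile(r"Test\s+#\d+:\s+(\S+)\s+\.+\s+Passed\s+([0-9.]+)\s+sec")


def passed_times(log_text):
    """Map test name -> seconds for every 'Passed' line in a ctest log."""
    return {m.group(1): float(m.group(2)) for m in LINE.finditer(log_text)}


openblas_log = """
 1/14 Test #2974: normal/gb-matrix-lift.m2 .........   Passed    0.51 sec
 8/14 Test #2981: normal/gbZZ13.m2 .................   Passed    0.78 sec
10/14 Test #2983: normal/gbZZbug.m2 ................   Passed    1.33 sec
"""
reference_log = """
 1/14 Test #3007: normal/gb-matrix-lift.m2 .........   Passed    1.33 sec
 8/14 Test #3014: normal/gbZZ13.m2 .................   Passed    1.73 sec
10/14 Test #3016: normal/gbZZbug.m2 ................   Passed    2.25 sec
"""

fast, slow = passed_times(openblas_log), passed_times(reference_log)
# Seconds saved per test when switching from the reference build to OpenBLAS.
differences = {name: round(slow[name] - fast[name], 2) for name in fast}
for name, diff in differences.items():
    print(f"{name}: {diff:.2f} s saved")
```

In these three samples the saving stays between 0.8 and 1 second regardless of how long the test runs, which hints that much of the gain may be a fixed per-process cost rather than time spent in the computation itself. That reading is a guess from the numbers, not something established in this thread.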

DanGrayson commented 4 years ago

What remains is to switch the autotools build over to openblas.

dimpase commented 4 years ago

If you don't want to build your own openblas, openblas comes with openblas.pc, i.e. you can get info about it via pkg-config, or rather, PKG_CHECK_MODULES etc. Here is what we do in Sage https://github.com/sagemath/sage/blob/develop/build/pkgs/openblas/spkg-configure.m4

Admittedly, it's complicated: the problem is that different Linux distros package openblas differently, and sometimes you need a separate libcblas, etc. (But please ask questions about it, I wrote an initial version of that monster after all :-))
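
As a quick way to see what a given distro's `openblas.pc` reports before wiring it into configure, one can query pkg-config directly. A minimal Python sketch, assuming only that the module is named `openblas` as stated above (some distros may use a different name); `pkg_config` is a hypothetical helper and returns `None` when pkg-config or the module is missing.

```python
import shutil
import subprocess


def pkg_config(module, flag):
    """Return pkg-config's output for flag (e.g. --libs), or None if unavailable."""
    if shutil.which("pkg-config") is None:
        return None
    result = subprocess.run(["pkg-config", flag, module],
                            capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return result.stdout.strip()


for flag in ("--modversion", "--cflags", "--libs"):
    print(flag, "->", pkg_config("openblas", flag))
```

The `--libs` output is the part that differs most between distros, for instance in whether a separate cblas library shows up on the link line.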

d-torrance commented 1 week ago

Fixed in #3461

Macaulay2 / M2

allow the use of openblas #475