Closed mikestillman closed 1 week ago
It might be good to include a benchmark written in Fortran that can be run immediately after compiling OpenBLAS or another BLAS. Give it to me and I'll put it somewhere appropriate.
Here's some info on a possibly useful Debian/Ubuntu package for testing BLAS:
Package: libblas-test
Priority: optional
Section: universe/libs
Installed-Size: 1882
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Debian Science Team <debian-science-maintainers@lists.alioth.debian.org>
Architecture: amd64
Source: lapack
Version: 3.6.0-2ubuntu2
Depends: libblas3 | libblas.so.3, libc6 (>= 2.4), libgfortran3 (>= 4.6)
Filename: pool/universe/l/lapack/libblas-test_3.6.0-2ubuntu2_amd64.deb
Size: 303704
MD5sum: 45894116ac90759bd2c8fbb965aeaa31
SHA1: a58cbfca37a5885ba8a519d3ad65a114ff2c59f2
SHA256: 805804aa6844249da5acbc508273d52a43f25db55fcb0cfec2e6a5c027351a8e
Description-en: Basic Linear Algebra Subroutines 3, testing programs
BLAS (Basic Linear Algebra Subroutines) is a set of efficient
routines for most of the basic vector and matrix operations.
They are widely used as the basis for other high quality linear
algebra software, for example lapack and linpack. This
implementation is the Fortran 77 reference implementation found
at netlib.
.
This package contains a set of programs which test the integrity of an
installed blas-compatible shared library. These programs may therefore be used
to test the libraries provided by the blas package as well as those provided
by the libatlas3-base and libopenblas-base packages. The programs are
dynamically linked -- one can explicitly select a library to test by setting
the libblas.so.3 alternative, or by using the LD_LIBRARY_PATH or LD_PRELOAD
environment variables. Likewise, one can display the library selected using
the ldd program in an identical environment.
Description-md5: 7e697a3bd80892afd85df0f1b0596433
Homepage: http://www.netlib.org/lapack/
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Origin: Ubuntu
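The package description above says the test programs are dynamically linked, so the library under test can be chosen with LD_PRELOAD and checked with ldd in the same environment. Here is a minimal Python sketch of that idea. The helper names (`blas_env`, `show_linked_blas`) and the library path are hypothetical placeholders, not something the package provides; substitute the real paths from your system.

```python
import os
import subprocess


def blas_env(lib_path, base_env=None):
    """Return a copy of the environment with lib_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # ld.so accepts a colon- or space-separated list; ours goes first.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


def show_linked_blas(test_binary, lib_path):
    """Run ldd on test_binary in the modified environment.

    Returns the ldd output lines that mention "blas", so you can see
    which shared library would actually be picked up.
    """
    out = subprocess.run(
        ["ldd", test_binary],
        env=blas_env(lib_path),
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if "blas" in line.lower()]


if __name__ == "__main__":
    # Placeholder path: replace with the OpenBLAS (or other BLAS) you built.
    env = blas_env("/path/to/libopenblas.so", base_env={})
    print(env["LD_PRELOAD"])
```

To run one of the test programs against a particular library, pass `env=blas_env(...)` to `subprocess.run` in the same way `show_linked_blas` does.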
Sage uses Atlas, which is a pain to install (automatic tuning is very slow), but is reasonably fast on Linux and OSX. It would be interesting to see how it compares with openblas, once M2 on Sage is working...
Here is an example, using M2, that hopefully indicates that using a better blas would have a significant effect on these computations. (Note: for one free resolution example, a similar rank computation on SL took 4 days, so even a constant-factor improvement would be excellent!)
restart
debug Core
kk = ZZp(32003, Strategy=>"Ffpack")
kk1 = ZZp(32003, Strategy=>"Flint")
elapsedTime M = random(ZZ^4000, ZZ^4000, Height=>32000, Density=>.2);
time M0 = mutableMatrix promote(M,kk);
time M1 = mutableMatrix promote(M,kk1);
time rank M0 -- this line uses the blas heavily
time rank M1 -- this line doesn't use the blas as far as I know.
elapsedTime M = random(ZZ^6000, ZZ^6000, Height=>32000, Density=>.2);
time M2 = mutableMatrix promote(M,kk);
time M3 = mutableMatrix promote(M,kk1);
time rank M2 -- this line uses the blas heavily
time rank M3 -- this line doesn't use the blas as far as I know.
-- the times for the 4 rank commands
-- MacBookPro, running 10.10.5, 16 GB ram, Mid 2014 Retina MacBookPro.
time rank M0 -- 2.27 sec
time rank M1 -- 7.82 sec
time rank M2 -- 7.01 sec
time rank M3 -- 40.99 sec
-- On an SL machine, which seems to be about the same speed as (perhaps a bit faster than)
-- my mac:
time rank M0 -- 16.72 sec
time rank M1 -- 7.85 sec
time rank M2 -- 52.32 sec
time rank M3 -- 23.9 sec
-- the blas code appears to be running somewhat more than 7 times slower on
-- SL than on the mac. I think ubuntu is similar to SL in speed here.
-- perhaps openblas can improve this?
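To make the "more than 7 times slower" estimate concrete, here is a small Python sketch that computes the SL/mac ratio for each of the four rank timings quoted above. The numbers are copied from the comments; the dictionary and function names are just for illustration.

```python
# Timings in seconds, copied from the comments above.
mac = {"M0": 2.27, "M1": 7.82, "M2": 7.01, "M3": 40.99}
sl = {"M0": 16.72, "M1": 7.85, "M2": 52.32, "M3": 23.9}

# M0 and M2 use the Ffpack strategy (blas-heavy); M1 and M3 use Flint.
strategy = {"M0": "Ffpack", "M1": "Flint", "M2": "Ffpack", "M3": "Flint"}


def slowdown(slow_seconds, fast_seconds):
    """How many times longer slow_seconds is than fast_seconds."""
    return slow_seconds / fast_seconds


for name in ("M0", "M1", "M2", "M3"):
    ratio = slowdown(sl[name], mac[name])
    print(f"{name} ({strategy[name]}): SL/mac = {ratio:.2f}")
```

The two Ffpack rows come out a bit above 7, matching the estimate, while the Flint rows do not show that slowdown (for the 6000x6000 case SL is actually faster), which is consistent with the blas being the bottleneck on SL.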
By the way, about my code in the previous post: sorry, I chose a time-inefficient way to create these matrices.
@DanGrayson this one is important :)
By the way, Sage has switched to openblas.
The generic (reference) lapack and blas don't take advantage of multiple CPU cores or of CPU vectorization (e.g. SSE2, which is ubiquitous now). Here's some information about benchmarking blas backends in numpy: https://markus-beuckelmann.de/blog/boosting-numpy-blas.html On that note, Eigen's API is different, so we would have to change our code, but it seems to be a strong contender: http://eigen.tuxfamily.org/index.php?title=Benchmark
By the way, Sage switched to openblas years ago.
With the CMake build, we have, too! Hopefully the autotools build is next.
Here's a quick benchmark. First I had to comment out everything after line 320 of quarantine/lapack.m2,
since an engine routine fails for matrices with zero rows or columns.
Using OpenBLAS:
[mahrud@noether build]$ ctest -R lapack --repeat-until-fail 10
Test project /home/mahrud/Projects/M2/M2/M2/BUILD/build
Start 3243: quarantine/lapack.m2
Test #3243: quarantine/lapack.m2 ............. Passed 1.72 sec
Start 3243: quarantine/lapack.m2
Test #3243: quarantine/lapack.m2 ............. Passed 1.78 sec
Start 3243: quarantine/lapack.m2
Test #3243: quarantine/lapack.m2 ............. Passed 1.77 sec
Start 3243: quarantine/lapack.m2
Test #3243: quarantine/lapack.m2 ............. Passed 1.89 sec
Start 3243: quarantine/lapack.m2
Test #3243: quarantine/lapack.m2 ............. Passed 1.84 sec
Start 3243: quarantine/lapack.m2
Test #3243: quarantine/lapack.m2 ............. Passed 1.93 sec
Start 3243: quarantine/lapack.m2
Test #3243: quarantine/lapack.m2 ............. Passed 1.92 sec
Start 3243: quarantine/lapack.m2
Test #3243: quarantine/lapack.m2 ............. Passed 2.01 sec
Start 3243: quarantine/lapack.m2
Test #3243: quarantine/lapack.m2 ............. Passed 1.99 sec
Start 3243: quarantine/lapack.m2
1/1 Test #3243: quarantine/lapack.m2 ............. Passed 2.09 sec
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 19.18 sec
Compared with LAPACK/BLAS:
[mahrud@noether blas]$ ctest -R lapack --repeat-until-fail 10
Test project /home/mahrud/Projects/M2/M2/M2/BUILD/blas
Start 522: quarantine/lapack.m2
Test #522: quarantine/lapack.m2 ............. Passed 3.04 sec
Start 522: quarantine/lapack.m2
Test #522: quarantine/lapack.m2 ............. Passed 3.12 sec
Start 522: quarantine/lapack.m2
Test #522: quarantine/lapack.m2 ............. Passed 3.17 sec
Start 522: quarantine/lapack.m2
Test #522: quarantine/lapack.m2 ............. Passed 3.20 sec
Start 522: quarantine/lapack.m2
Test #522: quarantine/lapack.m2 ............. Passed 3.16 sec
Start 522: quarantine/lapack.m2
Test #522: quarantine/lapack.m2 ............. Passed 3.13 sec
Start 522: quarantine/lapack.m2
Test #522: quarantine/lapack.m2 ............. Passed 2.93 sec
Start 522: quarantine/lapack.m2
Test #522: quarantine/lapack.m2 ............. Passed 2.96 sec
Start 522: quarantine/lapack.m2
Test #522: quarantine/lapack.m2 ............. Passed 2.97 sec
Start 522: quarantine/lapack.m2
1/1 Test #522: quarantine/lapack.m2 ............. Passed 2.94 sec
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 30.72 sec
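That's roughly a 38% reduction in total test time (30.72 sec down to 19.18 sec, about a 1.6x speedup).
The effect on the gb tests is even larger: about a 58% reduction in total test time (21.09 sec down to 8.84 sec, about a 2.4x speedup).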
OpenBLAS:
[mahrud@noether build]$ ctest -R normal/gb
Test project /home/mahrud/Projects/M2/M2/M2/BUILD/build
Start 2974: normal/gb-matrix-lift.m2
1/14 Test #2974: normal/gb-matrix-lift.m2 ......... Passed 0.51 sec
Start 2975: normal/gb-skew-ZZ.m2
2/14 Test #2975: normal/gb-skew-ZZ.m2 ............. Passed 0.53 sec
Start 2976: normal/gb-snapp-bug.m2
3/14 Test #2976: normal/gb-snapp-bug.m2 ........... Passed 0.50 sec
Start 2977: normal/gb2.m2
4/14 Test #2977: normal/gb2.m2 .................... Passed 0.51 sec
Start 2978: normal/gbQQbug.m2
5/14 Test #2978: normal/gbQQbug.m2 ................ Passed 0.57 sec
Start 2979: normal/gbZZ-2.m2
6/14 Test #2979: normal/gbZZ-2.m2 ................. Passed 0.52 sec
Start 2980: normal/gbZZ-mingens.m2
7/14 Test #2980: normal/gbZZ-mingens.m2 ........... Passed 0.56 sec
Start 2981: normal/gbZZ13.m2
8/14 Test #2981: normal/gbZZ13.m2 ................. Passed 0.78 sec
Start 2982: normal/gbZZautoreduction.m2
9/14 Test #2982: normal/gbZZautoreduction.m2 ...... Passed 0.52 sec
Start 2983: normal/gbZZbug.m2
10/14 Test #2983: normal/gbZZbug.m2 ................ Passed 1.33 sec
Start 2984: normal/gbZZbug2-a.m2
11/14 Test #2984: normal/gbZZbug2-a.m2 ............. Passed 0.61 sec
Start 2985: normal/gbZZbug2.m2
12/14 Test #2985: normal/gbZZbug2.m2 ............... Passed 0.62 sec
Start 2986: normal/gbinhom.m2
13/14 Test #2986: normal/gbinhom.m2 ................ Passed 0.52 sec
Start 2987: normal/gblimits.m2
14/14 Test #2987: normal/gblimits.m2 ............... Passed 0.54 sec
100% tests passed, 0 tests failed out of 14
Total Test time (real) = 8.84 sec
BLAS/LAPACK:
[mahrud@noether blas]$ ctest -R normal/gb
Test project /home/mahrud/Projects/M2/M2/M2/BUILD/blas
Start 3007: normal/gb-matrix-lift.m2
1/14 Test #3007: normal/gb-matrix-lift.m2 ......... Passed 1.33 sec
Start 3008: normal/gb-skew-ZZ.m2
2/14 Test #3008: normal/gb-skew-ZZ.m2 ............. Passed 1.32 sec
Start 3009: normal/gb-snapp-bug.m2
3/14 Test #3009: normal/gb-snapp-bug.m2 ........... Passed 1.35 sec
Start 3010: normal/gb2.m2
4/14 Test #3010: normal/gb2.m2 .................... Passed 1.33 sec
Start 3011: normal/gbQQbug.m2
5/14 Test #3011: normal/gbQQbug.m2 ................ Passed 1.38 sec
Start 3012: normal/gbZZ-2.m2
6/14 Test #3012: normal/gbZZ-2.m2 ................. Passed 1.35 sec
Start 3013: normal/gbZZ-mingens.m2
7/14 Test #3013: normal/gbZZ-mingens.m2 ........... Passed 1.42 sec
Start 3014: normal/gbZZ13.m2
8/14 Test #3014: normal/gbZZ13.m2 ................. Passed 1.73 sec
Start 3015: normal/gbZZautoreduction.m2
9/14 Test #3015: normal/gbZZautoreduction.m2 ...... Passed 1.41 sec
Start 3016: normal/gbZZbug.m2
10/14 Test #3016: normal/gbZZbug.m2 ................ Passed 2.25 sec
Start 3017: normal/gbZZbug2-a.m2
11/14 Test #3017: normal/gbZZbug2-a.m2 ............. Passed 1.59 sec
Start 3018: normal/gbZZbug2.m2
12/14 Test #3018: normal/gbZZbug2.m2 ............... Passed 1.55 sec
Start 3019: normal/gbinhom.m2
13/14 Test #3019: normal/gbinhom.m2 ................ Passed 1.45 sec
Start 3020: normal/gblimits.m2
14/14 Test #3020: normal/gblimits.m2 ............... Passed 1.37 sec
100% tests passed, 0 tests failed out of 14
Total Test time (real) = 21.09 sec
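For anyone who wants to redo the percentages from the ctest logs above, here is a minimal Python sketch. It pulls the "Total Test time (real)" figure out of ctest output and reports both the percent reduction and the speedup factor. The function names are illustrative, and the strings below are just the summary lines quoted in this thread.

```python
import re

# Matches the summary line that ctest prints at the end of a run.
TOTAL_RE = re.compile(r"Total Test time \(real\) =\s*([0-9.]+)\s*sec")


def total_seconds(ctest_output):
    """Extract the total wall-clock time from ctest output."""
    match = TOTAL_RE.search(ctest_output)
    if match is None:
        raise ValueError("no 'Total Test time (real)' line found")
    return float(match.group(1))


def percent_reduction(baseline, improved):
    """Percent of the baseline time that was saved."""
    return 100.0 * (baseline - improved) / baseline


# (reference BLAS/LAPACK summary line, OpenBLAS summary line)
runs = {
    "lapack x10": ("Total Test time (real) = 30.72 sec",
                   "Total Test time (real) = 19.18 sec"),
    "normal/gb": ("Total Test time (real) = 21.09 sec",
                  "Total Test time (real) = 8.84 sec"),
}

for name, (reference_log, openblas_log) in runs.items():
    t_ref = total_seconds(reference_log)
    t_open = total_seconds(openblas_log)
    print(f"{name}: {percent_reduction(t_ref, t_open):.1f}% less time, "
          f"{t_ref / t_open:.2f}x speedup")
```

One caveat when reading the gb numbers: nearly every gb test differs by roughly 0.8 to 1.0 sec between the two builds regardless of which test it is, so part of the gap may be a fixed per-process cost rather than blas throughput. That is only a guess from the logs above and would need a separate check.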
What remains is to switch the autotools build over to openblas.
If you don't want to build your own openblas: openblas ships an openblas.pc file, so you can get info about it via pkg-config (or rather, in autoconf, via PKG_CHECK_MODULES etc). Here is what we do in Sage: https://github.com/sagemath/sage/blob/develop/build/pkgs/openblas/spkg-configure.m4
Admittedly, it's complicated: the problem is that different Linux distros package openblas differently, sometimes you need a separate libcblas, etc. (But please ask questions about it, I wrote an initial version of that monster after all :-))
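As a rough illustration of the pkg-config route mentioned above (not of what the Sage m4 file does, which handles many more distro quirks), here is a Python sketch that asks pkg-config for the openblas compile and link flags. The function name is hypothetical, and it assumes the module is registered as `openblas`, which may differ on some distros.

```python
import shlex
import shutil
import subprocess


def pkg_config_flags(module="openblas"):
    """Return (cflags, libs) as lists of strings, or None if unavailable.

    None means either pkg-config is not installed or it cannot find
    a .pc file for the requested module.
    """
    if shutil.which("pkg-config") is None:
        return None
    # --exists exits non-zero when the module's .pc file is not found.
    if subprocess.run(["pkg-config", "--exists", module]).returncode != 0:
        return None

    def query(flag):
        out = subprocess.run(
            ["pkg-config", flag, module],
            capture_output=True,
            text=True,
            check=True,
        ).stdout
        return shlex.split(out)

    return query("--cflags"), query("--libs")


if __name__ == "__main__":
    flags = pkg_config_flags()
    if flags is None:
        print("openblas not found via pkg-config")
    else:
        cflags, libs = flags
        print("cflags:", cflags)
        print("libs:  ", libs)
```

If this prints nothing useful on a given distro, that is exactly the packaging variation described above (a different module name, or a separate libcblas), and the configure logic has to fall back to something else.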
Fixed in #3461
Word on the street is that this blas is far better than the default blas, on Ubuntu at least, and probably on other linuxes too. After compiling it, I think the new library just needs to be added to the link library list. It was recommended to me that we compile it from source, to make better use of the facilities on each target machine. However, we should probably also allow the use of the ubuntu openblas package for building distributions.
It would be nice to allow the use of openblas, and if it is actually far superior, make it the default. On Macs, though, we currently use the Accelerate framework, which seems to be very good. Even there, it might be good to compare them.
There are several reasons for this request, but my main interest right at the moment is to improve the speed of rank computations in ffpack (which is used in the fast non-minimal free resolution code). Currently, if I compare across machines, I find that Ubuntu is perhaps 5-10 times slower at such computations than my mac laptop, which is a year or two old.
I will add benchmarks to check this, so we can see what the actual improvement is.