Closed cdluminate closed 5 years ago
@cdluminate Thank you for your interest in BLIS. We have been eagerly waiting for the right person from the Debian community to come along and "sponsor" our project in the Debian/Ubuntu universe. :)
- Is BLIS mature enough to be used as a drop-in replacement of libblas.so.3 ?
My first instinct is to answer with an emphatic "yes." However, the framing of your question is ultimately subjective, and also dependent on the actual implementation referred to by libblas.so.3, as my understanding is that libblas.so.3 is merely a generic symlink to the current BLAS library. The actual shared library to which it links would be some specific implementation, whether it is the Fortran-77 reference library from netlib, OpenBLAS, or something else.
The lead developers of the BLIS project exercise great care in their approach to software development. One of our guiding principles is to try to "get it right" the first time so we are much less likely to have to revisit/fix it in the future, and we believe this methodical approach pays dividends in the long run. We still make mistakes from time to time, but I think the public record here on GitHub shows that we are quite responsive to the community's feedback, especially with bug reports. We sometimes fix issues within hours of them being reported, and BLIS has multiple tools for checking correctness at our disposal, including a comprehensive BLIS testsuite, a C translation of the netlib BLAS test drivers, and integration with Travis CI that uses the former two mechanisms to test multiple hardware configurations via Intel's software development emulator (SDE).
So, to conclude, the answer to your question is "probably," but it depends on what you expect of libblas.so.3. BLIS is still under active development, even if the core functionality exists in a mostly stable state. Whether BLIS belongs in Debian as an officially-supported package may also depend on how frequently our sponsor will be willing to provide updated packages. The more frequent, the better. (I would hope we would have the opportunity to push our latest code to Debian at least monthly, even if we don't need to exercise each opportunity.)
- Is there any benchmark that compares BLIS with other BLAS implementations such as OpenBLAS?
Yes. We have performed many performance experiments that compare BLIS against OpenBLAS. Last week, at our annual BLIS Retreat--a workshop here at UT-Austin centered around BLIS-related topics--Devangi Parikh (@dnparikh) presented performance results for multiple level-3 BLAS operations, floating-point datatypes, and problem sizes, and she did so for Intel Haswell/Broadwell, Intel SkylakeX, and Cavium ThunderX2 (ARMv8). The overall story of the performance results is that BLIS is remarkably competitive and consistent in its performance.
Devangi: could you link our guest to PDFs of your graphs from the Retreat?
Also calling out to @nschloe so he can comment about BLIS in general, if he likes.
I found BLIS as I was looking for BLAS operations on C-ordered arrays for NumPy. BLIS has that, but even better is the fact that it's developed in the open using a more modern language than Fortran.
The overall story of the performance results is that BLIS is remarkably competitive and consistent in its performance.
Plots about that should definitely go into the main README.
@nschloe Thanks for your comments, Nico. I agree that the time has come for us to include some basic plots in the source distribution.
@dnparikh Perhaps we should skip straight to Nico's idea instead, and then @cdluminate can view them through github or a git clone.
Comments:
We just had our "BLIS Retreat" (http://www.cs.utexas.edu/users/flame/BLISRetreat2018/). If you click on "program" and then scroll down to Devangi's talk, you will find http://www.cs.utexas.edu/users/flame/BLISRetreat2018/Slides/Devangi_BLIS_Retreat_2018_poster.pdf.
Field probably should have pointed out that BLIS is what AMD now distributes as part of its "ACL" replacement of the ACML library. This speaks to the stability of the library.
Robert
@fgvanzee This is the background for my first question:
Debian/Ubuntu have an alternatives system, by which the user can switch the BLAS implementation smoothly without recompiling any software, e.g.
$ sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
There are 4 choices for the alternative libblas.so.3-x86_64-linux-gnu (providing /usr/lib/x86_64-linux-gnu/libblas.so.3).
  Selection    Path                                              Priority   Status
------------------------------------------------------------
  0            /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3    40        auto mode
  1            /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3       35        manual mode
  2            /usr/lib/x86_64-linux-gnu/blas/libblas.so.3        10        manual mode
* 3            /usr/lib/x86_64-linux-gnu/libmkl_rt.so             1         manual mode
  4            /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3    40        manual mode
Press <enter> to keep the current choice[*], or type selection number: ⏎
All the alternative candidates for the libblas.so.3 shared object provide at least the standard BLAS API/ABI. In my question, the term libblas.so.3 means "standard BLAS ABI/API".
If BLIS's CBLAS implementation is mature enough, it could be used as a drop-in replacement for libblas.so.3, and it should be added as another alternative for libblas.so.3.
BLIS is still under active development, even if the core functionality exists in a mostly stable state.
That's good to hear.
Whether BLIS belongs in Debian as an officially-supported package may also depend on how frequently our sponsor will be willing to provide updated packages. The more frequent, the better.
I have sufficient permissions to upload packages to Debian. However, when Debian is about to release a new version, e.g. 10.0 Buster, the whole archive is frozen and nothing can be updated unless a package has a severe bug. I can only update packages regularly for Debian testing/unstable/experimental. So it makes sense to ask upstream before uploading, to make sure the package to be uploaded isn't too buggy.
(I would hope we would have the opportunity to push our latest code to Debian at least monthly, even if we don't need to exercise each opportunity.)
It won't be hard for me to update a Debian package as long as there is neither significant ABI/API change, nor significant change in build system.
Plots about that should definitely go into the main README.
+1
And thanks @rvdg for the plot.
One more question about packaging:
As shown in https://github.com/flame/blis/issues/255, BLIS has the best performance with openmp threading, as long as BLIS_NUM_THREADS is properly configured. This means the default configuration for the package should be
--enable-threading=openmp x86_64
Is this correct?
And does it make sense to also provide a --disable-threading build of BLIS, to avoid threading library clashes under certain conditions?
--enable-threading=openmp x86_64
Seems right, yes. You may want to do a thorough review of all configure options. For example, the BLAS integer size option is important to some people.
And does it make sense to provide another --disable-threading to avoid threading library clash under some certain conditions?
Could you clarify this question?
@fgvanzee I spotted a significant performance drop when a program used iomp and gomp at the same time.
Besides, Intel MKL provides the sequential threading (single thread) module. https://software.intel.com/en-us/mkl-linux-developer-guide-calling-intel-mkl-functions-from-multi-threaded-applications
Usage model: disable Intel MKL internal threading for the whole application
When used: Intel MKL internal threading interferes with application's own threading or may slow down the application.
@cdluminate Sounds like you are trying to link your gcc-compiled/linked libblis.so to your application with Intel's icc?
@devinamatthews @jeffhammond You guys have more experience with using icc. Could you comment here? I don't quite understand the issue.
EDIT: BTW everyone, I'm taking the day off today. :) Thanks for your patience, @cdluminate . Hopefully others can step in and help us figure out your issues.
@cdluminate I would guess that when using both iomp5 and gomp at the same time that you are ending up with N^2 threads. What happens when you run with OMP_NUM_THREADS=N and BLIS_NUM_THREADS=1 (assuming there is meaningful multithreading in the calling program)?
But, even if it "works" right now, mixing two different OpenMP runtimes is a recipe for disaster. In the context of a Debian package, I would think that gomp would make a sensible default. I guess that pthreads would be even better but as noted above we don't have a thread pool implementation and so performance suffers.
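The oversubscription guess above can be made concrete with a little arithmetic. This is a minimal sketch, assuming the worst case in which the two OpenMP runtimes do not share a thread pool and every application thread starts its own full team inside the BLAS call. The function name and the value N = 8 are illustrative and do not come from the thread.

```python
def worst_case_threads(app_threads, blas_threads):
    """Threads alive at once if every application thread starts its own BLAS team."""
    return app_threads * blas_threads


N = 8  # illustrative core count, not a value from the thread

# Both layers left at N threads (e.g. OMP_NUM_THREADS=N and BLIS_NUM_THREADS=N).
oversubscribed = worst_case_threads(N, N)

# The experiment suggested above: OMP_NUM_THREADS=N and BLIS_NUM_THREADS=1.
suggested = worst_case_threads(N, 1)

print(oversubscribed, suggested)
```

With the BLAS layer pinned to one thread, the total stays at N, which is why that experiment helps separate oversubscription from other causes of a slowdown.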
Intel OpenMP runtime defines the GOMP API so if you link it into a GCC program, you should not end up linking against libgomp.so. Please run ldd on your binary to see what libraries are being linked.
@fgvanzee @devinamatthews @jeffhammond Thanks for the pointers. I'm sorry, but that observation came from some old C++ code of mine that doesn't use BLIS ... using different threading libraries at the same time resulted in the creation of too many threads. GCC uses gomp by default but clang doesn't ...
Anyway, I think the Debian package should ship the OpenMP version first.
Just a NOTE: I registered libblis as an alternative candidate for libblas.so.3 and libblas.so, and assigned BLIS priority 37. I would be happy to raise the priority of BLIS if there is strong evidence that BLIS is better than OpenBLAS in terms of cblas_* performance.
The default priority chain for Debian and Ubuntu will look like this:
OpenBLAS > BLIS > Atlas > Netlib
Justification:
OpenBLAS > BLIS: if we don't touch any environment variable at all
BLIS > Atlas: BLIS in single thread (according to my test) is still faster than generic Atlas
~/D/b/blis ❯❯❯ sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
There are 5 choices for the alternative libblas.so.3-x86_64-linux-gnu (providing /usr/lib/x86_64-linux-gnu/libblas.so.3).
  Selection    Path                                              Priority   Status
------------------------------------------------------------
  0            /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3    40        auto mode
  1            /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3       35        manual mode
  2            /usr/lib/x86_64-linux-gnu/blas/libblas.so.3        10        manual mode
  3            /usr/lib/x86_64-linux-gnu/libblis.so.1             37        manual mode
* 4            /usr/lib/x86_64-linux-gnu/libmkl_rt.so             1         manual mode
  5            /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3    40        manual mode
Press <enter> to keep the current choice[*], or type selection number: ^C⏎
~/D/b/blis ❯❯❯ sudo update-alternatives --config libblas.so-x86_64-linux-gnu
There are 4 choices for the alternative libblas.so-x86_64-linux-gnu (providing /usr/lib/x86_64-linux-gnu/libblas.so).
  Selection    Path                                            Priority   Status
------------------------------------------------------------
  0            /usr/lib/x86_64-linux-gnu/openblas/libblas.so    40        auto mode
  1            /usr/lib/x86_64-linux-gnu/blas/libblas.so        10        manual mode
  2            /usr/lib/x86_64-linux-gnu/libblis.so             37        manual mode
* 3            /usr/lib/x86_64-linux-gnu/libmkl_rt.so           1         manual mode
  4            /usr/lib/x86_64-linux-gnu/openblas/libblas.so    40        manual mode
Press <enter> to keep the current choice[*], or type selection number: ^C⏎
Please ignore the lowest priority 1 assigned to MKL.
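The priority chain above follows directly from the numbers in the listing, since the alternatives system's auto mode picks the candidate with the highest priority. This is a minimal sketch of that selection rule using the priorities shown above. The dictionary keys and the helper name are illustrative labels, and MKL is left out because its priority of 1 is to be ignored.

```python
# Priorities copied from the update-alternatives listing above.
priorities = {
    "openblas": 40,
    "blis": 37,
    "atlas": 35,
    "netlib": 10,
}


def auto_choice(candidates):
    """Return the candidate that auto mode would pick: the highest priority."""
    return max(candidates, key=candidates.get)


# Sort from highest to lowest priority to reproduce the default chain.
chain = sorted(priorities, key=priorities.get, reverse=True)

print(" > ".join(chain))
print(auto_choice(priorities))
```

Raising the BLIS priority above 40 would make it the automatic default, which is why the number is tied to the performance evidence being discussed.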
I registered libblis as an alternative candidate to libblas.so.3 and libblas.so, and assigned blis with priority 37.
@cdluminate That's great news. We appreciate your willingness to include BLIS in the Debian universe. I'm sure many of our users will be pleased to be able to enable BLIS via the distribution's internal package management tools.
I would be happy to increase the priority of BLIS to higher value if there are strong proof that suggests BLIS is better than OpenBLAS, in terms of cblas_* performance.
It all depends on your hardware, operation, datatype, problem size, and how much threading you need, and maybe other factors as well. That said, Robert has pointed you to evidence that we gathered very recently on ThunderX2 and SkylakeX--two architectures designed for high performance. (Devangi didn't include OpenBLAS on the Haswell graphs because we couldn't get it working in multithreaded mode, despite intense efforts by deeply experienced individuals.) For now, OpenBLAS outperforms BLIS for small problems (less than about 100), but that is one of the only use cases for which OpenBLAS outpaces BLIS. The evidence suggests that, for almost all larger problems, almost all operations of almost all datatypes seem to yield better performance when executed via BLIS than with OpenBLAS.
Also consider that there are other metrics by which to measure "goodness" or "betterness" than raw performance. We believe strongly that BLIS stacks up well against OpenBLAS on virtually all of these other measures of software quality. Most important among these metrics: BLIS provides BLAS (and CBLAS) APIs, but unlike most other BLAS libraries, BLIS provides much more than just BLAS APIs. (The BLAS APIs are quite limiting for many individuals and applications, and BLIS contains APIs that attempt to break free of those limitations and expand the space of parameterization and storage formats.) For a full list of features that BLIS provides, I invite you to read our main github page, whose content is stored in the README.md file.
@cdluminate Clang uses the LLVM OpenMP runtime, which is the Intel OpenMP runtime and thus contains the GOMP symbols so it too should not lead to O(n^2) threads when layered with GCC. Pthreads+OpenMP can definitely hit this, however, and it is the expected behavior.
For interested parties, the GOMP symbols in KMP (Intel/LLVM OpenMP runtime) can be observed here: https://github.com/llvm-mirror/openmp/blob/master/runtime/src/kmp_gsupport.cpp
Another question: What's the correct configuration parameter to compile the 64-bit-index version of BLIS (or in MKL's term, ILP64 interface)? Is it -i 64 -b 64?
It depends on what you want. If you only care about the integer size in the BLAS API, then you only need -b 64. If you want to ensure that the native BLIS APIs use 64-bit integers, you should use -i 64. (And if you want both, use both.)
One could argue that -b 64 -i 32 is a disaster waiting to happen, but that would only happen with just -b 64 specified on 32-bit architectures. Maybe in this case we should imply -i 64 as well?
Could you explain the precise circumstances under which -b 64 -i 32 would be a disaster and why? I don't see it.
@fgvanzee Do you support the case where a 64-bit BLAS integer exceeds INT_MAX when the BLIS API uses 32-bit C int? (e.g. dot product on a vector of 17 GB of floats)
@jeffhammond No. I don't do anything fancy--just regular typecasts. I assume the developer/user knows what he's doing when he assigns, implicitly or explicitly, the integer size of both BLAS and BLIS integers.
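To make the truncation concern concrete, here is a minimal sketch of what a plain cast does to the length of such a vector. It assumes "17 GB" means 17 * 10^9 bytes, 4-byte single-precision floats, and the usual two's-complement wraparound when a 64-bit value is narrowed to a 32-bit int (formally implementation-defined in C). Python's ctypes.c_int32 is used only to mimic that narrowing.

```python
import ctypes

INT_MAX = 2**31 - 1

n_bytes = 17 * 10**9    # "17 GB", assumed to be decimal gigabytes
n_elems = n_bytes // 4  # element count for 4-byte floats

# ctypes integer types do no overflow checking, so the value wraps the way
# a typical C cast from a 64-bit integer to a 32-bit int would.
truncated = ctypes.c_int32(n_elems).value

print(n_elems, n_elems > INT_MAX)
print(truncated)
```

Under these assumptions the 32-bit side sees a negative length, so the call would fail or compute on the wrong range rather than on the vector the caller intended.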
@fgvanzee I encourage you to take a random sample of computational scientists you encounter in ICES or online to see if that assumption is valid.
There is a use case for dangerous truncation, but it is a dubious one. NWChem always uses 64-bit integers because these are used as offsets into distributed arrays (i.e. Global Arrays) and it is easy to have 1D arrays that contain more than INT_MAX elements yet never pass a local dimension greater than INT_MAX. However, there isn't a good reason to allow the BLIS API to be built with 32-bit integers for this case, since the additional cost of 64-bit integer arithmetic over 32-bit arithmetic in outer loops of BLIS algorithms should not be noticeable.
A good compromise here is to disallow 64b BLAS + 32b BLIS by default but have an override (e.g. allow_truncation) for folks who want to live dangerously.
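The proposed compromise amounts to a small configure-time check. The following is a hypothetical sketch of that rule only, not BLIS's actual configure logic. The function name is invented for illustration, and allow_truncation is the example override name suggested above.

```python
def check_int_sizes(blas_int_bits, blis_int_bits, allow_truncation=False):
    """Reject a BLAS integer wider than the BLIS integer unless overridden.

    Hypothetical helper illustrating the proposed default; it is not part of BLIS.
    """
    if blas_int_bits > blis_int_bits and not allow_truncation:
        raise ValueError(
            f"{blas_int_bits}-bit BLAS integers would be truncated to "
            f"{blis_int_bits}-bit BLIS integers; pass allow_truncation=True to override"
        )
    return True


# Accepted: BLAS integers fit inside BLIS integers.
check_int_sizes(64, 64)
check_int_sizes(32, 64)

# Accepted only because the caller opted in explicitly.
check_int_sizes(64, 32, allow_truncation=True)

# Rejected by default.
try:
    check_int_sizes(64, 32)
except ValueError as err:
    print("rejected:", err)
```

Making the dangerous combination opt-in keeps the NWChem-style use case possible while protecting users who set only the BLAS integer size.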
Thanks Jeff. I've opened issue #274 to track this.
update: blis 0.5.0 was built for six architectures on Ubuntu disco. https://launchpad.net/~lumin0/+archive/ubuntu/ppa/+sourcepub/9660195/+listing-archive-extra
The packaging has been significantly changed. In short, the source package yields the following binary packages:
libblis-dev             BLAS-like Library Instantiation Software Framework
libblis-openmp-dev      BLAS-like Library Instantiation Software Framework
libblis-pthread-dev     BLAS-like Library Instantiation Software Framework
libblis-serial-dev      BLAS-like Library Instantiation Software Framework
libblis1                BLAS-like Library Instantiation Software Framework - shared library
libblis1-openmp         BLAS-like Library Instantiation Software Framework - shared library
libblis1-pthread        BLAS-like Library Instantiation Software Framework - shared library
libblis1-serial         BLAS-like Library Instantiation Software Framework - shared library
libblis64-1             BLAS-like Library Instantiation Software Framework - shared library
libblis64-1-openmp      BLAS-like Library Instantiation Software Framework - shared library
libblis64-1-pthread     BLAS-like Library Instantiation Software Framework - shared library
libblis64-1-serial      BLAS-like Library Instantiation Software Framework - shared library
libblis64-dev           BLAS-like Library Instantiation Software Framework
libblis64-openmp-dev    BLAS-like Library Instantiation Software Framework
libblis64-pthread-dev   BLAS-like Library Instantiation Software Framework
libblis64-serial-dev    BLAS-like Library Instantiation Software Framework
BLIS has been compiled in 6 different configurations as above. libblis1 is a meta package that depends on libblis1-openmp or libblis1-pthread or libblis1-serial. libblis-dev is also a meta package, which depends on libblis1 and additionally on the development package corresponding to the underlying library. Note that no two of the three variants can coexist.
The 64-bit variants are similar to those with 32-bit indices. Their soname has been changed to libblis64.so.X, and they are registered as candidates for libblas64.so.3 in the alternatives system. The 64-bit version can co-exist with one 32-bit version.
If this looks good to you, I'll upload the package to Debian unstable shortly after your ack.
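The package layout described above is regular: three threading variants for each of two index sizes, plus one library meta package and one development meta package per index size. This is a minimal sketch that regenerates the 16 binary package names from that pattern. The naming rule is inferred from the list above rather than taken from the packaging scripts.

```python
THREADING_VARIANTS = ["openmp", "pthread", "serial"]
INDEX_SUFFIXES = ["", "64"]  # "" for 32-bit indices, "64" for 64-bit indices


def package_names():
    """Generate the binary package names following the pattern in the list above."""
    names = []
    for suffix in INDEX_SUFFIXES:
        base = "libblis" + suffix
        # The soname version is appended directly for libblis1,
        # but with a dash for libblis64-1.
        lib = base + ("-1" if suffix else "1")
        names.append(lib)            # library meta package
        names.append(base + "-dev")  # development meta package
        for variant in THREADING_VARIANTS:
            names.append(f"{lib}-{variant}")
            names.append(f"{base}-{variant}-dev")
    return sorted(names)


for name in package_names():
    print(name)
```

The six real library builds are the three threading variants for each index size; the remaining names are their development packages and the meta packages.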
@cdluminate This all sounds great. Thanks so much for your contributions.
If I were able to put together a bugfix release (0.5.1) in short order (24 hours?), would it be better for me to do that before you move forward so we can get all the latest commits into the Debian package? (Hopefully slipping in a new version prior to upload won't be too disruptive to you.)
@fgvanzee Just take your time. I'm fine with uploading 0.5.1 again after 24 hours. My only concern is that if 0.5.0 has a severe regression or something similar, please tell me so I can stop the upload.
@cdluminate Thanks. I don't think there are any really bad bugs in 0.5.0--mostly they are more benign improvements--but I'd have to check during my commit review (which is when I write the ReleaseNotes entry) to have a better sense of what bugs were fixed.
https://ftp-master.debian.org/new/blis_0.5.0-1.html Uploaded and pending review by the ftp team.
@cdluminate Sounds good, thanks! I'm working towards 0.5.1 today, tomorrow at the latest.
FYI: BLIS was added to the Gentoo main repository: https://github.com/gentoo/gentoo/commit/5c3ae58036af60b289d7d270753dfa5fda3a3b1b
@cdluminate Very cool. Thanks for letting us know!
I see Nico is working on BLIS packaging. I'm interested in packaging BLIS for Debian based on Nico's work. However I have some questions before doing the actual work: