flame / blis

BLAS-like Library Instantiation Software Framework

multithread by default? #292

Closed cdluminate closed 5 years ago

cdluminate commented 5 years ago

Widely used BLAS implementations such as MKL and OpenBLAS enable multithreading by default. Since BLIS is a drop-in BLAS replacement, I don't expect every user to be aware of the environment variable that sets the number of threads...

Why is BLIS running in single thread mode by default, unlike MKL/OpenBLAS?

fgvanzee commented 5 years ago

@cdluminate This is a good question. Without consulting with any of my collaborators, I would say that originally we didn't want to assume that OpenMP or pthreads was available. But now, pthreads is an unconditional dependency (for library initialization via pthread_once()) and OpenMP is pretty ubiquitous, especially via gcc.

@devinamatthews What do you think? Should we default to -t openmp? Can you think of any reason this would cause problems for some users?

fgvanzee commented 5 years ago

@cdluminate I misread your message originally. I now realize that you are not only proposing that we enable threading at configure-time by default, but that we max out the number of threads employed too? (In my previous post, I was only thinking of enabling threading, but leaving the number of threads used to 1 by default.)

devinamatthews commented 5 years ago

@fgvanzee The only problem (besides Windows ☹️) with unconditionally assuming OpenMP is people using Apple's compilers on macOS. However, if the default was "OpenMP, unless it's not available then Pthreads" then we would be fine.

As for the second issue (default number of threads), basically all of the major BLAS libraries automatically use as many threads as you have cores (Accelerate, OpenBLAS, MKL, etc.). The problem here is determining how many cores you have, especially with hyperthreading and on Bulldozer etc. hwloc is the most portable option but is not always (even usually maybe?) available. Other than that you have to use a mish-mash of platform-specific techniques.
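To illustrate why counting cores is harder than it sounds, here is a minimal Linux-only sketch of one of those platform-specific techniques. It counts distinct (physical id, core id) pairs in /proc/cpuinfo so that hyperthreads are not counted twice. The function names are hypothetical, and the approach assumes the x86-style /proc/cpuinfo layout; it is not what BLIS, hwloc, or any other library actually does.

```python
import os

def count_physical_cores(cpuinfo_text):
    """Count distinct (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    physical_id = core_id = None
    # The trailing "" guarantees the last record is flushed.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends the record for one logical CPU.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            core_id = value.strip()
    return len(cores)

def guess_core_count(path="/proc/cpuinfo"):
    """Best-effort core count; falls back to the logical CPU count."""
    try:
        with open(path) as f:
            n = count_physical_cores(f.read())
    except OSError:
        n = 0  # not Linux, or the file is unreadable
    # os.cpu_count() reports logical CPUs, so it may overcount with SMT.
    return n or os.cpu_count() or 1
```

On platforms where /proc/cpuinfo is missing or lacks these fields, the fallback silently returns logical CPUs, which is exactly the kind of inconsistency the comment above is worried about.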

fgvanzee commented 5 years ago

@fgvanzee The only problem (besides Windows ☹️) with unconditionally assuming OpenMP is people using Apple's compilers on macOS. However, if the default was "OpenMP, unless it's not available then Pthreads" then we would be fine.

Thanks Devin. In principle, I am fine with "OpenMP, unless not available in which case we use pthreads."

The problem here is determining how many cores you have, especially with hyperthreading and on Bulldozer etc. hwloc is the most portable option but is not always (even usually maybe?) available. Other than that you have to use a mish-mash of platform-specific techniques.

@cdluminate This second issue is tricky. Maxing out the threads is troublesome for people who only want application-level multithreading, and therefore want BLIS to execute sequentially (use only 1 thread internally)... though I suppose either way, we are inconveniencing one group or the other. But even if you assume that all users want all threads used by BLIS, we really only want one thread per physical core, which, as Devin alluded, can be difficult to determine. There is also the problem of determining the best thread-to-core affinity mapping. It is for these and related reasons that we would continue to default to using one thread even if OpenMP or pthreads is enabled by default.

Someday we hope to have a better solution to all this, but for now, I think remaining conservative about default numbers of threads is the least bad option.
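The conservative default described above can be modeled as a small lookup: use a thread count only if the user asked for one, otherwise run on one thread. The sketch below is a model of that policy, not BLIS's actual code. BLIS_NUM_THREADS is the variable mentioned in this thread; falling back to OMP_NUM_THREADS is an assumption of the sketch, and the function name is hypothetical.

```python
import os

def default_num_threads(env=None):
    """Return the thread count a conservative library would use at runtime."""
    env = os.environ if env is None else env
    # Assumption: the library-specific variable wins over the generic one.
    for name in ("BLIS_NUM_THREADS", "OMP_NUM_THREADS"):
        value = env.get(name, "").strip()
        if value.isdigit() and int(value) > 0:
            return int(value)
    # Nothing usable was requested: stay sequential.
    return 1
```

With no variables set the function returns 1, so a user who wants parallelism has to opt in explicitly.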

cdluminate commented 5 years ago

Thanks for the answer and it sounds reasonable.

jeffhammond commented 5 years ago

Defaulting to single-threaded execution is the right thing to do in an HPC ecosystem where MPI parallelism is ubiquitous but only a modest subset of codes use threads properly. I have no issue with enabling threads by default in the build, but runtime execution on multiple threads should require the user to choose this.

rvdg commented 5 years ago

@jeffhammond I am wondering: We are getting to the point where people who have been using, say, OpenBLAS are venturing to try BLIS. To make sure that such people, who are often not experts, get an apples-to-apples comparison on their first try, we may want to set our defaults regarding multithreading to be identical to the defaults that OpenBLAS uses, whatever they may be.

Comments?

cdluminate commented 5 years ago

@rvdg 's point is exactly what I wanted to say. People would be surprised to discover that their BLAS dropped back to single-threaded mode when they switched to BLIS to give it a shot. So to some extent this problem comes down to choosing between newcomers and those who have enough background on BLAS...

In particular, the way to "switch BLAS" that I mentioned refers to [1]. Many of the available alternatives run multithreaded by default even if the user sets no environment variable. This is another point that supports multithreading by default.

Debian has a package popularity contest for blis [2], and this data might be helpful in the future. But at present we haven't gathered enough data to deduce how most BLIS users want to use it.

[1] https://wiki.debian.org/DebianScience/LinearAlgebraLibraries
[2] https://qa.debian.org/popcon.php?package=blis

fgvanzee commented 5 years ago

@rvdg 's point is exactly what I wanted to say. People would be surprised to discover that their BLAS dropped back to single-threaded mode when they switched to BLIS to give it a shot. So to some extent this problem comes down to choosing between newcomers and those who have enough background on BLAS...

Yes, your original inquiry vis-a-vis default number of threads makes more sense now that I understand the context.

I wish there were a way to accommodate everyone simultaneously.

@devinamatthews @jeffhammond I basically still agree with you both. However, I wonder if there were a compromise possible. Given that @cdluminate has decided for BLIS to have multiple packages in Debian for users to choose from [2], what if we provided Zhou's desired behavior in a separate set of Debian packages? Something like libblis[64]-openmp-max. (This would require us to figure out how best to estimate the max number of threads, and use that number even if no environment variables are set, provided a non-default configure option is given. But let's assume we can figure out how to do that in a way that works more often than it doesn't.) Granted, this would result in more Debian packages for Zhou to maintain, but what do you think of the potential behavioral downsides? That is, if a non-idiot user reaches for the -max package, should we trust that they already (or will eventually) realize that it should not be used unless they expect a maxing out of threads, and that if they want/need serial execution that they should either set an environment variable (export BLIS_NUM_THREADS=1) or use a (different) package that defaults to 1 thread? Would this do more good than harm?

I'm not (yet) signing up to figure out how to make this happen. But I wanted to gauge whether this would be a feasible path forward, in principle. And of course, @cdluminate should also comment on this idea.

Debian has a package popularity contest for blis [2], and this data might be helpful in the future. But at present we haven't gathered enough data to deduce how most BLIS users want to use it.

Thanks for sharing this. It will be interesting to watch the data accumulate over time.

[2] https://qa.debian.org/popcon.php?package=blis

devinamatthews commented 5 years ago

Why don't we just do what MKL does and use all the threads unless we're in an active OpenMP region (assuming OpenMP enabled)? The second part is not much extra work. For the first part, I have some of this in TBLIS.
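The policy proposed above can be written down as a small decision function. This is only a model of the idea under stated assumptions, not MKL's or BLIS's implementation: the function name and parameters are hypothetical, and in C the `in_parallel_region` flag would typically come from OpenMP's `omp_in_parallel()`.

```python
def threads_for_call(n_cores, requested=None, in_parallel_region=False):
    """Pick a thread count for one BLAS call under an MKL-like policy."""
    if in_parallel_region:
        # The caller is already parallel: avoid nested oversubscription.
        return 1
    if requested is not None:
        # An explicit user request always wins outside parallel regions.
        return max(1, requested)
    # No request and no enclosing parallelism: use every core.
    return max(1, n_cores)
```

Note that this only sees parallelism inside the calling process. It cannot tell whether other processes on the same node are also using all the cores.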

hominhquan commented 5 years ago

As I see on CentOS 7, OpenBLAS is distributed in various package flavors:

openblas-devel.x86_64 : Development headers and libraries for OpenBLAS
openblas-static.x86_64 : Static version of OpenBLAS
openblas.x86_64 : An optimized BLAS library based on GotoBLAS2
openblas-openmp.x86_64 : An optimized BLAS library based on GotoBLAS2, OpenMP version
openblas-openmp64.x86_64 : An optimized BLAS library based on GotoBLAS2, OpenMP version
openblas-serial.x86_64 : An optimized BLAS library based on GotoBLAS2, serial version
openblas-serial64.x86_64 : An optimized BLAS library based on GotoBLAS2, serial version
openblas-threads.x86_64 : An optimized BLAS library based on GotoBLAS2, pthreads version
openblas-threads64.x86_64 : An optimized BLAS library based on GotoBLAS2, pthreads version

Users can then install whichever package suits them and, in the case of openblas-openmp, link their application with -lopenblaso (the build with OpenMP enabled by default) instead of -lopenblas (the single-threaded one).

Could we do the same? Define a default single-threaded BLIS package and provide multithreaded versions alongside it.

jeffhammond commented 5 years ago

@devinamatthews What MKL does is a disaster unless you also detect when multiple processes are using the library at the same time. There are multiple ways to do this that vary in their degree of evilness. I think there are approximately three classes of methods:

  1. Focus on MPI and detect the MPI process launcher environment and see how many physical cores are available to each process. I think I can do this relatively well, although I am concerned about dealing with HyperThreads/SMT.

  2. Use static data in the BLIS shared object to see how many times it has been enabled. This isn't really my expertise and I suspect it is evil anyways. However, I think TBB does something like this to mitigate oversubscription.

  3. Use performance monitoring to understand the actual system load and use the resources that are available. This is the hardest but most effective method.

If we were to implement one or more of these so as to provide a high-quality HPC user experience with BLIS even in the presence of nontrivial hardware subscription, it would be a huge selling point.
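As a rough illustration of the first method in the list above, the sketch below looks at environment variables that common launchers set and derives a per-process core budget. The variable names are assumptions about typical setups (Slurm's SLURM_CPUS_PER_TASK, Open MPI's OMPI_COMM_WORLD_LOCAL_SIZE, and MPI_LOCALNRANKS from MPICH-style Hydra launchers), the function name is hypothetical, and SMT is ignored entirely.

```python
def cores_per_process(total_cores, env):
    """Estimate cores available to this process, or None if no launcher is seen."""
    # Slurm exports an explicit per-task CPU count when one was requested.
    cpus = env.get("SLURM_CPUS_PER_TASK", "")
    if cpus.isdigit() and int(cpus) > 0:
        return int(cpus)
    # Otherwise divide the node's cores evenly among the ranks on this node.
    for name in ("OMPI_COMM_WORLD_LOCAL_SIZE", "MPI_LOCALNRANKS"):
        ranks = env.get(name, "")
        if ranks.isdigit() and int(ranks) > 0:
            return max(1, total_cores // int(ranks))
    # No known launcher detected; the caller must decide what to do.
    return None
```

For example, `cores_per_process(16, {"OMPI_COMM_WORLD_LOCAL_SIZE": "4"})` yields 4. A launcher that is not in the list is simply not detected, which is one reason this kind of heuristic cannot be relied on in general.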

devinamatthews commented 5 years ago

Aren't MPI users trained to set OMP_NUM_THREADS appropriately, though? I would think pthreads users would be most at risk.

jeffhammond commented 5 years ago

@devinamatthews Not if they don't know their BLAS library is going to use threads automatically. For example, I frequently build NWChem with the sequential MKL library so that I do not have to deal with OpenMP later, or because I actually want to make single-threaded MKL calls from OpenMP tasks in the application.

jeffhammond commented 5 years ago

@devinamatthews Some HPC environments deal with MPI+OpenMP automatically, hence do not require users to set OMP_NUM_THREADS. Slurm (srun) and Cray ALPS (aprun) have hooks for this from what I remember, although I do not use them.

loveshack commented 5 years ago

If we were to implement one or more of these so as to provide a high-quality HPC user experience with BLIS even in the presence of nontrivial hardware subscription, it would be a huge selling point.

I agree with the basic HPC issues, of course, specifically with MKL. However, from system management experience I'd be surprised if second-guessing the user could be reliable; I'd be happy to be proved wrong, obviously! (A reliable version should be applicable more generally than BLIS, and I wonder if sparse effort would be better spent elsewhere.)

I think I referred to it elsewhere, but https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=919272 has examples where packaging needs all three threading variants; the cp2k example is similar to Jeff's NWChem, and other MPI usage. I can probably expand on this if I think about it and it's helpful to have another HPC perspective. I still think it's really a packaging issue -- with both operational and packager hats on.

fgvanzee commented 5 years ago

@jeffhammond In your above comment, are you saying that these HPC environments set OMP_NUM_THREADS on the user's behalf, as appropriate, given how MPI is being used?

jeffhammond commented 5 years ago

@fgvanzee I believe that Intel MPI and Cray MPI/ALPS/SLURM do that automagically. It's not something you can assume in general.

cdluminate commented 5 years ago

https://github.com/explosion/thinc/releases/tag/v7.0.0. I found a downstream that wants the default single thread mode.

isuruf commented 5 years ago

@cdluminate, not in the next release though. explosion/cython-blis switched to expert API to run in single threaded mode.

honnibal commented 5 years ago

Ah, please don't worry about that downstream --- I'll happily make sure to control the thread mode as we want it, regardless of the default. Currently I do this by compiling the library to force single-threaded execution, but this may change to a runtime setting.

My experience as a library maintainer is that users are receiving very bad experiences from the default "launch all the threads" behaviours of other BLAS libraries. People are pretty much running in three types of environment:

a) A personal workstation or laptop;
b) A VM on the cloud;
c) A large bare-metal server, with like 50-100 cores.

Launching lots of threads is really only good in situation a). If you're on the cloud, you usually want to split your work into smaller chunks and provision smaller VMs. In situation c), launching lots of threads leads to really poor results. Someone checks the machine usage and sees a process at 6000%, so they figure they should wait to run their job. Meanwhile that job using 60 cores would often run faster if limited to 4 cores.

Situation c is a very common type of set up. It's by far the cheapest way to do GPU experimentation, so most data science teams (and many university research groups) are computing in this type of set up. I've talked to a great many people working like this who are very smart, but just don't happen to know that they shouldn't trust the default threading behaviour if they write A @ B in their Python code. It's very reasonable to believe that using more resources will make your program run faster.
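For users in situation c), one common workaround is to cap the BLAS thread count from the Python side before the numerical library is imported. The sketch below assumes the usual environment variables (OMP_NUM_THREADS, OPENBLAS_NUM_THREADS, MKL_NUM_THREADS, BLIS_NUM_THREADS) are honored by whichever BLAS is in use. The helper name is hypothetical, and the limit of 4 simply echoes the example above.

```python
import os

def cap_blas_threads(n, env=None):
    """Set thread-count variables unless the user has already chosen values."""
    env = os.environ if env is None else env
    for name in (
        "OMP_NUM_THREADS",
        "OPENBLAS_NUM_THREADS",
        "MKL_NUM_THREADS",
        "BLIS_NUM_THREADS",
    ):
        # setdefault keeps any explicit choice made in the shell.
        env.setdefault(name, str(n))
    return env

# Call this before importing numpy (or similar): BLAS libraries usually
# read these variables once, when they are loaded or initialized.
cap_blas_threads(4)
```

Setting the variables after the library has been loaded generally has no effect, which is part of why the default matters so much for users who never look at these settings.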

It's true that when people run their first benchmarks in situation a), they might get worse numbers from Blis than from the equivalent OpenBLAS or MKL. But I think it's much better to give a default that produces better actual results, when people are doing real work in situations b) or c).

jeffhammond commented 5 years ago

It's fine to do a multithreaded build by default but it's dangerous to enable multiple threads at runtime by default.

rvdg commented 5 years ago

I think the default should be that if you don’t specify whether you want single-threaded or multithreaded execution, then you don’t get a functional library.

In other words, either at build time or at run time a user has to specify, or there will be an error.

fgvanzee commented 5 years ago

either at build time or at run time

@rvdg Based on our earlier in-person conversations, I don't think you mean to include the "or at build time" caveat. Jeff makes a really good point: debating the default for the --enable-threading option to configure is different than debating the default runtime behavior of a library that was built with multithreading.

rvdg commented 5 years ago

Works for me.

loveshack commented 5 years ago

This seems quite wrong to me as a research computing support person and distribution packager, unless I've misunderstood something. I previously gave three examples from Debian, where this seems to originate; as far as I know, each needs a BLAS library built with a different threading option.

The current default serial version is perfectly fine. Anything else leads to tears, as Jeff pointed out for MKL. As a packager I'd rather BLIS experts spent time on other things. The HPC example I know best (pace NWCHEM people!) is CP2K -- one of the Debian examples. It specifically wants a serial BLAS and has to take special measures to control MKL. You control the shared memory parallelism at runtime with OMP_NUM_THREADS, and parallelism is provided by the program, not the libraries. Threaded BLAS or FFT just causes problems by using double the right number of threads. (I see users actually complain of such problems in various cases.)

At least on an ELFy system, if you want to use a different model of BLAS than the program was linked against, you just LD_PRELOAD it, or use trivial shared library shims on LD_LIBRARY_PATH, as I think I've referred to elsewhere.

The sane serial default does cause people who should know better to perpetuate myths about MKL speed, but I normally don't see them swayed by experimental evidence, so there are worse issues involved :-(.

fgvanzee commented 5 years ago

Thanks for your comments, @loveshack. Your perspective as a package manager is appreciated.

rvdg commented 5 years ago

My comment was merely meant to further instigate discussion. I’ll now let Field sort through the responses!

fgvanzee commented 5 years ago

I tend to think that, given a multithreaded build of libblis, there is not yet sufficient demand for automatically using many threads at runtime. And in fact, there seem to be several good reasons to default to single-threaded, chief among them that it's predictable and sane for people who are computing real-world problems (as opposed to running solely for the purposes of benchmarking). I tend to also like the single-threaded default at runtime because it forces people (or at least gives a very strong nudge) to learn how to express and request the appropriate amount of parallelism for their application. Forcing new users to learn more about the software[1] can be an inconvenience in the moment, but in the long run I think it will universally serve them well. And as Dave said, I'm fine with BLIS occasionally appearing to yield lower performance to those who are only giving it a cursory, didn't-read-the-docs attempt. These users' hastiness/ignorance doesn't change reality, and it doesn't change how we measure performance[2], which is methodically, thoughtfully, and with extensive experience and knowledge of how those measurements are best and most fairly gathered.

Footnotes:
[1] And let's be honest here: we're not asking that much of these users. All they have to do is read our Multithreading document.
[2] We'll be publishing performance graphs of several architectures soon on the BLIS github homepage, which will make it easy for people to see how we compare against the competition.

jeffhammond commented 5 years ago

@loveshack Multiple builds in the packaging system is a cowardly solution that refuses to engage software engineering intelligence. In some cases, merely compiling with multithreading enabled can reduce library performance. It's possible to work around this and end up with approximately two branches worth of overhead, but many programmers can't handle this. I think BLIS can.

jeffhammond commented 5 years ago

@fgvanzee "But all they have to do is read the documentation" - remember that you live in a country where people need to be told to not eat Tide pods.

cdluminate commented 5 years ago

I've seen convincing rationales and I think this discussion is reaching a point that single-threaded mode by default is saner than automatic multithreaded mode. Maybe this issue can be closed after a documentation update?

BTW, so for Debian packaging I should let the users install blis-serial by default if they don't specify any threading model?

fgvanzee commented 5 years ago

Maybe this issue can be closed after a documentation update?

@cdluminate Feel free to propose suggested edits (either here or in another issue).

Is it mainly that you want it to be extra crystal clear that (a) BLIS builds sequential by default, and (b) even when built with multithreading enabled, it will run sequentially by default?

cdluminate commented 5 years ago

Is it mainly that you want it to be extra crystal clear that (a) BLIS builds sequential by default, and (b) even when built with multithreading enabled, it will run sequentially by default?

Yes, I think Multithreading.md can be updated to make the two points clear, as they are notable differences between BLIS and OpenBLAS/MKL.

fgvanzee commented 5 years ago

BTW, so for Debian packaging I should let the users install blis-serial by default if they don't specify any threading model?

I'd like others to comment on this, but as far as I understand the question and context, I think that would work. I also think an OpenMP build would be a reasonable default, since it would run sequentially by default, but allow multithreading if the user/application decided to request it.

What do you think, @jeffhammond, @loveshack, @honnibal, @hominhquan, and @devinamatthews?

devinamatthews commented 5 years ago

In my opinion, what is best for HPC and what is best for desktop is not the same thing, but in the context of packaging for Debian the latter is more important. No serious HPC setup uses the OS package manager for important software like math libraries. As for what the default behavior should be on desktop, I would tend to go with whatever the other libraries do. For OpenBLAS this is to build with threads and then use them automatically. For MKL I believe this is the same (not 100% sure what happens with no env. vars. set), but it also detects active OpenMP regions and reverts to sequential, which we could also do. I guess defaulting to an OpenMP build but running in serial by default could be a reasonable compromise, but we absolutely will get complaints from users about "poor performance" relative to OpenBLAS. HPC users should be competent enough to configure and use the library appropriately, and if not, BLIS will not be the only software they have problems with.

loveshack commented 5 years ago

I tend to also like the single-threaded default at runtime because it forces people (or at least gives a very strong nudge) to learn how to express and request the appropriate amount of parallelism for their application. Forcing new users to learn more about the software[1] can be an inconvenience in the moment, but in the long run I think it will universally serve them well.

You are very wise, and it's not as if it shouldn't be trivial to change at runtime with dynamic libraries. I just wish many people didn't know more about topics they haven't worked on than people who have, and were easier to educate.

loveshack commented 5 years ago

@loveshack Multiple builds in the packaging system is a cowardly solution that refuses to engage software engineering intelligence.

Insulting people's intelligence who do portable performance engineering R&D won't encourage contributors, but I'd be grateful for insight from people with more experience of linear algebra packaging than me. (I hope that's not how it was actually intended.)

In some cases, merely compiling with multithreading enabled can reduce library performance.

Well, yes.

It's possible to work around this and end up with approximately two branches worth of overhead, but many programmers can't handle this. I think BLIS can.

What BLIS can do in isolation probably isn't relevant to global issues in a distribution. What do you have in mind to help?

loveshack commented 5 years ago

BTW, so for Debian packaging I should let the users install blis-serial by default if they don't specify any threading model?

I don't think the threading possibilities should be alternatives to each other -- as opposed to different implementations; they're not in Fedora. In particular, it seems Debian numpy specifically expects a pthreads version (or has that changed?). That doesn't exclude compatibility shims manipulated via LD_LIBRARY_PATH etc., of course, as in Fedora.

fgvanzee commented 5 years ago

Yes, I think Multithreading.md can be updated to make the two points clear, as they are notable differences between BLIS and OpenBLAS/MKL.

@cdluminate These changes have been made. Please take a look at the Multithreading.md document at your convenience.

cdluminate commented 5 years ago

@fgvanzee Wonderful. Thank you so much!

fgvanzee commented 5 years ago

Even though @cdluminate, the original poster of this issue, is now satisfied, I'd like to invite others to continue their previous discussions on this thread, if they so choose. Thank you everyone for sharing your thoughts so far.