conda-forge / openblas-feedstock

A conda-smithy repository for openblas.
BSD 3-Clause "New" or "Revised" License
9 stars 39 forks source link

fix threading when building OpenBLAS #9

Closed zym1010 closed 8 years ago

zym1010 commented 8 years ago

fix threading.

conda-forge-linter commented 8 years ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

ocefpaf commented 8 years ago

@zym1010 please bump the build number in the meta.yaml file to get a new build.

zym1010 commented 8 years ago

@ocefpaf Done!

ocefpaf commented 8 years ago

Also, if you are interested in this recipe, consider adding yourself in the maintainers list :wink:

jakirkham commented 8 years ago

So, we had initially disabled threading as there is a lot of bad behavior in OpenBLAS w.r.t. and it has several fun bugs as a consequence.

jakirkham commented 8 years ago

See these two comments and related xref'd issue for details.

zym1010 commented 8 years ago

Thanks. But I think those issues are fixed right? Also, without threading, it's basically useless...

zym1010 commented 8 years ago

I used to compile OpenBLAS myself under Ubuntu 14.04 with default settings and they worked without a problem.

zym1010 commented 8 years ago

Or, if really necessary, open a new fork of multithreaded OpenBLAS... I switched to conda-forge instead of Anaconda because sometimes they make crappy numpy and scipy packages for various reasons, yet their OpenBLAS is multithreaded. I don't want to lose the benefit of multithreaded OpenBLAS.

zym1010 commented 8 years ago

@jakirkham I checked those issues you mentioned. Can't they be avoided by setting environment variable at runtime?

jakirkham commented 8 years ago

Thanks. But I think those issues are fixed right?

As @njsmith says in his comment, they added a workaround. The problem is how they are scheduling threading and that has not changed. For instance, I'm looking at this issue ( https://github.com/xianyi/OpenBLAS/issues/900 ) and it looks like a variant on the exact same problem that he experienced.

zym1010 commented 8 years ago

@jakirkham I see. But can they be avoided by setting flags to restrict threading at runtime, rather than falling back to single-thread for everyone when building? Since I think most people won't encounter such issue...

patricksnape commented 8 years ago

Just to weigh in, given that the user can set the behaviour both via an environment variable or a call into OpenBLAS (in C) it seems sensible that we would build with threading and let the user avoid their own pathological cases?

jakirkham commented 8 years ago

@njsmith, do you have any idea as to the frequency that sched_yield problems occur for the user? In other words, how easy is it to land in a pathological threading case?

njsmith commented 8 years ago

Well, I did it once, and then they fixed that case. So no, not really. But presumably if it were happening to people every day (and in obvious ways), then they would be getting bug reports about it constantly, which doesn't seem to be the case.

njsmith commented 8 years ago

I guess another data point is that the numpy wheels are using openblas with threading enabled for a few months now, and we haven't gotten any complaints.

jakirkham commented 8 years ago

Thanks for the data points and thanks for the proposal @zym1010. Given this seems to be working fine for a large number of users, I'm more then happy to enable it.

jakirkham commented 8 years ago

Appears the whole stack linking to OpenBLAS directly needs to be rebuilt. Unfortunately I missed this before merging. Sorry about that. Perhaps future BLAS changes should be enhancement proposals.

Trying to help move the stack forward to avoid brokenness. Though I can really use others help. Here is what I have done so far. If you depend on NumPy, you should be fine to proceed. If you depend on SciPy, please wait for the deployment of its rebuild you should be fine. Really this is basically done at this point.

cc @conda-forge/core @conda-forge/numpy @conda-forge/scipy @conda-forge/scikit-learn @conda-forge/suitesparse @conda-forge/bob-math @conda-forge/hmat-oss @conda-forge/openturns @conda-forge/coincbc @conda-forge/caffe @conda-forge/cvxopt @conda-forge/simbody

jakirkham commented 8 years ago

Should add that if any builds are getting running into timeouts or getting segmentation faults. It may be that it is assuming to many threads on the CI. As we know 1 thread has worked in the past for these feedstocks (how they were built originally), please trying setting OPENBLAS_NUM_THREADS to 1 before building or testing so as to avoid using too many threads.

jakirkham commented 8 years ago

A question came up in another thread about whether this breaks NumPy extensions that don't use a BLAS.

I don't think so. It seems only things linking to the BLAS need to be rebuilt. This ends up being a much shorter list. That being said, I'm not thrilled with this breakage either. As a test, I have tried a couple numpy extensions (rank_filter, pyfftw) and they seem fine without a rebuild. Would be happy to test (or help others test) more things.

cc @moorepants @ocefpaf

183amir commented 8 years ago

Since old packages that were built with older openblas do not work with this new openblas build, we need to hotfix all the old ones to make sure they don't get installed with this new openblas.

jakirkham commented 8 years ago

So as we have not decided as a community to go through with hotfixing ATM. See discussion in this PR ( https://github.com/conda-forge/conda-forge.github.io/pull/170 ), I'm not really sure we can do that until we agree on hotfixing as a strategy. The old packages shouldn't be installed by default as the newer build number of OpenBLAS and the rebuilt packages is preferred. Thus by default the user gets a working environment once all PRs are merged.

However, I'm aware that you do like to use your old packages, @183amir. Pinning the build number would make this possible and this would be doable if hotfixing were permissible. Though it is not. I can see two other options.

  1. Delete the packages built with the old OpenBLAS and the old OpenBLAS.
  2. Relabel the old packages.

1 seems totally opposite to what you are trying to accomplish even though it "fixes" the problem. 2 could provide a legitimate fix that would work for you without trying to use hotfixing. If 2 is permissible, we should discuss how that relabeling is happening and to what.

This is a bit off topic, but I could see value in stamping packages with labels based on date or some similar granularity. This way people wanting old packages could pull everything in from that date without changing the packages themselves. Maybe worth a bullet point in PR ( https://github.com/conda-forge/conda-forge.github.io/pull/170 )'s overview of strategies.

jakirkham commented 8 years ago

It seems the Travis CI OS X workers are saturated. Not seeing anything building on our end (though it is not 100% obvious from the web interface for me so feel free to correct me). Maybe there is an issue they need to fix. This is probably going to hold up resolution of this issue.

jakirkham commented 8 years ago

🎉 SciPy has been rebuilt! 🎉

ocefpaf commented 8 years ago

A question came up in another thread about whether this breaks NumPy extensions that don't use a BLAS.

I don't think so.

You cannot know that. Some extensions may depend on numpy and openblas but have only numpy specified as dependency while it try to auto-discover the "blas" that numpy brought with it if any. (And note that is is fair as packaging BTW!)

Luckily it seems that scikit-learn had openblas specified and you found it. We will only know if there are others if people report breakages.

jakirkham commented 8 years ago

Some extensions may depend on numpy and openblas but have only numpy specified as dependency while it try to auto-discover the "blas" that numpy brought with it if any.

It is possible there are packages linking to the BLAS and not specifying it. However, this is no longer a BLAS specific problem. This is a general packaging problem on our end that needs addressing. Namely packages should list the dependencies they link to. Not doing so seems to have caused the issues we experienced with Qt. Though I agree that we should remain vigilant for similar issues.

Luckily it seems that scikit-learn had openblas specified and you found it. We will only know if there are others if people report breakages.

There was no luck involved. I investigated this when packaging scikit-learn and made sure that it was packaged appropriately.

jakirkham commented 8 years ago

So basically we are down to openturns (of things we know of). The PR for it works. Simply a matter of merging it and getting it built at this point. Everything else has been rebuilt.

As mentioned by @ocefpaf, there could be packages that we did not find depending on OpenBLAS that do. Keep your eyes open for loading issues with OpenBLAS. It may indicate such a package is involved. Please let us know so we can help you fix it.

Thanks everyone for your help.

jakirkham commented 8 years ago

openturns was rebuilt earlier today. Considering this resolved.