Closed peastman closed 2 years ago
It's in openblas
package along with the headers.
I'm trying to track down a build failure. Perhaps you have insight into what the real problem is? My library uses PyTorch and links to libtorch.dylib. It does not directly use OpenBLAS, but apparently PyTorch does at some level, and I get an error trying to configure with CMake, telling me that libopenblas.dylib doesn't exist. Installing the PyTorch package installs the libopenblas package but not the openblas package. So what is the real problem?
One way or another, the current behavior is that if you install the PyTorch conda package and then use CMake to build something that links to libtorch, you get an error. What is the correct place to fix this?
What's the error exactly?
make[2]: *** No rule to make target '/Users/peastman/miniconda3/envs/openmm/lib/libopenblas.dylib', needed by 'libOpenMMTorch.dylib'. Stop.
make[1]: *** [CMakeFiles/Makefile2:190: CMakeFiles/OpenMMTorch.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
Does anyone have insight on what the correct fix is? I don't know whether the change should be in PyTorch, OpenBLAS, or CMake, but something needs to be changed somewhere.
See https://github.com/conda-forge/openblas-feedstock/issues/134#issuecomment-1058528597 above. If a user simply installs PyTorch and tries to compile against it, they get a compilation error. The package includes libtorch, which contains the C++ API and is designed to be a library users link their own code to. Using the package in the intended way leads to errors. The fact that it uses OpenBLAS is an internal implementation detail of PyTorch, exactly the sort of thing conda is supposed to take care of for you.
Is there any good reason not to include the symlink in libopenblas? Omitting it creates problems for users. The version of the package on the defaults channel includes it.
Can anyone else comment on this? The problem is still not resolved. I believe the solution I originally proposed (including the alias in libopenblas
) is the correct solution. I'm open to other possibilities, but no one has suggested a better one or pointed out any downside to including the alias. It's a tiny change that saves a lot of problems for users and makes the conda-forge version of the package consistent with the default channel version.
cc @jakirkham @h-vetinari
Is there any good reason not to include the symlink in libopenblas?
Yes, to prevent downstream packages from accidentally linking to them. The symlink and headers are in the development package just like debian, fedora and countless other package managers. Are you suggesting that debian should package the symlink and the headers in the non dev package?
For eg: see the difference between https://packages.debian.org/stretch/amd64/libopenblas-base/filelist and https://packages.debian.org/stretch/amd64/libopenblas-dev/filelist
Again, I'm happy to consider other options. But you still haven't suggested any solution at all to the problem. Let me repeat one more time exactly what the steps are.
pytorch
package from CondaForge.libopenblas.dylib
can't be found.Notice that the user has not installed OpenBLAS. Until they hit this error message, they didn't even know that PyTorch used OpenBLAS. They certainly don't know there are two different packages related to OpenBLAS. Further, this error only happens with recent versions of PyTorch. With older versions the above steps work perfectly, because they require an older version of OpenBLAS than is available on CondaForge. Instead conda installs a package from the defaults channel which does include the alias.
Are you suggesting that debian should package the symlink and the headers in the non dev package?
What does this have to do with Debian? This is an error on Mac.
But you still haven't suggested any solution at all to the problem.
Ask pytorch-cpu-feedstock maintainers to patch the cmake file. It should not be necessary to link to libopenblas.dylib.
What does this have to do with Debian? This is an error on Mac.
It's not unique to Mac. This doesn't manifest in Linux x86, because pytorch uses MKL in Linux x86.
Anyways, this is a pytorch-cpu-feedstock issue and not this feedstock.
Ask pytorch-cpu-feedstock maintainers to patch the cmake file.
This has nothing to do with a PyTorch CMake file. PyTorch simply calls find_package(BLAS)
. CMake has built in support for finding a variety of BLAS implementations. When it finds OpenBLAS on Mac it expects there to be a library called libopenblas.dylib
. Every other way of installing OpenBLAS includes an alias with that name. The CondaForge package is unique in not including it. It is the one that is not following the convention. I already tried raising an issue with the CMake team and was explicitly told, "This is a bug in the conda package."
I think the issue is that pytorch has a host-dependency on openblas
(hence, with the development headers, library alias, etc.), but the openblas output doesn't have it's own run_export (and therefore just inherits the one from libopenblas
).
I think this setup is actually correct from the openblas side, but it sounds like the pytorch feedstock should add a runtime-dependency on openblas
; or alternatively, add a devel
package (that's suitable for being compiled against) that includes it.
CC @conda-forge/pytorch-cpu
Can any of the PyTorch package maintainers comment on this?
I think @h-vetinari is correct. PRs welcome.
No, @h-vetinari is not correct. The correct way is to patch pytorch's cmake files to remove the dependence on libopenblas.dylib
.
Or make pytorch link to the blas library as PRIVATE
instead of PUBLIC
.
The correct way is to patch pytorch's cmake files to remove the dependence on libopenblas.dylib.
Why do you keep saying that? PyTorch's CMake files do not refer to libopenblas. All it does is call find_package(BLAS)
, and CMake locates a BLAS implementation for it.
All it does is call find_package(BLAS), and CMake locates a BLAS implementation for it.
Yes, but it links to BLAS as a public dependency instead of a private dependency. See https://github.com/pytorch/pytorch/blob/e5bf87963d65d6273a6ec3505a50c07229643606/cmake/Dependencies.cmake#L194. I suggest you start reading https://cmake.org/cmake/help/latest/command/target_link_libraries.html
Please read up on how CMake target files work and have a look at /lib/python3.10/site-packages/torch/share/cmake/Caffe2/Caffe2Targets.cmake
file.
PyTorch's CMake files do not refer to libopenblas.
This is clearly wrong. Have a look at lib/python3.10/site-packages/torch/share/cmake/Caffe2/Caffe2Targets.cmake
and also read https://cmake.org/cmake/help/latest/command/export.html#exporting-targets to get an understanding.
@hmaarrfk what do you think is the right solution? I'm happy to make a PR. I just need to know what it should change.
I think isuruf is right (he has very good instincts on this) .
I think you have to hunt down a cmake file that has the word PUBLIC in it where it shouldn't.
Not very easy to find.
When it comes to this kind of hunting, git grep
does a lot better that GitHub search IME
Don't we have the package you are looking for
https://github.com/conda-forge/openmm-feedstock
Can you follow that recipe?
Peter maintains that feedstock IIUC. The issue is PyTorch (upstream) has something wonky in its CMake files AIUI
I've been looking into implementing this. The more I investigate it, the less comfortable I am with the proposed solution.
Here is the relevant PyTorch CMake script. It builds two lists called Caffe2_DEPENDENCY_LIBS
and Caffe2_PUBLIC_DEPENDENCY_LIBS
. Later in the build process, a different script links anything added to the former as private and anything added to the latter as public. As part of building the list, it looks for many different BLAS implementations and whichever one it finds, it adds the relevant library to Caffe2_PUBLIC_DEPENDENCY_LIBS
. This works for all implementations other than OpenBLAS. For example, the MKL conda package includes aliases for all its libraries with no need to install a separate dev package. And of course it also works for any OpenBLAS package other than the CondaForge one.
I could create a patch to the script so that OpenBLAS_LIB
would get added to Caffe2_DEPENDENCY_LIBS
instead of Caffe2_PUBLIC_DEPENDENCY_LIBS
. That might work, though I haven't tested it. But it raises a number of issues.
Instead we could trivially and robustly fix the problem by just adding one alias to the OpenBLAS package which is already present in other distributions of OpenBLAS, including the corresponding package on the conda defaults channel.
Do you agree with my assessment?
@hmaarrfk do you agree?
I can't respond at this moment, but if you think there is something wrong with pytorch the build on OSX, can you open an issue there and copy your assessment?
Generally speaking, people don't test their build systems in something exactly like conda. The state of the parent machine might be making a bug in their build system.
It wouldn't be the first time we propose patches to upstream projects improving their build system.
if you think there is something wrong with pytorch the build on OSX, can you open an issue there and copy your assessment?
I don't think there's anything wrong with pytorch. I think the problem is with OpenBLAS. See my analysis above. Can we please reopen this issue?
I'll reopen, but isuruf's original answer seems correct:
mamba install openblas
should address your challenges.
Correct in what sense? Can someone please respond to the analysis I posted above? I carefully laid out why trying to patch the PyTorch build script was a bad solution, and why the trivial change I originally suggested is both more robust and more consistent with the rest of the ecosystem. That was over three weeks ago, and not a single person has responded to anything I said.
This works for all implementations other than OpenBLAS. For example, the MKL conda package includes aliases for all its libraries with no need to install a separate dev package.
It could also be a cmake issue that it's not looking for a versioned libopenblas. MKL similarly has a devel package in conda-forge as well that's very similar to the openblas one.
I don't think the patch is as fragile as you say, and maybe it's even acceptable upstream.
All that said, I don't feel super strongly about all this, but with Isuru's strong opposition to changing this (and his experience & standing on these topics), it looks like you're fighting an uphill battle on this. I'd recommend you try the pytorch patch, and perhaps submit it upstream (or figure out why cmake is not picking up on the libopenblas that's present).
But the library alias should exist. In your analysis you assume it doesn't. However, osx+arm is currently depending at build time on both openblas
and libopenblas
Your assumption that the alias doesn't exist is wrong.
Isn't it?
If not please add a quick test to our CI befofe the cmake command
ls -lah $PREFIX/lib
test -f $PREFIX/lib/libopenblas.dynlib
From the osx+arm+python 3.9 log:
2022-04-27T01:48:33.5362930Z -- Found OpenBLAS libraries: /Users/runner/miniforge3/conda-bld/pytorch-recipe_1651023559589/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac/lib/libopenblas.dylib
or for ease of reading
2022-04-27T01:48:33.5362930Z -- Found OpenBLAS libraries: $PREFIX/lib/libopenblas.dylib
Your assumption that the alias doesn't exist is wrong.
Isn't it?
Download the package and see for yourself. I'm not talking about what happens while building the PyTorch package. I'm talking about what happens when you install PyTorch to use its C++ API, and try to compile your own code that links to it:
conda install pytorch
You're somehow supposed to figure out on your own that there exists a package called openblas
, that it's different from the libopenblas
package that was automatically installed with PyTorch, somehow track down what the difference between the two packages is (the names give no clue), and finally realize that you need to manually install that second package if you want PyTorch to work correctly.
with Isuru's strong opposition to changing this (and his experience & standing on these topics), it looks like you're fighting an uphill battle on this.
Isuru has not offered a single reason why it would be harmful to add the alias, nor addressed the fact that the alias is present in other comparable packages (such as MKL, or the defaults channel version of libopenblas
), nor pointed out any advantage of his suggested change which would be far more complicated, far more work, and far more fragile. In fact, he hasn't replied to anything I've said for nearly two months.
Unfortunately, i really don't have an Mac M1. I am unable to download and see for myself. A log file showing your error would go a long way.
Generally speaking: a one line patch to pytorch is not "hacking" and would likely get accepted upstream. Generally speaking: from an engineering standpoint, upstream isn't always interested in "integration in an ecosystem" as a first priority. So we often have to do small fixes like one line patches to build systems to get them to work with conda. Generally speaking: it is common to ship the bare .so files in a different package to ensure that software doesn't accendentallg link to them. This seems to be true even in other distributions like ubuntu or Fedora where devel or dev packages exist.
We have also worked alot with different upstream ~package providers~ software authors to help them improve their build systems.
They are, generally speaking, really interested in getting people to use their software. They just don't really test for "conda" at first.
Edit: package provider -> software authors
I just don't understand where you're coming from. Can you give one single disadvantage to including the alias? Is there any possible use case where it would cause problems? If not, then why is there any disagreement?
And let me emphasize again, this package is the outlier that's violating the conventions of the rest of the ecosystem. If you install the same package from the defaults channel, it includes the alias. If you install a different BLAS from conda-forge, such as MKL, it includes all necessary aliases. But this one doesn't, and that omission causes major problems for users.
Unfortunately, i really don't have an Mac M1. I am unable to download and see for myself.
Just download https://anaconda.org/conda-forge/libopenblas/0.3.20/download/osx-64/libopenblas-0.3.20-openmp_hb3cd9ec_0.tar.bz2. Expand it and see for yourself that it contains a library called libopenblas.0.dylib
, but not an alias called libopenblas.dylib
pointing to it.
Now download https://anaconda.org/anaconda/libopenblas/0.3.20/download/osx-64/libopenblas-0.3.20-h9a5756b_1.tar.bz2, which is the exact same version of the same package from the defaults channel, and see that it does include the alias.
A log file showing your error would go a long way.
See https://github.com/conda-forge/openblas-feedstock/issues/134#issuecomment-1058681071 above.
I just don't understand where you're coming from. Can you give one single disadvantage to including the alias? Is there any possible use case where it would cause problems?
You probably missed https://github.com/conda-forge/openblas-feedstock/issues/134#issuecomment-1083482954
I saw it. It just doesn't make any sense. No downstream package will link to it unless it's specifically trying to link to it. Linking to a library is not something that happens "accidentally". Every library you link to has to be specified by name. And let me emphasize yet again, this is how every other comparable package works. Including the same package from the default channel. Including other BLAS implementations from conda-forge.
Let me try to recap in a constructive way.
You asked for help, from volunteers, without providing any logs claiming that something was broken. Instead you pointed to how "defaults" does it.
There are reasons why you are choosing conda-forge. Some of the design decisions behind how conda-forge is built is the reason why you find this attractive.
Now, you had 3 core members of conda-forge respond to you in a consistent manner.
In fact, two of those are actively involved in building large packages like pytorch.
In https://github.com/conda-forge/openblas-feedstock/issues/134#issuecomment-1093084389 and https://github.com/conda-forge/openblas-feedstock/issues/134#issuecomment-1093142065 isuru pointed you exactly to the line of code that you would have to suggest changing upstream, and for conda-forge that would address your issues.
I, a maintainer of the pytorch feedstock, would personally accept such a patch, and help you justify it upstream.
We aren't saying we are infallible, but in this case, you got some pretty direct feedback about how to solve this particular issue. Please try to make an attempt to resolve it.
@hmaarrfk I've been trying very hard to be constructive through this whole discussion, but I'm also finding it incredibly frustrating. It began with Isuru basically dismissing my suggestions with a vague reference to Debian, an argument that made no sense and displayed a lack of understanding of the deep differences between the dependency models in Linux and Python. I nonetheless made a sincere effort to implement the solution he suggested, but quickly concluded it just didn't make sense. I wrote up a technical discussion of the problems with it, which was met with complete silence. It took over three weeks and multiple further posts to get anyone to reply, and when they did, I was basically told, "Isuru doesn't like it so it isn't going to happen." Meanwhile, Isuru has not replied to a single thing I've posted for two months, which makes a productive discussion entirely impossible.
I am 100% willing to do the work to fix the problem. But no one seems interested in what is obviously the correct fix, instead referring back to the earlier suggestion which has serious problems that I already explained in detail. Every time I try to discuss the technical reasons for preferring it, I am met with silence. Looking back over the discussion, I count no less than four times when that happened.
You asked for help, from volunteers, without providing any logs claiming that something was broken.
I did post it, over two months ago. And I provided another link back to it just yesterday.
I am trying really, really hard to fix a serious problem that is impacting my work. I am willing to do the work to fix it. If you say you're willing to accept a PR adding the missing alias, I'll do it immediately and we can close this issue as resolved. But instead I feel I'm just getting ignored, and it's very upsetting.
I'm not really sure if your assessment of the timeline is correct. isuruf replied in a very timely manner at the beginning of the issue.
isuruf likely has the most complete understanding of compilers, conda, and conda-forge of those i've interacted with.
It began with Isuru basically https://github.com/conda-forge/openblas-feedstock/issues/134#issuecomment-1083482954 with a vague reference to Debian
The internet can be dry at times. The balance between "brief" answers and "basically dismissing" is thin. However, he promptly replied to questions.
It took over three weeks and multiple further posts to get anyone to reply
Let me justify why I didn't reply. When asked for additional information, you provided 3 lines of a log. Often it is really important how you setup your information. When asked for the "exact" problem, often it is really valuable to provide full CMake Logs. We could have been more explicit about this request, but the level of experience compiling things yourself often indicates that you are comfortable communicating programming with fully reproducible logs.
I was basically told, "Isuru doesn't like it so it isn't going to happen."
Again. many of us trust isuru's judgement on these things.
Isuru has not replied to a single thing I've posted for two months, which makes a productive discussion entirely impossible
Again, nobody has to reply. if somebody feels like they cannot add anything constructive to the discussion, they often refrain from typing.
obviously the correct fix
I don't think any solution obvious. We often work collaboratively asking others for input and feedback.
I did https://github.com/conda-forge/openblas-feedstock/issues/134#issuecomment-1058681071, over two months ago. And I provided https://github.com/conda-forge/openblas-feedstock/issues/134#issuecomment-1146066152 just yesterday.
These logs are seriously incomplete. For an example on how to ask for external help, please see https://github.com/google/highway/issues/726 where I included as an attachment, log information going all the way back to early cmake commands and the installation of dependencies. These are hard to make without CIs, but you can likely start at when you type
mkdir build
cd build
cmake ..
[...]
and provide us all that information.
I am trying really, really hard to fix a serious problem that is impacting my work.
As much as we want to help, we often won't be fast enough for a mission critical work. Its really up to you, as an engineer, to decide on the risk factors you want to choose and how you want to build your software dependencies.
The infrastructure at conda-forge is flexible enough to allow you to build and upload your own packages. This REALLY helps decouple problems.
If you say you're willing to accept a PR adding the missing alias,
I don't think I can accept a PR to openblas-feedstock since those involved in maintaining it expressed that they do not agree with the assessment that any files are missing from the released packages.
But instead I feel I'm just getting ignored
You've been pointed in the right direction in https://github.com/conda-forge/openblas-feedstock/issues/134#issuecomment-1093142065
For what its worth, isuru's comments lead me directly to a small patch: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/116
which I've asked for feedback from upstream developers too https://github.com/pytorch/pytorch/pull/78883
In summary:
As a maintainer of the pytorch feedstack, and having studied the suggestions made by isuru, i agree that it is a bug in pytorch, and thus, in the pytorch feedstock. I'm working with other maintainers to address it. And thus I'm going to close this issue.
It may take a few weeks to finish the rebuild given how difficult it is to build pytorch on CIs.
When asked for additional information, you provided 3 lines of a log.
What I provided is exactly what I was asked for. Isuru's exact request was, "What's the error exactly?" I provided the exact error. At no point in the entire discussion did anyone ever say, "Your error message was insufficient, please post more details."
I'm not really sure if your assessment of the timeline is correct. isuruf replied in a very timely manner at the beginning of the issue.
I posted the above on May 3. More than two weeks later, on May 18, still no one had replied. I therefore posted again asking for any insight--keeping in mind I had provided all the information I had been asked for up to that point. He replied with a two word message, "See #134", linking back to an earlier one sentence message from before he asked me for the error and before I provided it.
Let me justify why I didn't reply. When asked for additional information, you provided 3 lines of a log.
Tell me precisely what information you want, and I'll be glad to provide it.
Again. many of us trust isuru's judgement on these things.
If he were participating in this discussion that might be a valid point. However, he has not said a single thing in two months, since before I posted a detailed analysis of why his suggested solution was not correct. That was nearly a month ago, and still not a single person has responded to a single one of the technical issues I raised. Not one. There is no discussion going on here. I keep trying to discuss technical issues, and all my attempts are completely ignored. If Isuru wants to join in the technical discussion, he's welcome to. If he doesn't want to that's his choice too. But his choice to ignore the discussion does not mean anything has been resolved.
You've been pointed in the right direction in https://github.com/conda-forge/openblas-feedstock/issues/134#issuecomment-1093142065
And I replied with https://github.com/conda-forge/openblas-feedstock/issues/134#issuecomment-1121489856 explaining why that was not, in fact, the right direction. Do you agree with the technical points I raised? Disagree? If you disagree, please say so and explain why. If you agree, please say that.
For what its worth, isuru's comments lead me directly to a small patch:
I strongly recommend against that patch. It is not the right solution. It is based on a misunderstanding of the problem. I've already explained the reasons why. I could keep repeating them, or expand on them further if you want to understand the actual issues here. What you're doing there is a band-aid to paper over the fact that the OpenBLAS package is broken and violates the conventions of the Python ecosystem. In the process, it changes the binary interface for libtorch which risks breaking downstream software.
- This changes the binary interface for libtorch, which risks breaking other software that uses it.
I could keep repeating them, or expand on them further if you want to understand the actual issues here.
Could you explain this binary interface part to me? (As in, explaining how it would change it and thus risk breaking software using it)
--
And, if you want an even more trivial solution, you could follow the very first response in the issue:
It's in
openblas
package along with the headers.
And, if you want an even more trivial solution, you could follow the very first response in the issue:
It's in openblas package along with the headers.
That's the problem. it's not the solution!
Could you explain this binary interface part to me? (As in, explaining how it would change it and thus risk breaking software using it)
Consider a downstream C++ code that links to libtorch, and also invokes BLAS routines directly. PyTorch supports many different BLAS implementations, and ensures that an appropriate one will be installed. Its build scripts contain lots of logic to locate different implementations and link to them correctly. You could repeat all that work in your own build system, but why do that when PyTorch has already done it for you? When you link to libtorch, that's guaranteed to bring in an appropriate BLAS that you can invoke with no further effort.
The proposed change would break that. It changes BLAS from public to private. Any package that expects it to be public would no longer compile.
Solution to issue cannot be found in the documentation.
Issue
The Mac packages distributed through the default channel include an alias with the name
libopenblas.dylib
pointing to the library. That allows it to be referenced with standard naming. The conda-forge packages are missing that alias. Instead you have to reference it with the namelibopenblas.0.dylib
. This causes problems for software that expects the standard name to work. See, for example, https://github.com/numpy/numpy/issues/14165 and https://gitlab.kitware.com/cmake/cmake/-/issues/23291.Would it be possible to add the missing alias?
Installed packages
Environment info