iamthebot opened 2 weeks ago
I think it might be because our builds stalled...
I expect it to take like 13 hours. Please check and report! Thanks!
No problem, thanks for the quick response! Will test tomorrow.
Thanks Mark! 🙏
Looks like one failed. Unfortunately, this appears to have happened after the build, during the conda-build DSO checking phase.
Are these kinds of CI issues common here? If so, what would you recommend (say, to a provider) to address the reliability issues?
> but during the conda-build DSO checking phase

Not sure if that is true; the other one seemed to have failed during the building phase.
I had to restart the aarch64 jobs.
That was the last part of the log that I could see in GitHub last night. Perhaps the logs had trouble loading? They are quite long.
Looking today at the raw logs to get them to load fully (attached in compressed form below to meet size limitations), I am seeing the following in those jobs.
From the CUDA 12 Linux ARM job (attached compressed log):
+ python -c 'import torch; torch.tensor(1).to('\''cpu'\'').numpy(); print('\''numpy support enabled!!!'\'')'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/conda/feedstock_root/build_artifacts/libtorch_1724888760332/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib/python3.8/site-packages/torch/__init__.py", line 290, in <module>
from torch._C import * # noqa: F403
ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /home/conda/feedstock_root/build_artifacts/libtorch_1724888760332/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib/python3.8/site-packages/torch/../../.././libcurand.so.10)
Unfortunately, some CUDA libraries are moving to EL8: https://github.com/conda-forge/cuda-feedstock/issues/28
So to run this test we likely need to use the AlmaLinux 8 image. An example would be PR: https://github.com/conda-forge/faiss-split-feedstock/pull/75
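A minimal sketch of what that might look like in conda-forge.yml (the os_version key is how feedstocks request a different base image; I have not verified that alma8 is the exact value conda-smithy expects for the AlmaLinux 8 image, so treat that value as an assumption drawn from the PR above):

os_version:
  # assumption: request the AlmaLinux 8 (glibc 2.28) image for the Linux ARM builds
  linux_aarch64: alma8

A rerender would then be needed so the CI configuration picks up the new image.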
Alternatively, we could just skip this test on CUDA ARM. Presumably, if the CPU test passes, that is a pretty good indication that this one will pass too.
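If we go the skip route instead, a rough sketch with conda-build selectors (assuming the aarch64 and cuda_compiler_version selector variables are usable here, and noting that the actual test may live in a script rather than in meta.yaml):

test:
  commands:
    # skip the NumPy interop check on CUDA ARM, where the CUDA 12 libraries need a newer glibc than the current image provides
    - python -c "import torch; torch.tensor(1).to('cpu').numpy(); print('numpy support enabled!!!')"  # [not (aarch64 and cuda_compiler_version != "None")]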
From the CPU-only Linux ARM job (attached compressed log):
+ python -c 'import torch; torch.tensor(1).to('\''cpu'\'').numpy(); print('\''numpy support enabled!!!'\'')'
Traceback (most recent call last):
File "<string>", line 1, in <module>
RuntimeError: PyTorch was compiled without NumPy support
Though it looks like the CPU ARM test doesn't pass at the moment. I think you understand this better than I do. I'm guessing we need to broaden this workaround to cover ARM: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/252 ?
I'm glad my test worked....
Possibly relevant: We encountered a "PyTorch was compiled without NumPy support" error when running on Linux aarch64 + CUDA (on NVIDIA GH200) using the conda-forge build of PyTorch 2.4.0.
Relevant output from conda list for the environment in which the error was encountered:
pytorch 2.4.0 cuda120_py312haadfe8f_200 conda-forge
pytorch-gpu 2.4.0 cuda120py312hecaec72_200 conda-forge
Rolling back to 2.3.0 remedied this issue. Looking at the build number, it seems that the build we installed preceded merging PR #252.
Thanks James! Yep this is expected
In PR ( https://github.com/conda-forge/pytorch-cpu-feedstock/pull/252 ), Mark worked around a bug in CMake to ensure PyTorch builds with NumPy support, and tested it in the recipe. Those packages would show up with a build/number of 201 (instead of 200, as your example shows). CMake has since integrated a fix as well, but it is not yet released.
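Once rebuilds with the workaround are published, a quick way to confirm you are getting the fixed package is to match on the build string (a sketch; the exact build string varies per variant, but the trailing number is the build/number):

# list the available 2.4.0 builds for your platform
conda search "pytorch=2.4.0" -c conda-forge
# explicitly request a build whose build number is 201
conda install "pytorch=2.4.0=*_201" -c conda-forge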
As noted above ( https://github.com/conda-forge/pytorch-cpu-feedstock/issues/254#issuecomment-2318432967 ), this test appears to be working correctly. However, it shows that the Linux ARM builds are failing, so no packages with a build/number of 201 are available yet. So we may need to extend Mark's workaround to Linux ARM.
I am guessing the fix would be to take this code and change it like so:
- - cmake !=3.30.0,!=3.30.1,!=3.30.2 # [osx and blas_impl == "mkl"]
- - cmake # [not (osx and blas_impl == "mkl")]
+ - cmake !=3.30.0,!=3.30.1,!=3.30.2 # [unix]
+ - cmake # [not unix]
@jcwomack is this something you would be willing to try in a new PR? 🙂
Hi @jakirkham, thanks for the quick response!
Apologies, but I've got quite limited availability for the next week or so, so would not be able to work on a PR myself at this time.
Solution to issue cannot be found in the documentation.
Issue
On a Linux x86_64 machine:
Interestingly, the CUDA 11.8 variant is picked when using this solve. I ran this using the libmamba solver, but it's also an issue with the classic solver (which ends up ignoring CONDA_OVERRIDE_CUDA and picks the cpu_generic_py312 variant). 2.3.1 does not have this issue. That is, if I run
CONDA_OVERRIDE_CUDA=12 conda install "pytorch<2.4.0"
I get a CUDA 12 version of PyTorch in the solve.
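For completeness, a read-only reproduction sketch (assumes a conda new enough to support the --solver flag, and assumes the problematic spec is pytorch 2.4.0):

# libmamba solve of 2.4.0 picks a CUDA 11.8 variant despite the override
CONDA_OVERRIDE_CUDA=12 conda install --dry-run --solver=libmamba "pytorch=2.4.0"
# the classic solver ignores CONDA_OVERRIDE_CUDA and picks the cpu_generic_py312 variant
CONDA_OVERRIDE_CUDA=12 conda install --dry-run --solver=classic "pytorch=2.4.0"
# 2.3.1 resolves to a CUDA 12 variant as expected
CONDA_OVERRIDE_CUDA=12 conda install --dry-run "pytorch<2.4.0"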
Installed packages

Environment info