conda-forge / pytorch-cpu-feedstock

A conda-smithy repository for pytorch-cpu.
BSD 3-Clause "New" or "Revised" License

PyTorch 2.4.0 Package Not Installable w/ CUDA 12 on Python 3.12 Linux x86_64 #254

Open iamthebot opened 2 weeks ago

iamthebot commented 2 weeks ago

Solution to issue cannot be found in the documentation.

Issue

On a Linux x86_64 machine:

CONDA_OVERRIDE_CUDA=12 conda install pytorch
...
Channels:
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/alfredo_luque/.airconda-environments/devel--demo--alfredo_luque--airconda_tutorial--v0.0.1

  added / updated specs:
    - pytorch

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    fsspec-2024.6.1            |     pyhff2d567_0         130 KB  conda-forge
    numpy-2.1.0                |  py312h1103770_0         8.0 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         8.1 MB

The following NEW packages will be INSTALLED:

  _sysroot_linux-64~ conda-forge/noarch::_sysroot_linux-64_curr_repodata_hack-3-h69a702a_16
  cuda-version       conda-forge/noarch::cuda-version-11.8-h70ddcb2_3
  cudatoolkit        conda-forge/linux-64::cudatoolkit-11.8.0-h4ba93d1_13
  cudnn              conda-forge/linux-64::cudnn-8.9.7.29-hbc23b4c_3
  filelock           conda-forge/noarch::filelock-3.15.4-pyhd8ed1ab_0
  fsspec             conda-forge/noarch::fsspec-2024.6.1-pyhff2d567_0
  gmp                conda-forge/linux-64::gmp-6.3.0-hac33072_2
  gmpy2              conda-forge/linux-64::gmpy2-2.1.5-py312h1d5cde6_1
  icu                conda-forge/linux-64::icu-75.1-he02047a_0
  jinja2             conda-forge/noarch::jinja2-3.1.4-pyhd8ed1ab_0
  kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-h4a8ded7_16
  libabseil          conda-forge/linux-64::libabseil-20240116.2-cxx17_he02047a_1
  libblas            conda-forge/linux-64::libblas-3.9.0-23_linux64_openblas
  libcblas           conda-forge/linux-64::libcblas-3.9.0-23_linux64_openblas
  libgfortran        conda-forge/linux-64::libgfortran-14.1.0-h69a702a_1
  libgfortran-ng     conda-forge/linux-64::libgfortran-ng-14.1.0-h69a702a_1
  libgfortran5       conda-forge/linux-64::libgfortran5-14.1.0-hc5f4f2c_1
  libhwloc           conda-forge/linux-64::libhwloc-2.11.1-default_hecaa2ac_1000
  libiconv           conda-forge/linux-64::libiconv-1.17-hd590300_2
  liblapack          conda-forge/linux-64::liblapack-3.9.0-23_linux64_openblas
  libmagma           conda-forge/linux-64::libmagma-2.8.0-hfdb99dd_0
  libmagma_sparse    conda-forge/linux-64::libmagma_sparse-2.8.0-h9ddd185_0
  libopenblas        conda-forge/linux-64::libopenblas-0.3.27-pthreads_hac2b453_1
  libprotobuf        conda-forge/linux-64::libprotobuf-4.25.3-h08a7969_0
  libstdcxx          conda-forge/linux-64::libstdcxx-14.1.0-hc0a3c3a_1
  libstdcxx-ng       conda-forge/linux-64::libstdcxx-ng-14.1.0-h4852527_1
  libtorch           conda-forge/linux-64::libtorch-2.4.0-cuda118_h8db9d67_301
  libuv              conda-forge/linux-64::libuv-1.48.0-hd590300_0
  libxml2            conda-forge/linux-64::libxml2-2.12.7-he7c6b58_4
  llvm-openmp        conda-forge/linux-64::llvm-openmp-18.1.8-hf5423f3_1
  markupsafe         conda-forge/linux-64::markupsafe-2.1.5-py312h98912ed_0
  mkl                conda-forge/linux-64::mkl-2023.2.0-h84fe81f_50496
  mpc                conda-forge/linux-64::mpc-1.3.1-h24ddda3_0
  mpfr               conda-forge/linux-64::mpfr-4.2.1-h38ae2d0_2
  mpmath             conda-forge/noarch::mpmath-1.3.0-pyhd8ed1ab_0
  nccl               conda-forge/linux-64::nccl-2.22.3.1-hee583db_1
  networkx           conda-forge/noarch::networkx-3.3-pyhd8ed1ab_1
  numpy              conda-forge/linux-64::numpy-2.1.0-py312h1103770_0
  python_abi         conda-forge/linux-64::python_abi-3.12-5_cp312
  pytorch            conda-forge/linux-64::pytorch-2.4.0-cuda118_py312h3690e1b_301
  sleef              conda-forge/linux-64::sleef-3.6.1-h1b44611_3
  sympy              conda-forge/noarch::sympy-1.13.2-pypyh2585a3b_103
  sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h4a8ded7_16
  tbb                conda-forge/linux-64::tbb-2021.12.0-h434a139_3
  typing_extensions  conda-forge/noarch::typing_extensions-4.12.2-pyha770c72_0
  zstd               conda-forge/linux-64::zstd-1.5.6-ha6fb4c9_0

The following packages will be DOWNGRADED:

  _openmp_mutex                                   4.5-2_gnu --> 4.5-2_kmp_llvm

Proceed ([y]/n)?

Interestingly, the CUDA 11.8 variant is picked when using this solve. I ran this using the libmamba solver but it's also an issue with the classic solver (which ends up ignoring CONDA_OVERRIDE_CUDA and picks the cpu_generic_py312 variant).

2.3.1 does not have this issue. That is, if I run CONDA_OVERRIDE_CUDA=12 conda install "pytorch<2.4.0" I get a CUDA 12 version of PyTorch in the solve.
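For context on why this solve is legal: a `cudaXY` build generally only requires the `__cuda` virtual package to be *at least* X.Y, so both the 11.8 and 12.x variants satisfy a machine reporting `__cuda=12`, and the solver is free to pick either. The sketch below is purely illustrative of that constraint logic (it is not conda's actual solver code; the build strings and the `compatible` helper are hypothetical):

```python
def compatible(build, cuda_override):
    """Would a pytorch build variant satisfy the __cuda virtual package?

    A cudaXY build needs __cuda >= X.Y; cpu builds always satisfy it.
    `cuda_override` models CONDA_OVERRIDE_CUDA / the detected driver.
    """
    if build.startswith("cpu"):
        return True
    if cuda_override is None:
        return False
    # e.g. "cuda118_py312h..." -> ("11", "8") -> (11, 8)
    ver = build.split("_")[0][len("cuda"):]
    need = (int(ver[:-1]), int(ver[-1]))
    have = tuple(int(x) for x in cuda_override.split("."))
    return have >= need

builds = ["cpu_generic_py312", "cuda118_py312h3690e1b_301"]
picks = [b for b in builds if compatible(b, "12.0")]
# Both variants satisfy __cuda=12.0, so without an explicit cuda-version
# pin the solver may legally choose the cuda118 build, as reported above.
```

This is also why pinning `cuda-version>=12` (rather than relying on the override alone) is the usual way to force the CUDA 12 variant.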

Installed packages

# packages in environment at /home/alfredo_luque/.airconda-environments/devel--demo--alfredo_luque--airconda_tutorial--v0.0.1:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                h4bc722e_7    conda-forge
ca-certificates           2024.7.4             hbcca054_0    conda-forge
ld_impl_linux-64          2.40                 hf3520f5_7    conda-forge
libexpat                  2.6.2                h59595ed_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc                    14.1.0               h77fa898_1    conda-forge
libgcc-ng                 14.1.0               h69a702a_1    conda-forge
libgomp                   14.1.0               h77fa898_1    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libsqlite                 3.46.0               hde9e2c9_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libzlib                   1.3.1                h4ab18f5_1    conda-forge
ncurses                   6.5                  he02047a_1    conda-forge
openssl                   3.3.1                hb9d3cd8_3    conda-forge
pip                       24.2               pyhd8ed1ab_0    conda-forge
python                    3.12.5          h2ad013b_0_cpython    conda-forge
readline                  8.2                  h8228510_1    conda-forge
setuptools                72.2.0             pyhd8ed1ab_0    conda-forge
tk                        8.6.13          noxft_h4845f30_101    conda-forge
tzdata                    2024a                h8827d51_1    conda-forge
wheel                     0.44.0             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge

Environment info

active environment : devel--demo--alfredo_luque--airconda_tutorial--v0.0.1
    active env location : /home/alfredo_luque/.airconda-environments/devel--demo--alfredo_luque--airconda_tutorial--v0.0.1
            shell level : 2
       user config file : /home/alfredo_luque/.condarc
 populated config files : /opt/conda/.condarc
                          /home/alfredo_luque/.condarc
          conda version : 24.7.1
    conda-build version : 24.5.1
         python version : 3.10.14.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=zen2
                          __conda=24.7.1=0
                          __cuda=12.4=0
                          __glibc=2.35=0
                          __linux=5.15.149=0
                          __unix=0=0
       base environment : /opt/conda  (writable)
      conda av data dir : /opt/conda/etc/conda
  conda av metadata url : None
           channel URLs : https://artifactory.d.musta.ch/artifactory/api/conda/conda-airbnb/linux-64
                          https://artifactory.d.musta.ch/artifactory/api/conda/conda-airbnb/noarch
                          https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /opt/conda/pkgs
                          /home/alfredo_luque/.conda/pkgs
       envs directories : /home/alfredo_luque/.airconda-environments
                          /opt/conda/envs
                          /home/alfredo_luque/.conda/envs
               platform : linux-64
             user-agent : conda/24.7.1 requests/2.32.3 CPython/3.10.14 Linux/5.15.149-99.162.amzn2.x86_64 ubuntu/22.04.4 glibc/2.35 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8
                UID:GID : 7331:7331
             netrc file : None
           offline mode : False
hmaarrfk commented 2 weeks ago

I think it might be because our builds stalled...

hmaarrfk commented 2 weeks ago

[image attachment]

hmaarrfk commented 2 weeks ago

I expect it to take like 13 hours. Please check and report! thanks!

iamthebot commented 2 weeks ago

I expect it to take like 13 hours. Please check and report! thanks!

No problem, thanks for the quick response! Will test tomorrow.

jakirkham commented 2 weeks ago

Thanks Mark! 🙏

Looks like one failed. Unfortunately this appears to be after the build, but during the conda-build DSO checking phase

Are these kinds of CI issue common here? If so, what things would you recommend (say to a provider) to address the reliability issues?

hmaarrfk commented 2 weeks ago

but during the conda-build DSO checking phase

Not sure if that is true; the other one seemed to have failed during the building phase.

I had to restart the aarch64 jobs.

jakirkham commented 2 weeks ago

That was the last part of the log I could see in GitHub last night. Perhaps the logs had trouble loading? The log files are quite long.

Looking today at the raw logs to get them to load fully (attached in compressed form below to meet size limits), I am seeing the following in those jobs:


From the CUDA 12 Linux ARM job (attached compressed log):

+ python -c 'import torch; torch.tensor(1).to('\''cpu'\'').numpy(); print('\''numpy support enabled!!!'\'')'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/conda/feedstock_root/build_artifacts/libtorch_1724888760332/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib/python3.8/site-packages/torch/__init__.py", line 290, in <module>
    from torch._C import *  # noqa: F403
ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /home/conda/feedstock_root/build_artifacts/libtorch_1724888760332/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib/python3.8/site-packages/torch/../../.././libcurand.so.10)

Unfortunately some CUDA libraries moving to EL8: https://github.com/conda-forge/cuda-feedstock/issues/28

So to run this test we likely need to use the AlmaLinux 8 image. An example would be PR: https://github.com/conda-forge/faiss-split-feedstock/pull/75

Alternatively, we could just skip this test on CUDA ARM. Presumably, if the CPU one passes, that is a pretty good indication that this one will pass too.
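The `GLIBC_2.27' not found` error above means the test image's glibc is older than what libcurand requires. For anyone debugging this locally, a quick way to compare the running system's glibc against a required version is via the standard library (a hedged sketch; `glibc_at_least` is a hypothetical helper, not part of the recipe):

```python
import platform

def version_tuple(v):
    """Parse a dotted version like "2.27" into (2, 27) for comparison."""
    return tuple(int(p) for p in v.split("."))

def glibc_at_least(required):
    """True if the running C library is glibc >= `required`.

    platform.libc_ver() returns ("", "") on non-glibc systems,
    in which case we conservatively return False.
    """
    name, version = platform.libc_ver()
    if name != "glibc" or not version:
        return False
    return version_tuple(version) >= version_tuple(required)

# The failing job's libcurand.so.10 needs GLIBC_2.27; EL7/CentOS 7 images
# ship glibc 2.17, so glibc_at_least("2.27") would be False there.
```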


From the CPU-only Linux ARM job (attached compressed log):

+ python -c 'import torch; torch.tensor(1).to('\''cpu'\'').numpy(); print('\''numpy support enabled!!!'\'')'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: PyTorch was compiled without NumPy support

Though it looks like the CPU ARM test doesn't pass atm. Think you understand this better than I. Guessing we need to broaden this workaround to cover ARM: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/252 ?
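The failing check here is the recipe's NumPy-interop smoke test. For local debugging, the same probe can be wrapped so it reports a status string instead of raising (a sketch only; `torch_numpy_interop` is a hypothetical helper, not part of the recipe):

```python
def torch_numpy_interop():
    """Probe whether the installed torch was built with NumPy support,
    mirroring the recipe's smoke test but reporting rather than raising."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    try:
        # The same call the recipe runs; raises RuntimeError with
        # "PyTorch was compiled without NumPy support" on affected builds.
        torch.tensor(1).to("cpu").numpy()
        return "numpy support enabled"
    except RuntimeError as exc:
        return f"no numpy interop: {exc}"
```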

hmaarrfk commented 2 weeks ago

Think you understand this better than I. Guessing we need to broaden this workaround to cover ARM: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/252 ?

I'm glad my test worked....

jcwomack commented 2 weeks ago

Possibly relevant: We encountered a "PyTorch was compiled without NumPy support" error when running on Linux aarch64 + CUDA (on NVIDIA GH200) using the conda-forge build of PyTorch 2.4.0.

Relevant output from conda list for the environment in which the error was encountered:

pytorch                   2.4.0           cuda120_py312haadfe8f_200    conda-forge
pytorch-gpu               2.4.0           cuda120py312hecaec72_200    conda-forge

Rolling back to 2.3.0 remedied this issue. Looking at the build number, it seems that the build we installed preceded merging PR #252.

jakirkham commented 2 weeks ago

Thanks James! Yep this is expected

In PR ( https://github.com/conda-forge/pytorch-cpu-feedstock/pull/252 ), Mark worked around a bug in CMake to ensure PyTorch builds with NumPy support, and added a test for it to the recipe. Those packages would show up with a build number of 201 (instead of the 200 your example shows). CMake has since integrated a fix upstream, but it is not yet released

As noted above ( https://github.com/conda-forge/pytorch-cpu-feedstock/issues/254#issuecomment-2318432967 ), this test appears to be working correctly. However, it shows that the Linux ARM builds are failing, so no packages with a build number of 201 are available yet. So we may need to extend Mark's workaround to Linux ARM

Am guessing the fix would be taking this code

https://github.com/conda-forge/pytorch-cpu-feedstock/blob/6dd85b3f85a72371b2d3ccf6a386de67e61d667e/recipe/meta.yaml#L100-L101

...and changing it like so...

-    - cmake !=3.30.0,!=3.30.1,!=3.30.2        # [osx and blas_impl == "mkl"]
-    - cmake                                   # [not (osx and blas_impl == "mkl")]
+    - cmake !=3.30.0,!=3.30.1,!=3.30.2        # [unix]
+    - cmake                                   # [not unix]

@jcwomack is this something you would be willing to try in a new PR? 🙂

jcwomack commented 2 weeks ago

Hi @jakirkham, thanks for the quick response!

Apologies, but I've got quite limited availability for the next week or so, so would not be able to work on a PR myself at this time.