aqlaboratory / openfold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Apache License 2.0

installation issue with cuda 12 #494

Status: Closed (closed by blakemertz 3 weeks ago)

blakemertz commented 1 month ago

I have tried several configurations to get openfold installed on my local machine, without success so far. I could use some help, since I need openfold as a dependency for a couple of other codes (in particular DiffDock-L). Here are my GPU, driver, and CUDA versions:

nvidia-smi 
Mon Oct 14 16:54:16 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060 ...    Off | 00000000:01:00.0  On |                  N/A |
| N/A   41C    P8              15W /  80W |     59MiB /  6144MiB |     14%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1382      G   /usr/lib/xorg/Xorg                           55MiB |
+---------------------------------------------------------------------------------------+

The gcc/g++/gfortran 12 series on my OS is at version 12.4. I believe 12.2 is the highest version supported by CUDA 12.1/12.2, but 12.4 is what my Debian testing repos ship.
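
A minimal sketch of how that comparison could be scripted, assuming the version string comes from gcc -dumpfullversion. The function names are hypothetical, and the bound of 12.2 is only the guess stated above, not a value taken from NVIDIA documentation.

```python
import re
import subprocess


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def host_gcc_version(gcc="gcc"):
    """Ask the compiler on PATH for its full version (needs gcc installed)."""
    out = subprocess.run(
        [gcc, "-dumpfullversion"], capture_output=True, text=True, check=True
    )
    return parse_version(out.stdout)


def exceeds_bound(found, max_supported):
    """True when the found (major, minor) is newer than max_supported."""
    return found[:2] > max_supported[:2]


# Assumed bound: 12.2 is the guess from the post, not a documented limit.
ASSUMED_MAX = (12, 2)
print(exceeds_bound(parse_version("12.4.0"), ASSUMED_MAX))
```

Calling host_gcc_version() inside the activated environment and again outside it shows whether the two shells resolve different compilers.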

These are the packages in my openfold environment, created from the pl_upgrades branch so that I can use PyTorch 2 and CUDA 12:

conda list
# packages in environment at /media/Data/binaries/miniconda3/envs/openfold:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                  2_kmp_llvm    conda-forge
absl-py                   2.1.0              pyhd8ed1ab_0    conda-forge
annotated-types           0.7.0                    pypi_0    pypi
appdirs                   1.4.4              pyh9f0ad1d_0    conda-forge
aria2                     1.37.0               hbc8128a_2    conda-forge
aws-c-auth                0.7.26               hc36b679_2    conda-forge
aws-c-cal                 0.7.4                h2abdd08_0    conda-forge
aws-c-common              0.9.27               h4bc722e_0    conda-forge
aws-c-compression         0.2.19               haa50ccc_0    conda-forge
aws-c-event-stream        0.4.3                h570d160_0    conda-forge
aws-c-http                0.8.8                h9b61739_1    conda-forge
aws-c-io                  0.14.18              h49c7fd3_7    conda-forge
aws-c-mqtt                0.10.4              h5c8269d_18    conda-forge
aws-c-s3                  0.6.4               h77088c0_11    conda-forge
aws-c-sdkutils            0.1.19               h038f3f9_2    conda-forge
aws-checksums             0.1.18              h038f3f9_10    conda-forge
awscli                    2.18.3          py310hff52083_0    conda-forge
awscrt                    0.21.2          py310h95a9d59_15    conda-forge
biopython                 1.84            py310hc51659f_0    conda-forge
blas                      2.116                       mkl    conda-forge
blas-devel                3.9.0            16_linux64_mkl    conda-forge
brotli-python             1.1.0           py310hc6cd4ac_1    conda-forge
bzip2                     1.0.8                h4bc722e_7    conda-forge
c-ares                    1.33.1               heb4867d_0    conda-forge
ca-certificates           2024.8.30            hbcca054_0    conda-forge
certifi                   2024.8.30          pyhd8ed1ab_0    conda-forge
cffi                      1.17.0          py310h2fdcea3_0    conda-forge
charset-normalizer        3.4.0              pyhd8ed1ab_0    conda-forge
click                     8.1.7           unix_pyh707e725_0    conda-forge
colorama                  0.4.6              pyhd8ed1ab_0    conda-forge
contextlib2               21.6.0             pyhd8ed1ab_0    conda-forge
cryptography              40.0.2          py310h34c0648_0    conda-forge
cuda-cudart               12.1.105                      0    nvidia
cuda-cupti                12.1.105                      0    nvidia
cuda-libraries            12.1.0                        0    nvidia
cuda-nvrtc                12.1.105                      0    nvidia
cuda-nvtx                 12.1.105                      0    nvidia
cuda-opencl               12.4.127                      0    nvidia
cuda-runtime              12.1.0                        0    nvidia
cudatoolkit               11.8.0              h4ba93d1_13    conda-forge
deepspeed                 0.12.4                   pypi_0    pypi
distro                    1.8.0              pyhd8ed1ab_0    conda-forge
dllogger                  1.0.0                    pypi_0    pypi
dm-tree                   0.1.6                    pypi_0    pypi
docker-pycreds            0.4.0                      py_0    conda-forge
docutils                  0.19            py310hff52083_1    conda-forge
einops                    0.8.0                    pypi_0    pypi
fftw                      3.3.10          nompi_hf1063bd_110    conda-forge
filelock                  3.16.1             pyhd8ed1ab_0    conda-forge
flash-attn                2.6.3                    pypi_0    pypi
fsspec                    2024.9.0           pyhff2d567_0    conda-forge
git                       2.46.0          pl5321hb5640b7_0    conda-forge
gitdb                     4.0.11             pyhd8ed1ab_0    conda-forge
gitpython                 3.1.43             pyhd8ed1ab_0    conda-forge
gmp                       6.3.0                hac33072_2    conda-forge
gmpy2                     2.1.5           py310hc7909c9_1    conda-forge
hhsuite                   3.3.0           py310pl5321hc31ed2c_12    bioconda
hjson                     3.1.0                    pypi_0    pypi
hmmer                     3.4                  hdbdd923_2    bioconda
icu                       75.1                 he02047a_0    conda-forge
idna                      3.10               pyhd8ed1ab_0    conda-forge
ihm                       1.3             py310h5b4e0ec_0    conda-forge
jinja2                    3.1.4              pyhd8ed1ab_0    conda-forge
jmespath                  1.0.1              pyhd8ed1ab_0    conda-forge
kalign2                   2.04                 h031d066_7    bioconda
keyutils                  1.6.1                h166bdaf_0    conda-forge
krb5                      1.21.3               h659f571_0    conda-forge
ld_impl_linux-64          2.43                 h712a8e2_1    conda-forge
libabseil                 20240116.2      cxx17_he02047a_1    conda-forge
libblas                   3.9.0            16_linux64_mkl    conda-forge
libcblas                  3.9.0            16_linux64_mkl    conda-forge
libcublas                 12.1.0.26                     0    nvidia
libcufft                  11.0.2.4                      0    nvidia
libcufile                 1.9.1.3                       0    nvidia
libcurand                 10.3.5.147                    0    nvidia
libcurl                   8.9.1                hdb1bdb2_0    conda-forge
libcusolver               11.4.4.55                     0    nvidia
libcusparse               12.0.2.55                     0    nvidia
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 hd590300_2    conda-forge
libexpat                  2.6.2                h59595ed_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc                    7.2.0                h69d50b8_2    conda-forge
libgcc-ng                 14.1.0               h77fa898_0    conda-forge
libgfortran-ng            14.1.0               h69a702a_0    conda-forge
libgfortran5              14.1.0               hc5f4f2c_0    conda-forge
libhwloc                  2.11.1          default_hecaa2ac_1000    conda-forge
libiconv                  1.17                 hd590300_2    conda-forge
liblapack                 3.9.0            16_linux64_mkl    conda-forge
liblapacke                3.9.0            16_linux64_mkl    conda-forge
libnghttp2                1.58.0               h47da74e_1    conda-forge
libnpp                    12.0.2.50                     0    nvidia
libnsl                    2.0.1                hd590300_0    conda-forge
libnvjitlink              12.1.105                      0    nvidia
libnvjpeg                 12.1.1.14                     0    nvidia
libprotobuf               4.25.3               h08a7969_0    conda-forge
libsqlite                 3.46.0               hde9e2c9_0    conda-forge
libssh2                   1.11.0               h0841786_0    conda-forge
libstdcxx-ng              14.1.0               hc0a3c3a_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libxml2                   2.12.7               he7c6b58_4    conda-forge
libzlib                   1.3.1                h4ab18f5_1    conda-forge
lightning-utilities       0.11.7             pyhd8ed1ab_0    conda-forge
llvm-openmp               15.0.7               h0cdce71_0    conda-forge
markupsafe                2.1.5           py310h2372a71_0    conda-forge
mkl                       2022.1.0           h84fe81f_915    conda-forge
mkl-devel                 2022.1.0           ha770c72_916    conda-forge
mkl-include               2022.1.0           h84fe81f_915    conda-forge
ml-collections            0.1.1              pyhd8ed1ab_0    conda-forge
modelcif                  0.7                pyhd8ed1ab_0    conda-forge
mpc                       1.3.1                h24ddda3_0    conda-forge
mpfr                      4.2.1                h38ae2d0_2    conda-forge
mpmath                    1.3.0              pyhd8ed1ab_0    conda-forge
msgpack-python            1.0.8           py310h25c7140_0    conda-forge
ncurses                   6.5                  he02047a_1    conda-forge
networkx                  3.3                pyhd8ed1ab_1    conda-forge
ninja                     1.11.1.1                 pypi_0    pypi
numpy                     1.26.0          py310hb13e2d6_0    conda-forge
ocl-icd                   2.3.2                hd590300_1    conda-forge
ocl-icd-system            1.0.0                         1    conda-forge
openmm                    7.7.0           py310hccf1d78_1    conda-forge
openssl                   3.3.1                hb9d3cd8_3    conda-forge
packaging                 24.1               pyhd8ed1ab_0    conda-forge
pandas                    2.2.2           py310hf9f9076_1    conda-forge
pcre2                     10.44                hba22ea6_2    conda-forge
pdbfixer                  1.8.1              pyh6c4a22f_0    conda-forge
perl                      5.32.1          7_hd590300_perl5    conda-forge
pip                       24.2               pyh8b19718_1    conda-forge
platformdirs              4.3.6              pyhd8ed1ab_0    conda-forge
prompt-toolkit            3.0.38             pyha770c72_0    conda-forge
prompt_toolkit            3.0.38               hd8ed1ab_0    conda-forge
protobuf                  4.25.3          py310ha8c1f0e_0    conda-forge
psutil                    6.0.0           py310hc51659f_0    conda-forge
py-cpuinfo                9.0.0                    pypi_0    pypi
pycparser                 2.22               pyhd8ed1ab_0    conda-forge
pydantic                  2.9.2                    pypi_0    pypi
pydantic-core             2.23.4                   pypi_0    pypi
pynvml                    11.5.3                   pypi_0    pypi
pysocks                   1.7.1              pyha2e5f31_6    conda-forge
python                    3.10.14         hd12c33a_0_cpython    conda-forge
python-dateutil           2.9.0              pyhd8ed1ab_0    conda-forge
python-tzdata             2024.2             pyhd8ed1ab_0    conda-forge
python_abi                3.10                    5_cp310    conda-forge
pytorch                   2.1.2           py3.10_cuda12.1_cudnn8.9.2_0    pytorch
pytorch-cuda              12.1                 ha16c6d3_5    pytorch
pytorch-lightning         2.4.0              pyhd8ed1ab_0    conda-forge
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2024.2             pyhd8ed1ab_0    conda-forge
pyyaml                    5.4.1           py310h5764c6d_4    conda-forge
readline                  8.2                  h8228510_1    conda-forge
requests                  2.32.3             pyhd8ed1ab_0    conda-forge
ruamel.yaml               0.17.21         py310h1fa729e_3    conda-forge
ruamel.yaml.clib          0.2.8           py310h2372a71_0    conda-forge
s2n                       1.5.1                h3400bea_0    conda-forge
scipy                     1.14.1          py310ha3fb0e1_0    conda-forge
sentry-sdk                2.16.0             pyhd8ed1ab_0    conda-forge
setproctitle              1.3.3           py310h2372a71_0    conda-forge
setuptools                59.5.0          py310hff52083_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
smmap                     5.0.0              pyhd8ed1ab_0    conda-forge
sympy                     1.13.3          pypyh2585a3b_103    conda-forge
tbb                       2021.12.0            h434a139_3    conda-forge
tk                        8.6.13          noxft_h4845f30_101    conda-forge
torchmetrics              1.4.2              pyhd8ed1ab_0    conda-forge
torchtriton               2.1.0                     py310    pytorch
tqdm                      4.62.2             pyhd8ed1ab_0    conda-forge
typing-extensions         4.12.2               hd8ed1ab_0    conda-forge
typing_extensions         4.12.2             pyha770c72_0    conda-forge
tzdata                    2024b                hc8b5060_0    conda-forge
urllib3                   1.26.19            pyhd8ed1ab_0    conda-forge
wandb                     0.16.6             pyhd8ed1ab_1    conda-forge
wcwidth                   0.2.13             pyhd8ed1ab_0    conda-forge
wheel                     0.44.0             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
yaml                      0.2.5                h7f98852_2    conda-forge
zstd                      1.5.6                ha6fb4c9_0    conda-forge
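
The listing above contains CUDA packages from more than one release line (for example cudatoolkit 11.8.0 next to cuda-runtime 12.1.0 and cuda-opencl 12.4.127). Whether that mix contributes to the failure is not established here. The sketch below only makes such a mix easy to spot in any conda list output. The function name and the choice of name prefixes are assumptions made for illustration.

```python
def cuda_versions(conda_list_text, prefixes=("cuda", "pytorch-cuda")):
    """Group CUDA-related packages from `conda list` text by major.minor."""
    found = {}
    for line in conda_list_text.splitlines():
        fields = line.split()
        # Skip comment/header rows and anything without a version column.
        if line.startswith("#") or len(fields) < 2:
            continue
        name, version = fields[0], fields[1]
        if name.startswith(prefixes):
            key = ".".join(version.split(".")[:2])
            found.setdefault(key, []).append(name)
    return found


# Four rows copied from the listing above.
sample = """\
# Name                    Version                   Build  Channel
cuda-runtime              12.1.0                        0    nvidia
cuda-opencl               12.4.127                      0    nvidia
cudatoolkit               11.8.0              h4ba93d1_13    conda-forge
pytorch-cuda              12.1                 ha16c6d3_5    pytorch
"""

for release, names in sorted(cuda_versions(sample).items()):
    print(release, ", ".join(names))
```

Library packages such as libcublas follow their own version numbering, so they are deliberately left out of the default prefixes.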

When I install the third-party dependencies, I get the following output, which shows that the installation did not complete (setup.py install runs as part of this script and fails):

./scripts/install_third_party_dependencies.sh 
--2024-10-14 16:41:09--  https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
Resolving git.scicore.unibas.ch (git.scicore.unibas.ch)... 131.152.229.50
Connecting to git.scicore.unibas.ch (git.scicore.unibas.ch)|131.152.229.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9119 (8.9K) [text/plain]
Saving to: ‘openfold/resources/stereo_chemical_props.txt’

stereo_chemical_props.txt                     100%[=================================================================================================>]   8.91K  --.-KB/s    in 0.001s  

Last-modified header missing -- time-stamps turned off.
2024-10-14 16:41:10 (7.15 MB/s) - ‘openfold/resources/stereo_chemical_props.txt’ saved [9119/9119]

running install
/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/setuptools/command/easy_install.py:156: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing openfold.egg-info/PKG-INFO
writing dependency_links to openfold.egg-info/dependency_links.txt
writing top-level names to openfold.egg-info/top_level.txt
reading manifest file 'openfold.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'openfold.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
copying openfold/resources/stereo_chemical_props.txt -> build/lib.linux-x86_64-3.10/openfold/resources
running build_ext
/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no g++ version bounds defined for CUDA version 12.1
  warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
building 'attn_core_inplace_cuda' extension
Emitting ninja build file /media/Data/binaries/github/openfold-pl_upgrades/build/temp.linux-x86_64-3.10/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] /usr/bin/nvcc  -I/media/Data/binaries/github/openfold-pl_upgrades/openfold/utils/kernel/csrc/ -I/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/include -I/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/include/TH -I/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/include/THC -I/media/Data/binaries/miniconda3/envs/openfold/include/python3.10 -c -c /media/Data/binaries/github/openfold-pl_upgrades/openfold/utils/kernel/csrc/softmax_cuda_kernel.cu -o /media/Data/binaries/github/openfold-pl_upgrades/build/temp.linux-x86_64-3.10/openfold/utils/kernel/csrc/softmax_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -std=c++17 -maxrregcount=50 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=attn_core_inplace_cuda -D_GLIBCXX_USE_CXX11_ABI=0
FAILED: /media/Data/binaries/github/openfold-pl_upgrades/build/temp.linux-x86_64-3.10/openfold/utils/kernel/csrc/softmax_cuda_kernel.o 
/usr/bin/nvcc  -I/media/Data/binaries/github/openfold-pl_upgrades/openfold/utils/kernel/csrc/ -I/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/include -I/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/include/TH -I/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/include/THC -I/media/Data/binaries/miniconda3/envs/openfold/include/python3.10 -c -c /media/Data/binaries/github/openfold-pl_upgrades/openfold/utils/kernel/csrc/softmax_cuda_kernel.cu -o /media/Data/binaries/github/openfold-pl_upgrades/build/temp.linux-x86_64-3.10/openfold/utils/kernel/csrc/softmax_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -std=c++17 -maxrregcount=50 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=attn_core_inplace_cuda -D_GLIBCXX_USE_CXX11_ABI=0
/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h: In function ‘typename pybind11::detail::type_caster<typename pybind11::detail::intrinsic_type<T>::type>::cast_op_type<T> pybind11::detail::cast_op(make_caster<T>&)’:
/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:45:120: error: expected template-name before ‘<’ token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                        ^
/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:45:120: error: expected identifier before ‘<’ token
/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:45:123: error: expected primary-expression before ‘>’ token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                           ^
/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:45:126: error: expected primary-expression before ‘)’ token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                              ^
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
    subprocess.run(
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/Data/binaries/github/openfold-pl_upgrades/setup.py", line 113, in <module>
    setup(
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/setuptools/__init__.py", line 153, in setup
    return distutils.core.setup(**attrs)
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/setuptools/command/install.py", line 74, in run
    self.do_egg_install()
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/setuptools/command/install.py", line 116, in do_egg_install
    self.run_command('bdist_egg')
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/setuptools/command/bdist_egg.py", line 164, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/setuptools/command/bdist_egg.py", line 150, in call_command
    self.run_command(cmdname)
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/distutils/command/install_lib.py", line 107, in build
    self.run_command('build_ext')
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 79, in run
    _build_ext.run(self)
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 873, in build_extensions
    build_ext.build_extensions(self)
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/distutils/command/build_ext.py", line 449, in build_extensions
    self._build_extensions_serial()
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/distutils/command/build_ext.py", line 474, in _build_extensions_serial
    self.build_extension(ext)
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
    _build_ext.build_extension(self, ext)
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/distutils/command/build_ext.py", line 529, in build_extension
    objects = self.compiler.compile(sources,
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 686, in unix_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1774, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "/media/Data/binaries/miniconda3/envs/openfold/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
Download CUTLASS, required for Deepspeed Evoformer attention kernel
Cloning into 'cutlass'...
remote: Enumerating objects: 6103, done.
remote: Counting objects: 100% (6103/6103), done.
remote: Compressing objects: 100% (1797/1797), done.
remote: Total 6103 (delta 3528), reused 4982 (delta 3018), pack-reused 0 (from 0)
Receiving objects: 100% (6103/6103), 27.71 MiB | 4.72 MiB/s, done.
Resolving deltas: 100% (3528/3528), done.
To make your changes take effect please reactivate your environment
To make your changes take effect please reactivate your environment

This is where I am stuck: I don't really know what to do with the "Error compiling objects for extension". I have already looked at #403, #462, and #477 and have done my best to implement their suggestions, but I obviously do not have a fully working environment.

vaclavhanzl commented 1 month ago

@blakemertz Are you sure you are using your OS's gcc? Could you please activate your environment and try the commands which gcc and gcc -v? If the version happens to be 13.3, could you please try mamba install gcc=12.4? This fixed it for me.
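
The same check can be scripted for several tools at once. Note that the build log above invokes /usr/bin/nvcc, so it may be worth checking nvcc as well as gcc. This is a minimal sketch: tool_origin is a hypothetical helper, and it assumes the active environment is identified by the CONDA_PREFIX variable that conda sets on activation.

```python
import os
import shutil


def tool_origin(tool, env_prefix=None, path=None):
    """Return (resolved_path, inside_env) for an executable name."""
    env_prefix = env_prefix or os.environ.get("CONDA_PREFIX")
    resolved = shutil.which(tool, path=path)
    if resolved is None:
        return None, False
    if not env_prefix:
        return resolved, False
    # Follow symlinks on both sides before comparing directory prefixes.
    real = os.path.realpath(resolved)
    prefix = os.path.realpath(env_prefix)
    inside = os.path.commonpath([real, prefix]) == prefix
    return resolved, inside


for name in ("gcc", "g++", "nvcc"):
    location, inside = tool_origin(name)
    print(f"{name}: {location} (inside environment: {inside})")
```

A tool reported outside the environment is the one the OS provides, so its version is governed by the distribution packages rather than by environment.yml.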

blakemertz commented 1 month ago

@vaclavhanzl thanks for responding. My OS gcc is v12: I deleted the existing symlink to gcc 14 and recreated it to point at gcc 12, then checked with gcc -v both on my OS and in my openfold environment. I will double-check and also try installing gcc=12.4 with mamba, and let you know if that fixes the issue.

vaclavhanzl commented 1 month ago

@blakemertz Please try this environment from my PR #496

blakemertz commented 1 month ago

@vaclavhanzl thanks for sharing. I noticed you are using your own cuda tools (not included in environment.yml). Are you installing from your Debian repositories or pulling them from the nvidia channel in conda?

Update: never mind, I saw that it pulled in cudatoolkit (v 11.8) when I created the environment.

blakemertz commented 1 month ago

@vaclavhanzl thanks again for all your help. My guess is that the dependencies between gcc, numpy < 2, and PyTorch with CUDA 12 were breaking my original environment. This was a time-consuming task on your part, and it is much appreciated.

When I ran the unit tests after setting up the environment, 8 tests failed. I had to modify two of the Python scripts in the test directory as per #467 to reduce the number of failed tests to one:

./scripts/run_unit_tests.sh 
[2024-10-22 21:41:10,915] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
s.................Using /home/centos/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/centos/.cache/torch_extensions/py310_cu121/evoformer_attn/build.ninja...
Building extension module evoformer_attn...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module evoformer_attn...
Time to load evoformer_attn op: 0.2760050296783447 seconds
............s...s.sss.ss.E...sssssssss.sss....ssssss..s.s.s.ss.s......s.s..ss...ss.s.s....s........
======================================================================
ERROR: test_import_jax_weights_ (tests.test_import_weights.TestImportWeights)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/shared/binaries/github/openfold/tests/test_import_weights.py", line 37, in test_import_jax_weights_
    import_jax_weights_(
  File "/shared/binaries/github/openfold/openfold/utils/import_weights.py", line 650, in import_jax_weights_
    data = np.load(npz_path)
  File "/shared/miniconda3/envs/openfold/lib/python3.10/site-packages/numpy/lib/npyio.py", line 427, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/shared/binaries/github/openfold/tests/openfold/resources/params/params_model_1_ptm.npz'

----------------------------------------------------------------------
Ran 117 tests in 56.967s

FAILED (errors=1, skipped=41)

Test(s) failed. Make sure you've installed all Python dependencies.

I suppose one could explicitly point to the params_model_1_ptm.npz file by passing the --jax_param_path flag, but I am not sure of the exact syntax for that. I will consider this closed for now. I hope your pull request gets merged back into the pl_upgrades branch, because I am sure there are plenty of users running CUDA 12 and PyTorch 2 right now.
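
Before rerunning the tests, it may help to see which of a few candidate directories actually holds params_model_1_ptm.npz. The sketch below is hypothetical: find_params is not part of openfold, and of the two candidate directories only tests/openfold/resources/params comes from the traceback above. The second one is an assumption.

```python
from pathlib import Path


def find_params(filename, candidate_dirs):
    """Return the first existing candidate path for filename, or None."""
    for directory in candidate_dirs:
        path = Path(directory) / filename
        if path.is_file():
            return path
    return None


# Paths are relative to the repository root. The second entry is a guess.
candidates = [
    "tests/openfold/resources/params",
    "openfold/resources/params",
]
print(find_params("params_model_1_ptm.npz", candidates))
```

If this prints None, the weights have not been downloaded to either location, which would explain the FileNotFoundError independently of any flag syntax.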

vaclavhanzl commented 1 month ago

@blakemertz Thanks for all the tests. To answer your question (sorry, it was too late at night here when I saw it): as you already noticed, most things come from environment.yml. My latest PR #496 further limits what is used from the OS distribution; I guess it is now just the kernel module. For others coming here via searches, I'll document things in more detail. To get the kernel module, I did this on my Debian testing:

apt-get install nvidia-cuda-dev nvidia-cuda-toolkit linux-image-amd64 linux-headers-amd64

while having this in /etc/apt/sources.list:

deb http://deb.debian.org/debian/ testing main contrib non-free non-free-firmware
deb-src http://deb.debian.org/debian/ testing main contrib non-free non-free-firmware
deb http://security.debian.org/debian-security testing-security main contrib non-free non-free-firmware
deb-src http://security.debian.org/debian-security testing-security main contrib non-free non-free-firmware
deb http://deb.debian.org/debian/ testing-updates main contrib non-free non-free-firmware
deb-src http://deb.debian.org/debian/ testing-updates main contrib non-free non-free-firmware

Note that I explicitly avoided anything from the Nvidia website (I appreciate their nice efforts but using just the Debian repos is much simpler).

Even my apt-get setup is probably still overkill, installing things that will not be used. All you need at the OS level is a working nvidia-smi:

hanzl@blackbox:~$ nvidia-smi 
Wed Oct 23 09:42:53 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
hanzl@blackbox:~$ cat /proc/version
Linux version 6.11.2-amd64 (debian-kernel@lists.debian.org) (x86_64-linux-gnu-gcc-14 (Debian 14.2.0-6) 14.2.0, GNU ld (GNU Binutils for Debian) 2.43.1) #1 SMP PREEMPT_DYNAMIC Debian 6.11.2-1 (2024-10-05)
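If you want to check the nvidia-smi header from a script, the two version numbers can be pulled out with a regular expression. A minimal sketch; `parse_nvidia_smi_header` is a hypothetical helper name, and the regex assumes the header layout shown above (other driver versions may format it differently):

```python
import re


def parse_nvidia_smi_header(text):
    """Extract driver and CUDA versions from the nvidia-smi header line."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text
    )
    if match is None:
        return None  # header not recognised
    return {"driver": match.group(1), "cuda": match.group(2)}


# Header line copied from the output above.
header = (
    "| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06"
    "   CUDA Version: 12.2     |"
)
print(parse_nvidia_smi_header(header))
# {'driver': '535.183.06', 'cuda': '12.2'}
```

As far as I know, the "CUDA Version" printed here is the highest CUDA version the driver supports, not the version of any installed toolkit, so it says nothing about which nvcc you have.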

Using the environment with #496 applied, I get these versions:

(test_env5) hanzl@blackbox:~$ which nvcc
/home/hanzl/miniforge3/envs/test_env5/bin/nvcc
(test_env5) hanzl@blackbox:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
(test_env5) hanzl@blackbox:~$ which gcc
/home/hanzl/miniforge3/envs/test_env5/bin/gcc
(test_env5) hanzl@blackbox:~$ gcc --version
gcc (conda-forge gcc 12.4.0-0) 12.4.0
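The point of the `which` calls above is that both nvcc and gcc come from the conda environment and not from the OS. The same check can be scripted; this is only a sketch, and `is_inside` and `nvcc_release` are hypothetical helper names:

```python
import os
import re


def is_inside(path, prefix):
    """True if `path` lies under the directory `prefix`."""
    path = os.path.abspath(path)
    prefix = os.path.abspath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


def nvcc_release(version_output):
    """Pull the 'release X.Y' number out of `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", version_output)
    return match.group(1) if match else None


# Values copied from the session above.
env_prefix = "/home/hanzl/miniforge3/envs/test_env5"
print(is_inside("/home/hanzl/miniforge3/envs/test_env5/bin/nvcc", env_prefix))  # True
print(is_inside("/usr/bin/nvcc", env_prefix))  # False
print(nvcc_release("Cuda compilation tools, release 12.4, V12.4.131"))  # 12.4
```

In real use you would feed it `shutil.which("nvcc")` and the `CONDA_PREFIX` environment variable (set by conda when an environment is activated) instead of the hard-coded strings.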

I did many desperate things in the past while trying to install OpenFold (all my other posts here are likely obsoleted by this one). If you are reading this, you have likely had your share of this pain, too. I learned that, apart from installing what works, it is even more important to uninstall whatever you installed earlier while searching for a way. Seriously, if a clean OS install is possible for you, it is a good start. Your previous experiments have likely left behind a minefield that makes debugging OpenFold's own problems extremely hard. You may try some cleanups I did in the past:

If your monitor is NOT plugged into your GPU (and you use the GPU just for CUDA), you may do things as drastic as:

apt-get remove 'nvidia-*' 'libnvidia-*'

etc., until dpkg -l|grep nvidia returns nothing. Maybe do something similar for packages with 'cuda' in the name.
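To see what is still left after such a cleanup, the output of dpkg -l can be filtered in one go for both keywords. A rough sketch only: `leftover_packages` is a hypothetical helper, the sample rows are invented for illustration, and unlike grep it looks at the package name only, not the description:

```python
def leftover_packages(dpkg_output, keywords=("nvidia", "cuda")):
    """Return (status, name) pairs from `dpkg -l` text whose name has a keyword."""
    found = []
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Package rows look like "<status> <name> <version> <arch> <description>";
        # the status is a short letter code, which header lines never start with.
        if len(fields) < 3 or len(fields[0]) > 3 or not fields[0].isalpha():
            continue
        status, name = fields[0], fields[1]
        if any(keyword in name for keyword in keywords):
            found.append((status, name))
    return found


# Invented sample rows, only to show the expected shape of the input.
SAMPLE = """\
||/ Name           Version      Architecture Description
+++-==============-============-============-===========================
ii  bash           5.2.32-1     amd64        GNU Bourne Again SHell
ii  nvidia-smi     535.183.06-1 amd64        NVIDIA System Management Interface
rc  libcuda1:amd64 535.183.06-1 amd64        NVIDIA CUDA Driver Library
"""
for status, name in leftover_packages(SAMPLE):
    print(status, name)
```

On a real system you would pass in the text from `subprocess.run(["dpkg", "-l"], capture_output=True, text=True).stdout`. Rows with status rc are packages that were removed but whose config files remain; they still show up in the grep until they are purged.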

Equally important is to clean up anything Python-related. If you experimented with various ways of making Python virtual environments, you can have nasty landmines waiting in some very obscure places, triggered only for certain versions of Python. They can easily spoil the search for a good Python version in a good environment for OpenFold. Check the directories along Python's library import path, sys.path; maybe part of some old torch is there. My ghost was hidden in /home/hanzl/.local/lib/python3.9/site-packages.
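Hunting for such a ghost can be automated by walking sys.path and listing every directory that holds its own copy of a package. A minimal sketch; `find_package_copies` is a hypothetical helper name, and it only detects plain package directories or single-file modules:

```python
import sys
from pathlib import Path


def find_package_copies(name, search_paths=None):
    """List each directory on the import path that holds a copy of `name`."""
    if search_paths is None:
        search_paths = sys.path
    hits = []
    for entry in search_paths:
        # An empty entry on sys.path means the current working directory.
        base = Path(entry or ".")
        if not base.is_dir():
            continue  # zip files and stale entries are skipped
        has_copy = (base / name).is_dir() or (base / f"{name}.py").is_file()
        if has_copy and str(base) not in hits:
            hits.append(str(base))
    return hits


# Run this with the interpreter of the environment you want to check.
for location in find_package_copies("torch"):
    print(location)
```

More than one line of output, or a line pointing outside the active environment (such as a ~/.local/lib/pythonX.Y/site-packages directory), is worth a closer look. Setting the `PYTHONNOUSERSITE` environment variable or running `python -s` keeps the user site-packages directory off sys.path, which is a quick way to test whether that directory is the culprit.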

vaclavhanzl commented 1 month ago

@blakemertz And as for this issue (#494), I guess it should stay open until PR #496 (or something similar) is merged?

vaclavhanzl commented 3 weeks ago

PR #496 is now merged, so I think this issue could be closed (please, @blakemertz: it looks like I cannot do that but you could, thanks).