aqlaboratory / openfold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Apache License 2.0

Build openfold with newer pytorch + cuda #403

Open chunhui-shi opened 7 months ago

chunhui-shi commented 7 months ago

Right now openfold asks for an old pytorch + CUDA (11.2), so the latest Linux releases are not able to build openfold.

I would like to upgrade the supported pytorch + CUDA and the other Python packages accordingly, so people can use newer platforms (OS, etc.).

vaclavhanzl commented 6 months ago

Indeed, preparing the environment on a newer platform is far from easy. I just did it this way on my Debian testing (rolling) setup:

mamba create -n of
mamba activate of
mamba install -c pytorch -c nvidia -c conda-forge -c bioconda pytorch pytorch-cuda=12.1 python=3.10 packaging ninja hhsuite kalign2 openmm pdbfixer biopython pytorch-lightning PyYAML tqdm wandb awscli aria2 hmmer deepspeed dm-tree py3Dmol modelcif
pip install flash-attention ml_collections git+https://github.com/NVIDIA/dllogger.git
git clone git@github.com:aqlaboratory/openfold.git
cd openfold
sed -i -e 's/-std=c++14/-std=c++17/' setup.py 
scripts/install_third_party_dependencies.sh
mamba deactivate
mamba activate of
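
A quick sanity check of the resulting environment (optional, just what I would run to confirm that the GPU build of pytorch got picked up):

python -c 'import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())'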

While iterating towards these lines, I encountered a few pitfalls:

And the preceding CUDA setup has pitfalls as well:

I find the CUDA setup via the Debian repos easier than via Nvidia (in fact, at this moment the critical bug 4336331 in the Nvidia driver is only fixed in Debian). I add these sources:

deb http://deb.debian.org/debian/ bookworm-updates main non-free contrib non-free-firmware
deb-src http://deb.debian.org/debian/ bookworm-updates main non-free contrib non-free-firmware

and install:

apt install nvidia-cuda-dev nvidia-cuda-toolkit
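
To check beforehand which versions the Debian repos will give you, standard apt querying works, e.g.:

apt policy nvidia-cuda-toolkit nvidia-cuda-dev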

I got these package versions:

pytorch              2.2.0    py3.10_cuda12.1_cudnn8.9.2_0    pytorch
pytorch-cuda         12.1              ha16c6d3_5    pytorch
python               3.10.13   hd12c33a_1_cpython    conda-forge
packaging            23.2            pyhd8ed1ab_0    conda-forge
ninja                1.11.1            h924138e_0    conda-forge
hhsuite              3.3.0  py310pl5321h068649b_9    bioconda
kalign2              2.04              h031d066_5    bioconda
openmm               8.1.1        py310h358ce72_1    conda-forge
pdbfixer             1.9             pyh1a96a4e_0    conda-forge
biopython            1.83         py310h2372a71_0    conda-forge
pytorch-lightning    2.1.3           pyhd8ed1ab_0    conda-forge
pyyaml               6.0.1        py310h2372a71_1    conda-forge
tqdm                 4.66.2          pyhd8ed1ab_0    conda-forge
wandb                0.16.3          pyhd8ed1ab_0    conda-forge
awscli               2.15.21      py310hff52083_0    conda-forge
aria2                1.37.0            h347180d_1    conda-forge
hmmer                3.4               hdbdd923_0    bioconda
deepspeed            0.13.1   cpu_py310h11dbdba_0    conda-forge
dm-tree              0.1.8        py310h620c231_2    conda-forge
py3dmol              2.0.4           pyhd8ed1ab_0    conda-forge
modelcif             0.9             pyhd8ed1ab_0    conda-forge
flash-attention      1.0.0                 pypi_0    pypi
ml-collections       0.1.1                 pypi_0    pypi
dllogger             1.0.0                 pypi_0    pypi

and outside mamba environment, I have:

$ gcc --version
gcc (Debian 10.3.0-15) 10.3.0
Copyright (C) 2020 Free Software Foundation, Inc.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

$ nvidia-smi 
NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0

$ cat /proc/version
Linux version 6.6.15-amd64 (debian-kernel@lists.debian.org) (gcc-13 (Debian 13.2.0-13) 13.2.0, GNU ld (GNU Binutils for Debian) 2.42) #1 SMP PREEMPT_DYNAMIC Debian 6.6.15-2 (2024-02-04)

It is possible that OpenFold will need some small tweaks here and there with this setup, but I hope this helps a little bit...

vaclavhanzl commented 6 months ago

Meanwhile, PR #407 just landed in the codebase (thanks @jnwei and everybody involved!) and it is supposed to tackle these issues. While it certainly moves the code forward (and likely contains some of the needed "tweaks here and there" I mentioned above), it still does not allow me to install an environment just by following the install instructions in the README. When I do:

mamba env create -n openfold_env -f environment.yml

it tries to install older packages than the PR #407 description suggests:

Looking for: ['python=3.9', 'libgcc=7.2', 'setuptools=59.5.0', 'pip', 'openmm=7.7', 'pdbfixer', 'cudatoolkit=11.3', 'pytorch-lightning==1.5.10', 'biopython==1.79', 'numpy==1.21', 'pandas==2.0', 'pyyaml==5.4.1', 'requests', 'scipy==1.7', 'tqdm==4.62.2', 'typing-extensions==3.10', 'wandb==0.12.21', 'modelcif==0.7', 'awscli', 'ml-collections', 'aria2', 'git', 'bioconda::hmmer==3.3.2', 'bioconda::hhsuite==3.3.0', 'bioconda::kalign2==2.04', 'pytorch::pytorch=1.12']

and then fails with:

      RuntimeError:
      The detected CUDA version (12.0) mismatches the version that was used to compile
      PyTorch (11.3). Please make sure to use the same CUDA versions.

In the current environment.yml, the pin pytorch-lightning==1.5.10 looks suspicious and might pull in an older PyTorch (?); in my experiments above I got pytorch-lightning=2.1.3. (Or maybe python=3.9 is the problem? python=3.10 certainly worked better for me and might influence the available PyTorch versions.)

Also, the C++14/17 patch (which I did above via sed) would likely be needed for the compilation step in scripts/install_third_party_dependencies.sh to succeed.
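
A quick (purely hypothetical) way to test the pytorch-lightning hypothesis would be to loosen just that one pin in a copy of environment.yml and retry the solve, e.g.:

sed -i -e 's/pytorch-lightning==1.5.10/pytorch-lightning>=2.1/' environment.yml
mamba env create -n openfold_env_test -f environment.yml

(I have not verified that this solves; it is only meant to isolate which pin drags in the old pytorch.)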

vaclavhanzl commented 6 months ago

@abeebyekeen I see that you also devoted considerable effort to setting up environment.yml. It would be nice to hear how your current setup works for you. (I am particularly interested in the effects of the 'cuda' conda package - maybe it allows an even more minimalist CUDA setup in the operating system? Just the kernel driver?) Or how it compares with my setup (2nd post in this thread), if you have any incentive to try.
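
For reference, the kind of minimalist setup I have in mind would be something like the following (assuming the 'cuda' metapackage from the nvidia channel; untested on my side):

mamba install -c nvidia cuda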

abeebyekeen commented 6 months ago

Hi @vaclavhanzl. Yes, I spent a good part of last weekend trying to set up a tool that requires openfold as a dependency. I was initially unable to build openfold due to a number of problems, including the -std=c++14 flag, gcc/g++ issues, and the CUDA version mismatch errors you mentioned. Note that I also had other CUDA errors thrown by pytorch. So I created a fork to try and figure out where each error was coming from (especially when building openfold).

Here is what I've got, and the selections that eventually worked for me in solving all the problems:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
$ gcc --version
gcc (GCC) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.

For the environment I needed, here is how I set it up:

mamba create -npl python==3.9 pip
mamba activate npl
mamba install cudatoolkit==11.8.*
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --no-cache-dir
python -m pip install torch-scatter==2.1.2 -f https://data.pyg.org/whl/torch-2.2.0+118.html --no-cache-dir
python -m pip install "git+https://github.com/facebookresearch/pytorch3d.git" --force-reinstall --no-deps --no-cache-dir

To get openfold to build, I was able to use the environment.yml in the openfold repo without changes. However, I had to use a gcc version between 5.x and 11.x, and also set the build flag in setup.py to -std=c++17, just like @vaclavhanzl did:

git clone https://github.com/aqlaboratory/openfold.git
cd openfold
sed -i 's/std=c++14/std=c++17/g' setup.py
python -m pip install .

python -m pip install -r other_requirements.txt

And that works perfectly.
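
If useful, a quick smoke test of the finished install could be something like this (just a sketch; it only checks that the main model class imports cleanly):

python -c 'from openfold.model.model import AlphaFold; print("openfold import OK")'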

vaclavhanzl commented 6 months ago

Thanks a lot @abeebyekeen for sharing this with us in a very clear way! I'd very much like to see OpenFold working out of the box for most new users, especially now that there are great new, publicized features. I still do not know what a PR in that direction should look like; addressing the different setups people might have is hard. I was thinking about having version ranges in environment.yml, but that is not an easy way either. I guess @jnwei might have some plan here, and I hope we are helping at least a little bit to move in that direction.

jnwei commented 6 months ago

Hi all,

Thanks for all the interest and for sharing notes! The environments I wanted to support at this time were:

I just checked the pl_upgrades branch on two systems I have access to with pre-installed CUDA 12, and found that it was working for me. Let me know if folks have issues with these environments.
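
For reference, this was roughly the kind of check I mean (using the commands from the README and from earlier in this thread; exact steps may differ on your system):

git clone https://github.com/aqlaboratory/openfold.git
cd openfold
git checkout pl_upgrades
mamba env create -n openfold_env -f environment.yml
mamba activate openfold_env
scripts/install_third_party_dependencies.sh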

My understanding was that having an environment with CUDA 11.x + pytorch 2.x is complicated, as the default pytorch 2 packages are built against CUDA 12 (leading to the CUDA mismatch error @vaclavhanzl saw). It looks like @abeebyekeen was able to find a workaround with a lot of elbow grease - thanks for sharing your fix!

I plan on cleaning up the documentation for this project and when I do, I'll add a page regarding the supported environments.

vaclavhanzl commented 6 months ago

Thanks @jnwei! I am happy to report that the pl_upgrades branch works flawlessly with my CUDA 12 setup, including the compilations in install_third_party_dependencies.sh (the end of the 2nd post in this thread describes my exact environment outside conda).

(And please excuse, @jnwei, my rather misguided comments on PR #407 - I totally overlooked that it was merged into pl_upgrades, not into main.)

wenyan4work commented 6 months ago

Quick question about future plans: is pl_upgrades going to be merged into master?

lm-jkominek commented 6 months ago

Hi there, just wanted to follow up on this and ask whether there are any plans/timelines for merging pl_upgrades into main to get CUDA 12 support into openfold. Many thanks in advance! @vaclavhanzl @jnwei

jnwei commented 6 months ago

Hi, thanks for the interest. We're actively working on finalizing the changes from pl_upgrades for main. I'd expect ~3 weeks.

lm-jkominek commented 6 months ago

Thank you @jnwei , appreciate the update!

jnwei commented 4 months ago

A quick note on the pytorch 2 / CUDA 12 upgrade:

We've run into some technical issues with the pytorch 2 upgrade. Briefly, we observe large instabilities in our training losses in the pytorch 2 version relative to our pytorch 1 version.

For inference, we're also observing a slight difference between model outputs in pytorch 1 and pytorch 2. The difference in final output coordinates is about RMSD ~0.05 Å for the proteins I've looked at. While these differences might seem small, they may point to a larger issue that is also occurring in training; we're currently looking into it.

Until we find the root cause of the discrepancy, or a way around the training instability, we're not ready to update the main branch to pytorch 2.

Meanwhile, we will upgrade the main branch to use pytorch lightning 2, which has a few features that the team has found useful. I'll also push some changes to pl_upgrades that integrate some of the changes from the main branch and clean up the conda environment / Docker setup for CUDA 12 / pytorch 2.

We are actively working on debugging the instability, and we'll keep you posted as soon as we are ready to upgrade. Thank you all for your interest and your patience.

Dhruv-reviv commented 2 months ago

Hi @jnwei @vaclavhanzl @abeebyekeen, I am trying to use pytorch 1 for the openfold installation. I have the following dependencies:

cudatoolkit          11.6.2        hfc3e2af_13                    conda-forge
debugpy              1.8.2         py310h76e45a6_0                conda-forge
decorator            5.1.1         pyhd8ed1ab_0                   conda-forge
deepspeed            0.12.4        pypi_0                         pypi
dllogger             1.0.0         pypi_0                         pypi
dm-tree              0.1.6         pypi_0                         pypi
flash-attn           2.5.9.post1   pypi_0                         pypi
hhsuite              3.3.0         py310pl5321hc31ed2c_11         bioconda
hjson                3.1.0         pypi_0                         pypi
hmmer                3.3.2         hdbdd923_4                     bioconda
kalign2              2.04          h031d066_6                     bioconda
mkl                  2022.1.0      h84fe81f_915                   https://aws-ml-conda.s3.us-west-2.amazonaws.com
mkl-devel            2022.1.0      ha770c72_916                   conda-forge
mkl-include          2022.1.0      h84fe81f_915                   conda-forge
ml-collections       0.1.1         pyhd8ed1ab_0                   conda-forge
mmseqs2              15.6f452      pl5321h6a68c12_2               bioconda
modelcif             0.7           pyhd8ed1ab_0                   conda-forge
numpy                1.26.4        py310hb13e2d6_0                conda-forge
python               3.10.14       hd12c33a_0_cpython             conda-forge
python-dateutil      2.9.0         pyhd8ed1ab_0                   conda-forge
python-tzdata        2024.1        pyhd8ed1ab_0                   conda-forge
python_abi           3.10          4_cp310                        conda-forge
pytorch              1.12.1        py3.10_cuda11.6_cudnn8.3.2_0   pytorch
pytorch-lightning    2.1.4         pyhd8ed1ab_0                   conda-forge

While executing the inference script for a single sequence, with the following imports:

from openfold.config import model_config
from openfold.data import templates, feature_pipeline, data_pipeline
from openfold.data.tools import hhsearch, hmmsearch
from openfold.np import protein
from openfold.utils.script_utils import (load_models_from_command_line, parse_fasta, run_model,
                                         prep_output, relax_protein)
from openfold.utils.tensor_utils import tensor_tree_map
from openfold.utils.trace_utils import (
    pad_feature_dict_seq,
    trace_model_,
)

I am getting the following error:

----> 7 from openfold.utils.script_utils import (load_models_from_command_line, parse_fasta, run_model,
      8                                          prep_output, relax_protein)
      9 from openfold.utils.tensor_utils import tensor_tree_map
     10 from openfold.utils.trace_utils import (
     11     pad_feature_dict_seq,
     12     trace_model_,
     13 )

File ~/openfold/openfold/utils/script_utils.py:10
      7 import numpy
      8 import torch
---> 10 from openfold.model.model import AlphaFold
     11 from openfold.np import residue_constants, protein
     12 from openfold.np.relax import relax

File ~/openfold/openfold/model/model.py:29
     22 from openfold.utils.feats import (
     23     pseudo_beta_fn,
     24     build_extra_msa_feat,
     25     dgram_from_positions,
     26     atom14_to_atom37,
     27 )
     28 from openfold.utils.tensor_utils import masked_mean
---> 29 from openfold.model.embedders import (
     30     InputEmbedder,
     31     InputEmbedderMultimer,
     32     RecyclingEmbedder,
     33     TemplateEmbedder,
     34     TemplateEmbedderMultimer,
     35     ExtraMSAEmbedder,
     36     PreembeddingEmbedder,
     37 )
     38 from openfold.model.evoformer import EvoformerStack, ExtraMSAStack
     39 from openfold.model.heads import AuxiliaryHeads

File ~/openfold/openfold/model/embedders.py:29
     22 from openfold.utils import all_atom_multimer
     23 from openfold.utils.feats import (
     24     pseudo_beta_fn,
     25     dgram_from_positions,
     26     build_template_angle_feat,
     27     build_template_pair_feat,
     28 )
---> 29 from openfold.model.primitives import Linear, LayerNorm
     30 from openfold.model.template import (
     31     TemplatePairStack,
     32     TemplatePointwiseAttention,
     33 )
     34 from openfold.utils import geometry

File ~/openfold/openfold/model/primitives.py:30
     28 fa_is_installed = importlib.util.find_spec("flash_attn") is not None
     29 if fa_is_installed:
---> 30     from flash_attn.bert_padding import unpad_input
     31     from flash_attn.flash_attn_interface import flash_attn_unpadded_kvpacked_func
     33 import torch

File /opt/conda/envs/openf/lib/python3.10/site-packages/flash_attn/__init__.py:3
      1 __version__ = "2.5.9.post1"
----> 3 from flash_attn.flash_attn_interface import (
      4     flash_attn_func,
      5     flash_attn_kvpacked_func,
      6     flash_attn_qkvpacked_func,
      7     flash_attn_varlen_func,
      8     flash_attn_varlen_kvpacked_func,
      9     flash_attn_varlen_qkvpacked_func,
     10     flash_attn_with_kvcache,
     11 )

File /opt/conda/envs/openf/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py:10
      6 import torch.nn as nn
      8 # isort: off
      9 # We need to import the CUDA kernels after importing torch
---> 10 import flash_attn_2_cuda as flash_attn_cuda
     12 # isort: on
     15 def _get_block_size_n(device, head_dim, is_dropout, is_causal):
     16     # This should match the block sizes in the CUDA kernel

ImportError: /opt/conda/envs/openf/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl8GPUTrace13gpuTraceStateE

Any help would be appreciated!
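
For completeness, the exact torch / flash-attn pairing in the environment can be double-checked with standard commands, e.g.:

python -c 'import torch; print(torch.__version__, torch.version.cuda)'
pip show flash-attn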

RJ3 commented 3 weeks ago

> I just checked the pl_upgrades branch on two systems I have access to with pre-installed CUDA 12, and found that it was working for me. Let me know if folks have issues with these environments.

It doesn't appear to be working now; see issue #477.

I also recommend cuda-nvcc=12.4.131 to get closer to a complete environment; installing it this way is recommended over installing it with system packages.
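
For example (assuming the nvidia conda channel):

mamba install -c nvidia cuda-nvcc=12.4.131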