aqlaboratory / openfold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Apache License 2.0

Cuda/Pytorch/Installation Issues #172

Closed Cweb118 closed 2 years ago

Cweb118 commented 2 years ago

Hello! So I have been struggling with a strange issue that I hope you or someone would be able to help me with. Let me start by providing some information:

So I am not sure if this is a problem with how I am attempting to install openfold, or if something else is going on. Essentially, after cloning the repo, the first thing I do is run scripts/install_third_party_dependencies.sh. This creates an environment called openfold_venv, but that environment does not seem to contain many of the required packages (e.g. torch is absent). Following this with scripts/activate_environment.sh seems to fail. I have also tried using conda env create -f environment.yml, which sets up an environment in a different location. Either way, after setting up the environment I end up with one of the following issues, either during python setup.py install or during inference:

I run into these on clean installs with no conda or cudatoolkit installed anywhere else on the machine, so it is rather puzzling. As I said, I am not sure if this is due to performing the install sequence incorrectly, but I have tried several different solutions and they all seem to circle back to one of these errors.

I apologize as I know this is rather vague, but if you can offer any sort of guidance it would be greatly appreciated!

gahdritz commented 2 years ago

Try uninstalling PyTorch from your conda environment and then manually reinstall it using the instructions on the website here: https://pytorch.org/get-started/locally/. LMK if that helps.

Cweb118 commented 2 years ago

Ok so I got a fresh install set up and tried to use the pytorch installation for Cuda 11.3. I received the following error:

`(openfold_venv) cweber@Geiger:~/Desktop/openfold$ python3 setup.py install

running install

/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  setuptools.SetuptoolsDeprecationWarning,

/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/easy_install.py:159: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  EasyInstallDeprecationWarning,

running bdist_egg

running egg_info

writing openfold.egg-info/PKG-INFO

writing dependency_links to openfold.egg-info/dependency_links.txt

writing top-level names to openfold.egg-info/top_level.txt

reading manifest file 'openfold.egg-info/SOURCES.txt'

adding license file 'LICENSE'

writing manifest file 'openfold.egg-info/SOURCES.txt'

installing library code to build/bdist.linux-x86_64/egg

running install_lib

running build_py

running build_ext

Traceback (most recent call last):

  File "setup.py", line 95, in <module>
    'Programming Language :: Python :: 3.7,' 

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
    return distutils.core.setup(**attrs)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/install.py", line 74, in run
    self.do_egg_install()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/install.py", line 116, in do_egg_install
    self.run_command('bdist_egg')

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", line 164, in run
    cmd = self.call_command('install_lib', warn_dir=0)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", line 150, in call_command
    self.run_command(cmdname)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/command/install_lib.py", line 107, in build
    self.run_command('build_ext')

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
    _build_ext.run(self)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 434, in build_extensions
    self._check_cuda_version(compiler_name, compiler_version)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 812, in _check_cuda_version
    raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))

RuntimeError: 
The detected CUDA version (10.1) mismatches the version that was used to compile
PyTorch (11.3). Please make sure to use the same CUDA versions.`

gahdritz commented 2 years ago

Are you sure that your CUDA version isn't actually 10.1?

Cweb118 commented 2 years ago

Unfortunately, yes. Running torch.version.cuda returns 11.3 and there is no evidence of 10.1 on this machine.

gahdritz commented 2 years ago

Not even in your virtual environment anywhere? The error message does confirm that it knows that your torch CUDA is 11.3.
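
One way to check both sides of this mismatch at once is to compare the CUDA version PyTorch was built against with the release reported by whichever nvcc comes first on PATH. The following is a minimal sketch, not part of OpenFold; the helper names `parse_nvcc_release`, `nvcc_release`, and `torch_cuda_version` are made up for illustration, and it assumes nvcc prints its usual "release X.Y" line.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract the 'X.Y' release string from `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None


def nvcc_release():
    """Run the first nvcc found on PATH and return its release, or None if absent."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def torch_cuda_version():
    """Return the CUDA version PyTorch was compiled with, or None if unavailable."""
    try:
        import torch
    except Exception:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    print("nvcc on PATH reports:", nvcc_release())
    print("torch.version.cuda: ", torch_cuda_version())
```

If the two values differ, building a CUDA extension is likely to hit the version check shown in the traceback above, even when no other cudatoolkit package is installed in the environment.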

bing-song commented 2 years ago

I saw the same problem, namely "RuntimeError: CUDA error: no kernel image is available for execution on the device", when I run run_pretrained_openfold.py.

I am using an Azure GPU and followed the installation instructions (Linux). No code has been changed.
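
A "no kernel image" error generally means that some compiled CUDA code was not built for the GPU's architecture. As a rough first check, one can compare the device's compute capability with the architectures that the installed PyTorch binary was compiled for. This is a sketch under stated assumptions: `arch_tag` and `is_arch_compiled` are illustrative helpers, and the check covers only PyTorch's own kernels, not separately compiled extensions, which carry their own architecture list.

```python
def arch_tag(capability):
    """Turn a (major, minor) compute capability into an 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def is_arch_compiled(capability, arch_list):
    """True if the exact architecture tag appears in the compiled arch list."""
    return arch_tag(capability) in arch_list


if __name__ == "__main__":
    try:
        import torch
    except Exception:
        torch = None

    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        arch_list = torch.cuda.get_arch_list()
        print("device capability:", capability)
        print("PyTorch compiled for:", arch_list)
        print("exact match:", is_arch_compiled(capability, arch_list))
    else:
        print("PyTorch with a visible CUDA device is required for this check")
```

The exact-match test is deliberately strict: binary code built for a lower minor version of the same major architecture can usually still run, so a miss here is a hint to look closer rather than proof of a problem.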

Cweb118 commented 2 years ago

> Not even in your virtual environment anywhere? The error message does confirm that it knows that your torch CUDA is 11.3.

Not that I can see anywhere. After purging the machine of all conda stuff to try and avoid this conflict, the only things I have done are run the install_third_party_dependencies.sh and the torch command you shared. Searching through the packages in the lib does not return any results for versions of cudatoolkit other than for 11.3.

lzhangUT commented 2 years ago

@gahdritz, is there any solution to this issue? I have been running into the same problem, and tried reinstalling CUDA and everything else. Following the instructions on GitHub, I was able to run precomputed_alignments_mmseqs.py and got the alignments as expected. Then I tried to run run_pretrained_openfold.py. My script and the error look like this: [image]

It seems to start running model inference, but then runs into this CUDA error. I also tried to conda install pytorch again, with the same result. I checked my CUDA [image] and PyTorch [image], and they seem fine.

I am completely stuck on my projects here; any help will be appreciated.

gahdritz commented 2 years ago

Could you send the output of pip freeze?

Next, could you try downgrading to torch 1.10.1 and re-running python3 setup.py install? We recently upgraded to torch 1.12.0, so that might be the cause (granted, I can't reproduce this even with torch 1.12.0).

bing-song commented 2 years ago

Since this problem affects many people, I would like to share my detailed investigation; hopefully it will help to fix the problem soon.

I tested on two GPUs (Tesla K80 and Tesla M60) with Microsoft Azure Machine Learning Studio. I observed the same problem on both GPUs. Here is the GPU info. gpu_Info_tesla_k80 gpu_info_tesla_m60

I tested both the main and v1.0.0 branches. The issues are different. However, both issues have been reported, on this thread and on #161. On v1.0.0, I observed something similar to #161. On main, I observed what people reported here.

I tested with and without Docker container. The results are the same.

Here is Python and PyTorch information.

For v1.0.0 pytorch_info_v1

For main

pytorch_info_main

The command line input with Docker containers is

cmd_main

The error message for main is err_main

There is no error message for v1.0.0, since both the relaxed and unrelaxed PDB files were produced. However, the PDB file is garbage, as shown in the following image.

bad_pdb

@gahdritz Let me know if you need more information and how I can help to fix this problem.

gahdritz commented 2 years ago

Interesting---this seems to be the first time this is happening on non-Pascal GPUs.

I still can't reproduce this @bing-song, so I'll need some extra help here, if you don't mind. In openfold/utils/kernel/attention_core.py, on the newest version of OF, would you mind printing both attention_logits.device and v.device right before line 53, where it crashes?

Thanks btw for putting this all together!

bing-song commented 2 years ago

@gahdritz Here are the prints that I added around line 53 on the main branch.

Screen Shot 2022-08-02 at 9 03 00 AM

Here is the output (not sure why END is not printed). The device cuda:0 is the correct one.

Screen Shot 2022-08-02 at 9 06 14 AM

gahdritz commented 2 years ago

What happens if you put torch.cuda.synchronize() right before that matmul, below the custom kernel call?

So strange that the kernel executes multiple times without crashing...

bing-song commented 2 years ago

It is the same. Here is the code that includes more prints. I added prints on the matmul on line 38 also. That is fine.

Screen Shot 2022-08-02 at 9 39 41 AM

Screen Shot 2022-08-02 at 9 39 21 AM

bing-song commented 2 years ago

@gahdritz Here is the fasta file for this test.

>7s0c_A_unpacked_A
TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVI
RGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEI
YQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFE

If you have a fasta file that you want me to try on my machine, let me know.

gahdritz commented 2 years ago

I don't think this has to do with any particular input sequence, since I can't reproduce this on my machine. One last thing, if you don't mind: could you try running it with that CUDA flag (CUDA_LAUNCH_BLOCKING) mentioned in the error message set to 1?
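
For anyone trying this from a script rather than the shell, the flag can also be set from Python, as long as that happens before CUDA is initialized. A minimal sketch, assuming the assignment is placed at the very top of the entry-point script:

```python
import os

# Must be set before CUDA is initialized, so do it before importing torch.
# With launch blocking enabled, kernel launches run synchronously and errors
# are reported at the call that caused them instead of at a later call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

try:
    import torch  # imported only after the variable is in place
except Exception:
    torch = None  # the sketch still runs without PyTorch installed

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

The variable name has to be spelled exactly as shown; a misspelled name is silently ignored and the output will look the same as before.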

bing-song commented 2 years ago

After setting CUDA_LAUNCH_BLOCKING=1, I did not see more debug info in the output.

Screen Shot 2022-08-02 at 10 17 54 AM

Here is the output.

Screen Shot 2022-08-02 at 10 24 11 AM

bing-song commented 2 years ago

@gahdritz I am thinking about opening SSH access to my Azure GPU server for you to debug. Do you think this will help?

gahdritz commented 2 years ago

Yeah that would be great actually.

bing-song commented 2 years ago

@gahdritz Can you let me know how to give you the ssh login info?

gahdritz commented 2 years ago

Send it to my Gmail, which is just my GitHub username.

gahdritz commented 2 years ago

I think I resolved this in 6c89015. @lzhangUT, @bing-song, @epenning could you verify that the inference script works on your systems now?

bing-song commented 2 years ago

@gahdritz Just tried. I checked out the openfold main branch, rebuilt the Docker container, and ran inference with the precomputed MSA alignment data. I get exactly the same error.

gahdritz commented 2 years ago

Could you try without docker?

bing-song commented 2 years ago

@gahdritz, it is working well without Docker. The predicted structure looks good compared with the electron microscopy structure.

Screen Shot 2022-08-03 at 1 13 46 PM

gahdritz commented 2 years ago

Excellent. I've since pushed a fix that should work for Docker. Could you give it a try? If that still doesn't work, could you change compute_capability, _ to compute_capability, error on line 55 of setup.py and print error?

bing-song commented 2 years ago

Tried docker. It still does not work. The same error message.

If I change line 55 as follows:

Screen Shot 2022-08-03 at 1 51 15 PM

The setup cannot go through. Here is the error output.

Screen Shot 2022-08-03 at 1 54 25 PM

gahdritz commented 2 years ago

You did the edit slightly wrong---you should replace compute_capability, _ with compute_capability, error and then print error, not replace it with compute_capability.
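
The pattern being described is to keep both values of the returned pair instead of discarding the second one. The thread does not show the actual helper in setup.py, so the sketch below uses a hypothetical stand-in, `detect_compute_capability`, that returns a (capability, error) pair in the same spirit; only the unpacking and the print are the point.

```python
def detect_compute_capability():
    """Hypothetical stand-in for the setup.py helper: returns (capability, error).

    Exactly one of the two values is None.
    """
    try:
        import torch
    except Exception as exc:
        return None, f"torch could not be imported: {exc}"
    if not torch.cuda.is_available():
        return None, "no CUDA-capable device is detected"
    return torch.cuda.get_device_capability(0), None


# Before: compute_capability, _ = detect_compute_capability()
# After: keep the second element and print it so the failure reason is visible.
compute_capability, error = detect_compute_capability()
print("compute capability:", compute_capability)
print("error:", error)
```

Replacing the pair with a single name would instead bind the whole tuple to `compute_capability`, which is why the setup step fails with that version of the edit.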

bing-song commented 2 years ago

Not sure what you mean. Do you want the code like this?

Screen Shot 2022-08-03 at 2 29 59 PM

gahdritz commented 2 years ago

Yes exactly.

bing-song commented 2 years ago

With this change, I got the same error message, namely,

Screen Shot 2022-08-03 at 2 48 27 PM

gahdritz commented 2 years ago

Yes, but did you manage to capture the output of the print(error) line during the Docker build? I think that should tell us what's going wrong on your system.

bing-song commented 2 years ago

Yes, I captured the output. The error is: no CUDA-capable device is detected.

Screen Shot 2022-08-03 at 4 12 57 PM

We know this is not true since we can run GPU and get good results without docker.

gahdritz commented 2 years ago

Right. Hm. The GPU must not be visible at that stage of the container's construction for whatever reason. As a sanity check, could you enter the resulting container, delete openfold/build and re-run python3 setup.py install? I suspect that the model will work as intended after that.

bing-song commented 2 years ago

Yes, that works and makes sense. However, it is not a fix.

As I understand it, the GPU information is not available during the Docker image build. It is only available when a Docker container is created. This is the reason you need docker run --gpus ...

gahdritz commented 2 years ago

Yes this is unfortunate. Maybe the approach I took of dynamically determining the right GPU architectures to compile for fundamentally doesn't work in this case. Is there any alternative to hard-coding in a bunch of additional architectures, slowing down the build for everyone else? Perhaps I could look for a GPU, and, if one is found, remove other architectures from a long, hardcoded list that would be used otherwise. I need to think about this.

gahdritz commented 2 years ago

Ok @bing-song I did the thing in the previous comment. Check out f3814c9. It should now compile kernels for 3.7 and other CC's by default.
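
The idea from the previous two comments can be sketched roughly as follows: compile only for the detected GPU when one is visible, and fall back to a hard-coded list otherwise (for example during a Docker image build). This is an illustration, not the code in f3814c9; the fallback list and the function names are assumptions, and only the `-gencode arch=compute_XY,code=sm_XY` flag format is standard nvcc usage.

```python
# Illustrative fallback list; the list actually used by the project may differ.
FALLBACK_CAPABILITIES = [(3, 7), (5, 2), (6, 0), (6, 1), (7, 0), (7, 5), (8, 0)]


def select_capabilities(detected, fallback=FALLBACK_CAPABILITIES):
    """Use only the detected capability if there is one, else the full fallback."""
    if detected is not None:
        return [detected]
    return list(fallback)


def gencode_flags(capabilities):
    """Build nvcc -gencode flags for each (major, minor) compute capability."""
    flags = []
    for major, minor in capabilities:
        num = f"{major}{minor}"
        flags += ["-gencode", f"arch=compute_{num},code=sm_{num}"]
    return flags


if __name__ == "__main__":
    # No GPU visible (e.g. at image build time): compile for every fallback arch.
    print(gencode_flags(select_capabilities(None)))
    # GPU visible: compile for that architecture only, keeping the build fast.
    print(gencode_flags(select_capabilities((7, 0))))
```

The trade-off is the one raised above: the fallback path makes the build slower, but it only applies when no GPU can be detected.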

lzhangUT commented 2 years ago

> I think I resolved this in 6c89015. @lzhangUT, @bing-song, @epenning could you verify that the inference script works on your systems now?

I changed the VM from a Tesla P40 to a V100, and now inference works fine.

bing-song commented 2 years ago

@gahdritz I confirmed that the installation is working on both Docker and ENV for Azure K80.

jonathanking commented 1 year ago

> > Not even in your virtual environment anywhere? The error message does confirm that it knows that your torch CUDA is 11.3.
>
> Not that I can see anywhere. After purging the machine of all conda stuff to try and avoid this conflict, the only things I have done are run the install_third_party_dependencies.sh and the torch command you shared. Searching through the packages in the lib does not return any results for versions of cudatoolkit other than for 11.3

I just want to follow up on this comment by @Cweb118, since I had the exact same issue (version mismatch) but not the issues brought up by others in this thread. My locally installed nvcc did not match what the torch installed by conda in openfold_venv was expecting. I used conda to install a newer version of nvcc via conda install cudatoolkit-dev -c conda-forge, which got rid of the mismatch error experienced by myself and @Cweb118. The issue is that when PyTorch is installed via conda, nvcc itself is not installed, so a locally or previously installed version of nvcc can cause a version mismatch error.
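
To see whether the nvcc being picked up belongs to the active conda environment or to a system-wide install, a quick check along these lines may help. It is a sketch only: `is_inside` is an illustrative helper, and it assumes the active environment is exposed through the CONDA_PREFIX variable, as conda normally does on activation.

```python
import os
import shutil


def is_inside(path, prefix):
    """True if `path` is located under the directory `prefix`."""
    path = os.path.abspath(path)
    prefix = os.path.abspath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")             # first nvcc on PATH, or None
    env = os.environ.get("CONDA_PREFIX")    # active conda environment, if any
    print("nvcc found at:", nvcc)
    print("active conda env:", env)
    if nvcc and env:
        print("nvcc belongs to the active env:", is_inside(nvcc, env))
```

If nvcc resolves to a path outside the environment, the extension build will use that compiler, regardless of which cudatoolkit version the environment's PyTorch was built with.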