@blakemertz Are you sure you are using your OS's gcc? Could you please activate your environment and try which gcc? And gcc -v? And should the version happen to be 13.3, could you please try mamba install gcc=12.4? This fixed it for me.
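For reference, the checks and the fix look roughly like this (a sketch; adjust the environment name to yours):
(openfold) $ which gcc
(openfold) $ gcc -v
(openfold) $ mamba install gcc=12.4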
@vaclavhanzl thanks for responding. My OS gcc is v12 -- I specifically deleted the existing symlink to gcc14 and recreated it pointing to gcc12, checking with gcc -v both in my OS and in my openfold environment. I will double-check and also try installing gcc=12.4 with mamba, and let you know if that fixes the issue.
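For anyone repeating this, the symlink swap looks roughly like the following (a sketch; the exact paths come from my Debian install and may differ on yours):
$ sudo rm /usr/bin/gcc
$ sudo ln -s /usr/bin/gcc-12 /usr/bin/gcc
$ gcc -v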
@blakemertz Please try this environment from my PR #496
@vaclavhanzl thanks for sharing. I noticed you are using your own cuda tools (not included in environment.yml). Are you installing from your Debian repositories or pulling them from the nvidia channel in conda?
Update: never mind -- I saw that it pulled in cudatoolkit (v11.8) when I created the environment.
@vaclavhanzl thanks again for all your help. My guess is that the interplay between gcc, numpy < 2, and pytorch with CUDA 12 was what broke my original environment. This was a time-consuming task on your part -- much appreciated.
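For anyone else hitting this combination, a quick sanity check of the versions inside the activated environment (just a sketch) is:
$ python -c "import numpy, torch; print(numpy.__version__, torch.__version__, torch.version.cuda)"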
While running the unit test after setting up the environment, I had 8 failed tests and had to modify two of the python scripts in the test directory as per #467 to reduce the number of failed tests to one:
./scripts/run_unit_tests.sh
[2024-10-22 21:41:10,915] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
s.................Using /home/centos/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/centos/.cache/torch_extensions/py310_cu121/evoformer_attn/build.ninja...
Building extension module evoformer_attn...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module evoformer_attn...
Time to load evoformer_attn op: 0.2760050296783447 seconds
............s...s.sss.ss.E...sssssssss.sss....ssssss..s.s.s.ss.s......s.s..ss...ss.s.s....s........
======================================================================
ERROR: test_import_jax_weights_ (tests.test_import_weights.TestImportWeights)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/shared/binaries/github/openfold/tests/test_import_weights.py", line 37, in test_import_jax_weights_
    import_jax_weights_(
  File "/shared/binaries/github/openfold/openfold/utils/import_weights.py", line 650, in import_jax_weights_
    data = np.load(npz_path)
  File "/shared/miniconda3/envs/openfold/lib/python3.10/site-packages/numpy/lib/npyio.py", line 427, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/shared/binaries/github/openfold/tests/openfold/resources/params/params_model_1_ptm.npz'
----------------------------------------------------------------------
Ran 117 tests in 56.967s
FAILED (errors=1, skipped=41)
Test(s) failed. Make sure you've installed all Python dependencies.
I suppose one could explicitly point to the params_model_1_ptm.npz file by passing the --jax_param_path flag, but I am not sure of the exact syntax for that. I will consider this closed for now. I hope your PR gets merged back into the pl_upgrades branch, because I am sure there are plenty of users running CUDA 12 and PyTorch 2 right now.
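For anyone hitting the same missing-params error: the test simply expects the AlphaFold weights at tests/openfold/resources/params/params_model_1_ptm.npz. A sketch of one way to satisfy it (the download-script name is my assumption -- check what your checkout actually ships):
$ mkdir -p tests/openfold/resources/params
$ # place params_model_1_ptm.npz (from the AlphaFold parameter release) in that directory,
$ # e.g. via the repo's scripts/download_alphafold_params.sh if it is present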
@blakemertz Thanks for all the tests. To answer your question (sorry, it was late at night here when I saw it): as you already noticed, most things come from the environment.yml. My latest PR #496 further limits what is used from the OS distribution - I guess it is now just the kernel module. For others coming here via searches, I'll document things in more detail. To get the kernel module, I did this on my Debian testing:
apt-get install nvidia-cuda-dev nvidia-cuda-toolkit linux-image-amd64 linux-headers-amd64
while having this in /etc/apt/sources.list:
deb http://deb.debian.org/debian/ testing main contrib non-free non-free-firmware
deb-src http://deb.debian.org/debian/ testing main contrib non-free non-free-firmware
deb http://security.debian.org/debian-security testing-security main contrib non-free non-free-firmware
deb-src http://security.debian.org/debian-security testing-security main contrib non-free non-free-firmware
deb http://deb.debian.org/debian/ testing-updates main contrib non-free non-free-firmware
deb-src http://deb.debian.org/debian/ testing-updates main contrib non-free non-free-firmware
Note that I explicitly avoided anything from the Nvidia website (I appreciate their nice efforts but using just the Debian repos is much simpler).
Even my apt-get setup is probably still overkill, installing things that will not be used. All you need at the OS level is for nvidia-smi to work:
hanzl@blackbox:~$ nvidia-smi
Wed Oct 23 09:42:53 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
hanzl@blackbox:~$ cat /proc/version
Linux version 6.11.2-amd64 (debian-kernel@lists.debian.org) (x86_64-linux-gnu-gcc-14 (Debian 14.2.0-6) 14.2.0, GNU ld (GNU Binutils for Debian) 2.43.1) #1 SMP PREEMPT_DYNAMIC Debian 6.11.2-1 (2024-10-05)
Using the environment with #496 applied, I get these versions:
(test_env5) hanzl@blackbox:~$ which nvcc
/home/hanzl/miniforge3/envs/test_env5/bin/nvcc
(test_env5) hanzl@blackbox:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
(test_env5) hanzl@blackbox:~$ which gcc
/home/hanzl/miniforge3/envs/test_env5/bin/gcc
(test_env5) hanzl@blackbox:~$ gcc --version
gcc (conda-forge gcc 12.4.0-0) 12.4.0
I did many desperate things in the past while trying to install OpenFold (all my other posts here are likely obsoleted by this one). If you are reading this, you have likely had your share of this pain, too. I learned that, apart from installing what works, it is even more important to uninstall what you installed earlier while searching for a working setup. Seriously, if a clean OS install is possible for you, it is a good start. Your previous experiments likely left you in a minefield of pitfalls that make debugging OpenFold's own problems extremely hard. You may try some cleanups I did in the past:
If your monitor is NOT plugged into your GPU (and you use the GPU just for CUDA), you may do things as drastic as:
apt-get remove 'nvidia-*' 'libnvidia-*'
etc., until dpkg -l | grep nvidia returns nothing. Maybe do something similar for packages with 'cuda' in the name.
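Something along these lines (a sketch -- be careful with package wildcards and review what apt plans to remove before confirming):
$ dpkg -l | grep -i cuda
$ apt-get remove 'nvidia-cuda-*'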
Equally important is to clean up anything Python-related. If you experimented with various ways to make Python virtual environments, you can have nasty landmines waiting in some very obscure places, triggered only for certain versions of Python. The search for a good Python version in a good environment for OpenFold can easily be spoiled by this. Verify the directories along Python's library import path sys.path; maybe part of some old torch is sitting there. My ghost was hidden in /home/hanzl/.local/lib/python3.9/site-packages.
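A quick way to inspect that path and spot leftovers (a sketch):
$ python -c "import sys; print('\n'.join(sys.path))"
$ ls ~/.local/lib/python*/site-packages   # stray pip --user installs here can shadow your conda env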
@blakemertz And for this issue #494 - I guess it should stay open until PR #496 (or something similar) is merged?
PR #496 is now merged, so I think this issue can be closed (please @blakemertz - it looks like I cannot do that, but you can; thanks).
I have tried several permutations to get OpenFold to install on my local machine, but no joy up to this point. I could use some help, as I need to install OpenFold as a dependency for a couple of other codes (in particular DiffDock-L). Here is my GPU, driver, and CUDA:
The v12 gcc/g++/gfortran on my OS is 12.4 -- I believe 12.2 is the highest version supported by CUDA 12.1/12.2, but 12.4 is what is included in the Debian testing repos.
My packages for the openfold environment, pulled from the pl_upgrades branch to be able to use PyTorch v2 and CUDA 12:
During installation of the third-party dependencies, I get the following output, indicating that they did not install (setup.py install is part of this process and failed to run):
This is where I am stuck -- I don't really know what to do with the "Error compiling objects for extension". I have already looked at #403, #462, and #477 and have done my best to implement their suggestions, but obviously do not have a fully working environment.
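One thing I may try next (untested, just a sketch) is forcing the extension build to use the compilers from the activated environment rather than whatever the build picks up from the OS:
$ conda activate openfold
$ CC=$(which gcc) CXX=$(which g++) python setup.py install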