Error while running the protein example

abhinavb22 commented 5 months ago

Hello, I am trying to run the example code to predict the protein monomer 7u7w_A, and I get the following error when running `./SE3nv-20240131.sif -m rf2aa.run_inference --config-name protein`:

/usr/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'protein': Defaults list is missing _self_. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
Using the cif atom ordering for TRP.
./make_msa.sh examples/protein/7u7w_A.fasta 7u7w_protein/A 4 64 pdb100_2022Apr19/pdb100_2022Apr19
./make_msa.sh: line 22: signalp6: command not found
Running HHblits against UniRef30 with E-value cutoff 1e-10
./make_msa.sh: line 48: hhblits: command not found
./make_msa.sh: line 50: hhfilter: command not found
./make_msa.sh: line 51: hhfilter: command not found
grep: 7u7w_protein/A/hhblits/t000.1e-10.id90cov75.a3m: No such file or directory
grep: 7u7w_protein/A/hhblits/t000.1e-10.id90cov50.a3m: No such file or directory
Running HHblits against UniRef30 with E-value cutoff 1e-6
./make_msa.sh: line 48: hhblits: command not found
./make_msa.sh: line 50: hhfilter: command not found
./make_msa.sh: line 51: hhfilter: command not found
grep: 7u7w_protein/A/hhblits/t000.1e-6.id90cov75.a3m: No such file or directory
grep: 7u7w_protein/A/hhblits/t000.1e-6.id90cov50.a3m: No such file or directory
Running HHblits against UniRef30 with E-value cutoff 1e-3
./make_msa.sh: line 48: hhblits: command not found
./make_msa.sh: line 50: hhfilter: command not found
./make_msa.sh: line 51: hhfilter: command not found
grep: 7u7w_protein/A/hhblits/t000.1e-3.id90cov75.a3m: No such file or directory
grep: 7u7w_protein/A/hhblits/t000.1e-3.id90cov50.a3m: No such file or directory
Running HHblits against BFD with E-value cutoff 1e-3
./make_msa.sh: line 82: hhblits: command not found
./make_msa.sh: line 84: hhfilter: command not found
./make_msa.sh: line 85: hhfilter: command not found
grep: 7u7w_protein/A/hhblits/t000.1e-3.bfd.id90cov75.a3m: No such file or directory
grep: 7u7w_protein/A/hhblits/t000.1e-3.bfd.id90cov50.a3m: No such file or directory
cp: cannot stat '7u7w_protein/A/hhblits/t000.1e-3.bfd.id90cov50.a3m': No such file or directory
Running PSIPRED
./make_msa.sh: line 112: 7u7w_protein/A/log/make_ss.stdout: No such file or directory
Running hhsearch
cat: 7u7w_protein/A/t000.ss2: No such file or directory
cat: 7u7w_protein/A/t000.msa0.a3m: No such file or directory
./make_msa.sh: line 120: hhsearch: command not found
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/abhinav22/Gohillab_AF/rosettafold_database/RoseTTAFold-All-Atom-main/rf2aa/run_inference.py", line 206, in main
    runner.infer()
  File "/home/abhinav22/Gohillab_AF/rosettafold_database/RoseTTAFold-All-Atom-main/rf2aa/run_inference.py", line 153, in infer
    self.parse_inference_config()
  File "/home/abhinav22/Gohillab_AF/rosettafold_database/RoseTTAFold-All-Atom-main/rf2aa/run_inference.py", line 46, in parse_inference_config
    protein_input = generate_msa_and_load_protein(
  File "/home/abhinav22/Gohillab_AF/rosettafold_database/RoseTTAFold-All-Atom-main/rf2aa/data/protein.py", line 93, in generate_msa_and_load_protein
    return load_protein(str(msa_file), str(hhr_file), str(atab_file), model_runner)
  File "/home/abhinav22/Gohillab_AF/rosettafold_database/RoseTTAFold-All-Atom-main/rf2aa/data/protein.py", line 56, in load_protein
    msa, ins, taxIDs = parse_a3m(msa_file)
  File "/home/abhinav22/Gohillab_AF/rosettafold_database/RoseTTAFold-All-Atom-main/rf2aa/data/parsers.py", line 415, in parse_a3m
    fstream = open(filename, 'r')
FileNotFoundError: [Errno 2] No such file or directory: '7u7w_protein/A/t000.msa0.a3m'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Do I have to install signalp6, hhblits, etc. separately?

r-krishna commented 5 months ago

Hello, there are updated installation details in main now, courtesy of @amorehead. Please let me know if you still run into issues.

Takk522 commented 5 months ago

Perhaps you need to modify the HHLIB path on line 33 of make_msa.sh. Run 'which hhblits' in your terminal, copy the path it prints, and use it as the HHLIB value, making sure to leave off the 'hhblits/' part.
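
A minimal sketch of that check, assuming Python 3 is available in the same environment that runs make_msa.sh. The tool list is taken from the "command not found" lines in the log above; the helper names `find_tools` and `report` are made up for illustration. It shows which tools resolve on PATH and the directory each one lives in, which is the kind of path the HHLIB suggestion starts from:

```python
import shutil
from pathlib import Path

# Tools reported as "command not found" in the log earlier in this thread.
REQUIRED_TOOLS = ["signalp6", "hhblits", "hhfilter", "hhsearch"]


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def report(found):
    """Turn the mapping from find_tools() into printable lines."""
    lines = []
    for tool, path in found.items():
        if path is None:
            lines.append(f"{tool}: NOT FOUND on PATH")
        else:
            # Path(path).parent is the directory that contains the executable.
            lines.append(f"{tool}: {path} (directory: {Path(path).parent})")
    return lines


if __name__ == "__main__":
    print("\n".join(report(find_tools())))
```

If any tool is listed as not found, the environment that provides it is probably not active in the shell that launches the inference command.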

sean-workman commented 5 months ago

I think there are more problems with make_msa.sh than just the hhblits path. The database paths seem to be hard-coded to something weird as well. I am getting an error from the script, which is looking for a uniclust directory that doesn't exist.

Takk522 commented 5 months ago

> I think there are more problems with make_msa.sh than just the hhblits path. The database paths seem to be hard-coded to something weird as well. I am getting an error from the script, which is looking for a uniclust directory that doesn't exist.

You should change the name of the folder from "UniRef30_2020_06" to "uniclust". Simply execute the command "mv UniRef30_2020_06 uniclust" within the RoseTTAFold-All-Atom folder.

sean-workman commented 5 months ago

This doesn't solve the problem unfortunately.

In make_msa.sh I see:

# sequence databases
DB_UR30="$PIPE_DIR/uniclust/UniRef30_2021_06"
DB_BFD="$PIPE_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"

Are those meant to be paths to directories? If so, it seems to make more sense to just remove the uniclust part of the path, because following the installation directions in the README gives UniRef30_2021_06 as a directory in the RoseTTAFold-All-Atom folder. I'm unsure what is going on with the bfd part. Following the instructions in the README gives a directory named bfd containing:

bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata    bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex   bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata  bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
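
For what it's worth, HH-suite treats a database argument as a file-name prefix rather than a directory, which matches the listing above: six files sharing one prefix. A minimal sketch for checking whether a given prefix is complete, assuming the `_a3m`, `_cs219`, and `_hhm` suffixes seen in that listing (the helper name `missing_db_files` is hypothetical, not something from the repo):

```python
from pathlib import Path

# The six files an HH-suite database prefix is expected to expand to,
# matching the bfd directory listing above.
SUFFIXES = [
    f"_{kind}.{ext}"
    for kind in ("a3m", "cs219", "hhm")
    for ext in ("ffdata", "ffindex")
]


def missing_db_files(prefix):
    """Return the expected database files that do not exist for this prefix."""
    return [prefix + suffix for suffix in SUFFIXES if not Path(prefix + suffix).is_file()]


if __name__ == "__main__":
    # Example prefix taken from the DB_BFD line in make_msa.sh, relative to the repo folder.
    prefix = "bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
    for name in missing_db_files(prefix):
        print("missing:", name)
```

An empty result means the prefix resolves to a full set of files; running the same check on the DB_UR30 value would show whether the extra uniclust path component is the problem.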

When I try to run any predictions I see:

[...]anaconda/envs/RFAA/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'nucleic_acid': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
Using the cif atom ordering for TRP.
./make_msa.sh examples/protein/7u7w_A.fasta 7u7w_protein_nucleic/A 4 64  pdb100_2021Mar03/pdb100_2021Mar03
Predicting: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.03sequences/s]
Running HHblits against UniRef30 with E-value cutoff 1e-10
- 19:27:49.919 ERROR: In /opt/conda/conda-bld/hhsuite_1709621322429/work/src/ffindexdatabase.cpp:11: FFindexDatabase:

I've truncated the path there, as well as the error, because it's really all just saying that files couldn't be opened or don't exist, as in the original post here. What I'm most confused about is that it reports an error in /opt/conda/conda-bld/hhsuite_1709621322429/work/src/ffindexdatabase.cpp... Our conda installation on this machine isn't located in opt! So it's kind of unclear what the heck is going on, or what first steps to try in order to fix it.

Any help in the right direction would be much appreciated! :)

sean-workman commented 5 months ago

It seems my issue may have been fixed by altering database names as suggested in #24.

The README gives wget http://wwwuser.gwdg.de/~compbiol/uniclust/2020_06/UniRef30_2020_06_hhsuite.tar.gz as the link to download the UniRef database, but in #24 you mention that on your system it's 2021_03, and the version in make_msa.sh in the repo is 2021_06.

Should I be downloading a more up to date version?

abhinavb22 commented 5 months ago

I was able to run the code, but it has been stuck at the UniRef30 step for about 2 days now:

Using the cif atom ordering for TRP.
./make_msa.sh examples/protein/7u7w_A.fasta 7u7w_protein/A 4 64 pdb100_2021Mar03/pdb100_2021Mar03
Predicting: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 2.68sequences/s]
Running HHblits against UniRef30 with E-value cutoff 1e-10

Our workstation has 12 cores, 32 GB of memory, and 2 A4000 GPUs. Aren't these sufficient to run these examples? Also, how long does this example typically take to complete?

amorehead commented 5 months ago

I also noticed some very long wait times for some of these example inputs. My custom inputs have run much quicker than this.

sean-workman commented 5 months ago

It took <10 minutes to get through the input preparation step for me once I got the bash scripts sorted out, but we ended up running out of GPU memory straight away once that hurdle was passed. Old GPUs though, not entirely unexpected I suppose.

I'm not sure if the memory is allocated dynamically based on your machine, but from looking at output of the inference pipeline, the make_msa.sh script appears to allocate 4 CPUs and 64 GB RAM.

./make_msa.sh examples/protein/7u7w_A.fasta 7u7w_protein/A 4 64 pdb100_2021Mar03/pdb100_2021Mar03

I wonder if this allocation of RAM is the problem for you? @abhinavb22
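
A rough way to compare the logged request against a machine, as a sketch only. It assumes the third and fourth arguments after the script name are the CPU count and memory in GB (that is my reading of the output above, not something confirmed from the script), and the helper names are hypothetical. Whether make_msa.sh actually enforces the memory figure is not established in this thread.

```python
import os


def parse_make_msa_command(command):
    """Pull the CPU and memory figures out of a logged make_msa.sh command.

    ASSUMPTION: the positions are <script> <fasta> <out_dir> <cpus> <mem_gb> <template_db>.
    """
    parts = command.split()
    return {"cpus": int(parts[3]), "mem_gb": int(parts[4])}


def over_budget(requested, available_cpus, available_mem_gb):
    """List which requested resources exceed what the machine has."""
    problems = []
    if requested["cpus"] > available_cpus:
        problems.append("cpus")
    if requested["mem_gb"] > available_mem_gb:
        problems.append("memory")
    return problems


def machine_resources():
    """CPU count and total physical RAM in GB (the sysconf names used here exist on Linux)."""
    cpus = os.cpu_count() or 1
    mem_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    return cpus, mem_gb


if __name__ == "__main__":
    logged = "./make_msa.sh examples/protein/7u7w_A.fasta 7u7w_protein/A 4 64 pdb100_2021Mar03/pdb100_2021Mar03"
    print(over_budget(parse_make_msa_command(logged), *machine_resources()))
```

With the 12 cores and 32 GB mentioned above, `over_budget` would flag only the memory figure for this command.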

abhinavb22 commented 5 months ago

I tried my custom protein, and it still gets stuck at the Uniclust step. It looks like pdb100 runs perfectly in both cases:

Using the cif atom ordering for TRP.
./make_msa.sh examples/coa6.fasta coa6/A 12 16 pdb100_2021Mar03/pdb100_2021Mar03
Predicting: 100%|██████████████████████████| 1/1 [00:00<00:00, 2.86sequences/s]
Running HHblits against UniRef30 with E-value cutoff 1e-10

I tried changing the CPUs to 12 and the memory to 16 to see if this is an allocation issue, but it still doesn't work. I will try running this at our high-performance computing facility, which has lots of memory, and see if that's the issue.

Sue-Fwl commented 1 month ago

@abhinavb22, any updates on this matter, please?

baker-laboratory / RoseTTAFold-All-Atom

Error while running the protein example #17