Using precomputed MSA and PDB files for running massive 3d structure prediction

YoshitakaMo / localcolabfold

ColabFold on your local PC

MIT License


Using precomputed MSA and PDB files for running massive 3d structure prediction #274

Open berkeucar opened 1 week ago

berkeucar commented 1 week ago

Hello,

I have a FASTA file containing thousands of peptide sequences. I want to predict their 3D structures using LocalColabFold 1.5.5, which is installed on an HPC cluster, and I also have access to GPU clusters. I was able to generate the PDB and MSA files by following this issue: https://github.com/sokrypton/ColabFold/issues/563.

However, as I mentioned, my FASTA file contains many peptides, and I would like to use my GPU access to predict their 3D structures with the colabfold_batch command, using the PDB and MSA files I precomputed on the HPC cluster. This was asked in the linked issue but seems to have gone unnoticed.

Currently, does LocalColabFold support large-scale prediction of peptides with the --pdb-hit-file flag?

YoshitakaMo commented 1 week ago

Did this not work?: https://github.com/sokrypton/ColabFold/issues/563#issuecomment-1914101245

I use the following command for the prediction:

colabfold_batch --pdb-hit-file foobar_pdb100_230517.m8 --local-pdb-path /home/database/pdb_mmcif/mmcif_files foobar.a3m <outputdir>

/home/database/pdb_mmcif/mmcif_files contains more than 220,000 flattened 4-letter mmCIF files.

berkeucar commented 1 week ago

So, basically, I joined all my peptide sequences into a single record, using ":" as the separator between them. Let's say that file is named tmp.fasta. I obtained the files tmp.a3m and tmp_pdb100_230517.m8 from the colabfold_search command. I then ran the following command:

colabfold_batch \
  --amber \
  --templates \
  --num-recycle 3 \
  --use-gpu-relax \
  --pdb-hit-file tmp_pdb100_230517.m8 \
  --local-pdb-path my_local_pdb/pdb_mmcif/mmcif_files \
  --random-seed 0 \
  --zip \
  tmp_pdb100_230517.m8 \
  output_folder

and I received the following error:

Could not generate input features tmp: string index out of range
= generate_input_feature(query_seqs_unique, query_seqs_cardinality, unpaired_msa, paired_msa,
   File "localacolabfold_env/bin/lib/python3.10/site-packages/colabfold/batch.py", line 1035, in generate_input_feature
     features_for_chain[protein.PDB_CHAIN_IDS[chain_cnt]] = feature_dict
 IndexError: string index out of range
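One possible reading of this traceback, offered as a sketch and not as a confirmed diagnosis: ColabFold treats ":"-joined sequences as the chains of a single complex, and the failing line looks up one chain-ID character per chain. The snippet below assumes the chain-ID alphabet is the 62-character string (upper case, lower case, digits) used by AlphaFold's protein module; `assign_chain_ids` and `split_joined_query` are hypothetical helpers written for illustration and are not part of ColabFold. It shows how indexing past 62 chains raises the same IndexError, and how a joined query could be turned back into separate single-chain FASTA records.

```python
import string

# Assumption: mirrors the 62-character chain-ID alphabet in AlphaFold's
# protein module (A-Z, a-z, 0-9). Defined locally so the sketch is standalone.
PDB_CHAIN_IDS = string.ascii_uppercase + string.ascii_lowercase + string.digits


def assign_chain_ids(num_chains):
    """Hypothetical helper: pick one chain-ID character per chain by index."""
    return [PDB_CHAIN_IDS[i] for i in range(num_chains)]


def split_joined_query(header, joined):
    """Hypothetical helper: turn one ':'-joined query into separate FASTA records."""
    parts = [p for p in joined.split(":") if p]
    return "".join(
        f">{header}_{i}\n{seq}\n" for i, seq in enumerate(parts, start=1)
    )


if __name__ == "__main__":
    print(len(PDB_CHAIN_IDS))  # 62
    try:
        # Thousands of ':'-joined peptides would need thousands of chain IDs.
        assign_chain_ids(1000)
    except IndexError as err:
        print("IndexError:", err)
    # Toy sequences, for illustration only.
    print(split_joined_query("tmp", "ACDE:FGHIK"), end="")
```

If this reading is right, running colabfold_search on a FASTA file with one record per peptide, instead of one ":"-joined record, would avoid the chain limit, since each peptide would then be an independent single-chain query.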

YoshitakaMo commented 1 week ago

Please show me your commit hash. For example, ColabFold on my machine is at 1ccca5a53d20c909f3ccf8a4b81df804e6717cb1, which is the commit from Jul. 23, 2024.

2024-11-11 00:18:05,900 Running colabfold 1.5.5 (1ccca5a53d20c909f3ccf8a4b81df804e6717cb1)
2024-11-11 00:18:06,190 Running on GPU
2024-11-11 00:18:06,859 Found 5 citations for tools or databases
...
...
...

If your commit hash number is old, updating LocalColabFold will fix this issue.
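For anyone collecting this information from many runs, here is a minimal sketch that pulls the version and commit hash out of a log line shaped like the first one quoted above. The regular expression and the function name `parse_version_line` are assumptions made for illustration; the pattern is based only on that single quoted line and may not match other ColabFold versions.

```python
import re

# Assumption: the log line looks like
#   "... Running colabfold <version> (<commit hash>)"
# as in the example quoted in this thread.
VERSION_LINE = re.compile(r"Running colabfold (\S+) \(([0-9a-f]{7,40})\)")


def parse_version_line(line):
    """Return (version, commit_hash) if the line matches, else None."""
    match = VERSION_LINE.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    example = (
        "2024-11-11 00:18:05,900 Running colabfold 1.5.5 "
        "(1ccca5a53d20c909f3ccf8a4b81df804e6717cb1)"
    )
    print(parse_version_line(example))
```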

berkeucar commented 1 week ago

Just in case, I did a fresh install of LocalColabFold with the script install_colabfold_batch_linux.sh. Now I cannot even obtain the MSA files; it gets stuck on the MSA of the first peptide in the batch:

k-mer similarity threshold: 110
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 238
Target db start 1 to 209335862
[>                                                                ] 1.27% 4 eta 0s

I am running this on CPUs and my gcc version is 9.4.0.