aqlaboratory / openfold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Apache License 2.0

Alignment taking too long #203

Open calmasri opened 2 years ago

calmasri commented 2 years ago

I was trying to generate new alignments using the precompute_alignments_mmseqs.py script:

python3 scripts/precompute_alignments_mmseqs.py /fasta_dir/query_seqs.fasta \
    data/mmseqs_dbs \
    uniref30_2103_db \
    /fasta_dir \
    /data/MMseqs2/build/bin/mmseqs \
    --hhsearch_binary_path /usr/bin/hhsearch \
    --env_db colabfold_envdb_202108_db \
    --pdb70 data/pdb70/pdb70

Where query_seqs.fasta was generated from scripts/data_dir_to_fasta.py and contains almost all the structures in data/pdb_mmcif/mmcif_files (minus ~500-1000 structures).

I'm running on a machine with the following specs: 4 Tesla V100 GPUs (GPU memory: 64 GB), 32 CPUs, and 244 GB of memory.

The script has been running for about 5 days now, and I'm not sure whether that's normal. How long should it normally take, and would I need more than 3 TB of storage space allocated for the output?
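
For anyone sizing a similar run, here is a minimal sketch (not part of OpenFold) that counts the query sequences in a FASTA file and reports the free space on the output volume. The function names are hypothetical, and it assumes one `>` header line per sequence. It does not predict runtime; it only gives the two numbers that the time and storage questions above depend on.

```python
import shutil


def count_fasta_records(fasta_path):
    """Count sequences by counting header lines (lines starting with '>')."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def free_space_tb(directory):
    """Free space on the volume holding `directory`, in TB (10**12 bytes)."""
    return shutil.disk_usage(directory).free / 1e12


# Example usage (paths are placeholders taken from the command above):
#   n = count_fasta_records("/fasta_dir/query_seqs.fasta")
#   print(f"{n} query sequences, {free_space_tb('/fasta_dir'):.2f} TB free")
```

Comparing the sequence count against the number of alignment directories already written gives a rough sense of how far along the run is.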

gahdritz commented 2 years ago

Are you regenerating PDB alignments? There's no need to do that; we've pre-computed them all. See the RODA repository linked in the README.

lzhangUT commented 2 years ago

Same issue here. I used the same script as outlined in the README. The query sequence is just a regular 237 aa sequence:

python3 scripts/precompute_alignments_mmseqs.py /fasta_dir/query_seqs.fasta \
    data/mmseqs_dbs \
    uniref30_2103_db \
    /fasta_dir \
    /data/MMseqs2/build/bin/mmseqs \
    --hhsearch_binary_path /usr/bin/hhsearch \
    --env_db colabfold_envdb_202108_db \
    --pdb70 data/pdb70/pdb70

player1321 commented 1 year ago

Hi, @gahdritz, how much space does this data need?

gahdritz commented 1 year ago

I think the entire thing is around 2TB, but you can download subsets of it.