aqlaboratory / openfold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Apache License 2.0

multimer using mmseqs generated sequence alignments? #377

Open emzodls opened 7 months ago

emzodls commented 7 months ago

Hello, I'm trying to run OpenFold multimer inference on some FASTA files. I've been using the ColabFold databases to generate the sequence alignments, since these are smaller than the AF2 databases. This has worked for single sequences, but I'm having trouble getting it to work for multimer inference. I'm getting this error:

ValueError: Missing 'uniprot_hits.sto' This is required for Multimer MSA pairing.

Is there a way to use mmseqs alignments for multimer inference, or do I have to use the AF2 alignment pipeline? Thanks.

christinaflo commented 7 months ago

Hi, yes, currently you do need to use the AF2 alignment pipeline, but only for the UniProt alignments. If you already have the mmseqs alignments, you can precompute just the UniProt files as follows and skip all the other AF2 alignments:

python scripts/precompute_alignments.py \
<input_dir> \
<output_dir> \
--uniprot_database_path <path_to_dbs>/uniprot/uniprot.fasta \
--jackhmmer_binary_path <path_to_jackhmmer_binary>

I'll look into adding functionality to avoid/replace this step.
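Before launching multimer inference, it may help to check which alignment directories still lack the UniProt file named in the error message. The following is a minimal sketch, not part of OpenFold: the helper name dirs_missing_uniprot is hypothetical, and it assumes the alignment output directory contains one sub-directory per chain.

```python
from pathlib import Path

# File name taken from the ValueError quoted in this thread.
REQUIRED = "uniprot_hits.sto"


def dirs_missing_uniprot(alignment_root):
    """Return the names of sub-directories that lack uniprot_hits.sto.

    Assumption: alignment_root holds one sub-directory per chain.
    """
    root = Path(alignment_root)
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and not (d / REQUIRED).is_file()
    )
```

Calling dirs_missing_uniprot("<output_dir>") returns a sorted list of the chain directories that still need the jackhmmer run against UniProt. An empty list means every chain already has the file.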

jflucier commented 2 months ago

Hi,

when I run this command:

python ${OF_SCRIPTS}/precompute_alignments.py \
${IN} \
${OUT} \
--uniprot_database_path /tank/jflucier/mmseqs_dbs/uniprot/uniprot.fasta \
--jackhmmer_binary_path jackhmmer

I get this warning: WARNING:root:More than one input_sequence found in DTX1_DTX2.fa

Is this normal behavior? My fasta has 2 entries and looks like this:

>prot1
MSRPGHGGL.....
>prot2
MAMAPSPSLVQ...

Do I need to split the FASTA file before running this command?

Thanks for your help,
JF
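The thread does not say whether precompute_alignments.py requires one sequence per file, so the warning may be harmless. If splitting does turn out to be necessary, a multi-record FASTA file can be separated with a few lines of standard-library Python. This is only a sketch under assumptions: read_fasta and split_fasta are hypothetical helpers, not OpenFold functions, and record headers are assumed to be non-empty with unique first tokens, which are used as file names.

```python
from pathlib import Path


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_fasta(fasta_path, out_dir):
    """Write each record of fasta_path to its own file in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for header, seq in read_fasta(Path(fasta_path).read_text()):
        # Assumption: the first token of the header is a usable, unique file name.
        name = header.split()[0]
        target = out / f"{name}.fasta"
        target.write_text(f">{header}\n{seq}\n")
        written.append(target)
    return written
```

For the file shown above, split_fasta("DTX1_DTX2.fa", "split") would write split/prot1.fasta and split/prot2.fasta, each holding a single record.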