KosinskiLab / AlphaPulldown

https://doi.org/10.1093/bioinformatics/btac749
GNU General Public License v3.0

question: correct pipeline for mmseqs alignments in alphapulldown? #391

Open gieses opened 2 months ago

gieses commented 2 months ago

Currently, I am not able to combine mmseqs alignments with the most recent alphapulldown code base. I have a few questions to check if I understood correctly from here and if the information is still correct.

1) I ran the alignment with colabfold and got the .a3m files (is it correct that these are single alignments, e.g. all baits and the query in individual entries in the fasta or should it always be baits + query separated by ":"?)

Note: Colabfold changed the naming to return the full fasta header, replacing most of the unfriendly characters with "_"

2) Is the renaming script still needed here? When I ran it in this directory it renames everything to a single file 101.a3m

3) Proceeding with the FASTA names (without the renaming because it seems to be a valid format), I tried to run

(alphapulldown) ➜  AlphaPulldown git:(main) ✗ python alphapulldown/scripts/create_individual_features.py --fasta_paths=example/fasta/all.fasta --data_dir=/data/openfold/ --output_dir=example/aln/af_colab --max_template_date=2025-01-01

where example/aln/af_colab pointed to the directory with the a3m files and all.fasta contained all single sequences.

4) Running this command starts jackhmmer again, which I think should not happen? It says "Will use hmmsearch looking for templates", so maybe that is intended?

5) Trying to follow the Option 2 details for the mmseqs guideline, I run into two issues: 1) the max template date is required and 2) if this is set path_to_mmt is required.

So trying this, for example, does not work:

python alphapulldown/scripts/create_individual_features.py --fasta_paths=example/fasta/all.fasta --data_dir=/data/openfold/ --output_dir=example/aln/af_colab --skip_existing=False --use_mmseqs2=True --max_template_date=2025-01-01

Adding use_precomputed_msas does not help either; it just starts the jackhmmer alignment vs. uniref.

This is running with colabfold 1.5.5, AlphaPulldown (30efa3f).

Thanks for any help and comments!

DimaMolod commented 1 month ago

Hi @gieses, and thanks for your interest in AlphaPulldown! I am sorry, I didn't write this part of the code, but I think you can generate features with mmseqs directly using AlphaPulldown and the ColabFold remote API. Could you please try installing the latest beta release of AP and generating features with a command like this one:

create_individual_features.py --fasta_paths=/path/to/fasta.fasta --data_dir=/path/to/alphafold/databases --save_msa_files=True --output_dir=/path/to/output/dir --max_template_date=2050-01-01 --skip_existing=True (optional) --use_mmseqs2

Please let me know if it works for you now.

jkosinski commented 1 month ago

Dear @gieses, if you still want to use the local colabfold search alignments instead of the API suggested by @DimaMolod, this indeed got broken by the changes in colabfold and we need to fix it. If it starts the jackhmmer alignment vs. uniref, it likely means that the renaming script is still needed and has to be fixed, as use_precomputed_msas apparently does not recognize the MSAs and re-runs the search.
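
As a quick way to check whether file naming is the reason the MSAs are not picked up, a small script along these lines could list which FASTA entries have no matching .a3m file. This is only a sketch: the function names are hypothetical, and the `sanitize` rule is an assumption about how headers are turned into file names, not the actual ColabFold or AlphaPulldown logic, so adjust it to whatever your files really look like.

```python
import re
from pathlib import Path


def read_fasta_headers(fasta_path):
    """Return the header lines (without the leading '>') of a FASTA file."""
    headers = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                headers.append(line[1:].strip())
    return headers


def sanitize(header):
    # ASSUMPTION: every character other than letters, digits, '.', '-' and '_'
    # is replaced with '_'. The real naming rule may differ.
    return re.sub(r"[^A-Za-z0-9._-]", "_", header)


def find_missing_msas(fasta_path, msa_dir):
    """Map each FASTA header without a matching .a3m file to the expected file name."""
    existing = {p.name for p in Path(msa_dir).glob("*.a3m")}
    missing = {}
    for header in read_fasta_headers(fasta_path):
        expected = sanitize(header) + ".a3m"
        if expected not in existing:
            missing[header] = expected
    return missing
```

Calling `find_missing_msas("example/fasta/all.fasta", "example/aln/af_colab")` with the paths from the report above would return an empty dict if every entry has a file with the assumed name, and otherwise show which names were expected but not found.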

Regarding the template search: unfortunately, in this mode the template search currently still needs to be run locally.

The "1) the max template date is required and 2) if this is set path_to_mmt is required" behavior is unintended and we should fix it.

Thanks for testing this mode and reporting the issues in detail!