FreshAirTonight / af2complex

Predicting direct protein-protein interactions with AlphaFold deep learning neural network models.

Error when generating features with `feature_mode='multimer'` #16

Open guzmanfj opened 1 year ago

guzmanfj commented 1 year ago

I want to generate features for a protein complex with a modified version of the example/run_fea_gen.sh script:

#!/bin/bash
# An example script for feature generation. It is heavily dependent on your installation,
# because it relies on many third-party tools and multiple sequence libraries.
#
# You need to take care of these paths, python environment, and third-party sequence tools.
#. load_alphafold  ## set up proper AlphaFold conda environment.

DATA_DIR=/ibex/ai/reference/KSL/alphafold/2.3.1
af_dir=../src

if [ $# -eq 0 ]
  then
    echo "Usage: $0 <seq_file>"
    exit 1
fi
fasta_path=$1
out_dir=af2c_fea_test

# choices are "reduced_dbs", "full_dbs", "uniprot"
db_preset='full_dbs'

# choices are "monomer", "multimer", "monomer+species", "monomer+fullpdb"
# Options "monomer" and "multimer" follow the official AlphaFold data pipelines for monomeric and
# multimeric structure predictions, respectively.
#
# Option "monomer+species" is a modified monomeric pipeline in which the species information
# is recorded for MSA pairing using only monomeric input features. This option is recommended.
#feature_mode='monomer+species'
#
# Option "monomer+fullpdb": in addition to adding species information, it uses the multimer template
# pipeline rather than the original monomer template pipeline. The multimer template pipeline
# searches the full PDB for templates, which is more comprehensive than the monomer template pipeline.
# feature_mode='monomer+fullpdb'
feature_mode='multimer'

#max_template_date=2020-05-15  # CASP14 starting date
max_template_date=$(date +"%Y-%m-%d")  # current date

echo "Info: sequence file is $fasta_path"
echo "Info: out_dir is $out_dir"
echo "Info: db_preset is $db_preset"
echo "Info: feature mode is $feature_mode"
echo "Info: max_template_date is $max_template_date"

##########################################################################################

python $af_dir/run_af2c_fea.py --fasta_paths=$fasta_path --db_preset=$db_preset \
  --data_dir=$DATA_DIR --output_dir=$out_dir      \
  --uniprot_database_path=$DATA_DIR/uniprot/uniprot.fasta \
  --uniref90_database_path=$DATA_DIR/uniref90/uniref90.fasta \
  --mgnify_database_path=$DATA_DIR/mgnify/mgy_clusters_2022_05.fa \
  --pdb_seqres_database_path=$DATA_DIR/pdb_seqres/pdb_seqres.txt \
  --bfd_database_path=$DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --uniclust30_database_path=$DATA_DIR/uniref30/UniRef30_2022_02 \
  --template_mmcif_dir=$DATA_DIR/pdb_mmcif/mmcif_files  \
  --max_template_date=$max_template_date                 \
  --obsolete_pdbs_path=$DATA_DIR/pdb_mmcif/obsolete.dat \
  --feature_mode=$feature_mode \
  --use_precomputed_msas=True

When running the script I obtain the following error:

$ ./run_fea_gen_mod.sh Q9S3U9-6.fasta
Info: sequence file is Q9S3U9-6.fasta
Info: out_dir is af2c_fea_test
Info: db_preset is full_dbs
Info: feature mode is multimer
Info: max_template_date is 2023-03-25
add_species is False
I0325 16:32:42.717077 47109242920640 templates.py:857] Using precomputed obsolete pdbs /ibex/ai/reference/KSL/alphafold/2.3.1/pdb_mmcif/obsolete.dat.
I0325 16:32:42.721372 47109242920640 run_af2c_fea.py:282] Using random seed 372986757380479995 for the data pipeline
Info: working on target Q9S3U9-6 at gpu202-23-l
I0325 16:32:42.721538 47109242920640 run_af2c_fea.py:144] Predicting Q9S3U9-6
I0325 16:32:42.726290 47109242920640 pipeline_multimer.py:287] Running monomer pipeline on chain A: sp|Q9S3U9|VIOC_CHRVO
I0325 16:32:42.726786 47109242920640 jackhmmer.py:133] Launching subprocess "/ibex/sw/csg/alphafold/2.3.1/el7.9_conda/miniconda3/envs/alphafold_2.3.1/bin/jackhmmer -o /dev/null -A /tmp/tmp5q6it5mi/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /tmp/tmpvq51lmhm.fasta /ibex/ai/reference/KSL/alphafold/2.3.1/uniref90/uniref90.fasta"
I0325 16:32:42.730009 47109242920640 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0325 16:37:28.661425 47109242920640 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 285.931 seconds
I0325 16:37:28.665437 47109242920640 jackhmmer.py:133] Launching subprocess "/ibex/sw/csg/alphafold/2.3.1/el7.9_conda/miniconda3/envs/alphafold_2.3.1/bin/jackhmmer -o /dev/null -A /tmp/tmpbc32gpxf/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /tmp/tmpvq51lmhm.fasta /ibex/ai/reference/KSL/alphafold/2.3.1/mgnify/mgy_clusters_2022_05.fa"
I0325 16:37:28.670499 47109242920640 utils.py:36] Started Jackhmmer (mgy_clusters_2022_05.fa) query
I0325 16:47:29.123045 47109242920640 utils.py:40] Finished Jackhmmer (mgy_clusters_2022_05.fa) query in 600.452 seconds
I0325 16:47:29.134068 47109242920640 hmmbuild.py:121] Launching subprocess ['/ibex/sw/csg/alphafold/2.3.1/el7.9_conda/miniconda3/envs/alphafold_2.3.1/bin/hmmbuild', '--hand', '--amino', '/tmp/tmpe_2th29r/output.hmm', '/tmp/tmpe_2th29r/query.msa']
I0325 16:47:29.147607 47109242920640 utils.py:36] Started hmmbuild query
I0325 16:47:29.319181 47109242920640 hmmbuild.py:128] hmmbuild stdout:
# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# input alignment file:             /tmp/tmpe_2th29r/query.msa
# output HMM file:                  /tmp/tmpe_2th29r/output.hmm
# input alignment is asserted as:   protein
# model architecture construction:  hand-specified by RF annotation
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# idx name                  nseq  alen  mlen eff_nseq re/pos description
#---- -------------------- ----- ----- ----- -------- ------ -----------
1     query                  505   156   120     3.48  0.590

# CPU time: 0.15u 0.00s 00:00:00.15 Elapsed: 00:00:00.15

stderr:

I0325 16:47:29.319365 47109242920640 utils.py:40] Finished hmmbuild query in 0.172 seconds
I0325 16:47:29.319745 47109242920640 hmmsearch.py:103] Launching sub-process ['/ibex/sw/csg/alphafold/2.3.1/el7.9_conda/miniconda3/envs/alphafold_2.3.1/bin/hmmsearch', '--noali', '--cpu', '8', '--F1', '0.1', '--F2', '0.1', '--F3', '0.1', '--incE', '100', '-E', '100', '--domE', '100', '--incdomE', '100', '-A', '/tmp/tmpzilc_m4o/output.sto', '/tmp/tmpzilc_m4o/query.hmm', '/ibex/ai/reference/KSL/alphafold/2.3.1/pdb_seqres/pdb_seqres.txt']
I0325 16:47:29.331137 47109242920640 utils.py:36] Started hmmsearch (pdb_seqres.txt) query
I0325 16:47:38.230762 47109242920640 utils.py:40] Finished hmmsearch (pdb_seqres.txt) query in 8.899 seconds
Traceback (most recent call last):
  File "../src/run_af2c_fea.py", line 309, in <module>
    app.run(main)
  File "/ibex/sw/csg/alphafold/2.3.1/el7.9_conda/miniconda3/envs/alphafold_2.3.1/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/ibex/sw/csg/alphafold/2.3.1/el7.9_conda/miniconda3/envs/alphafold_2.3.1/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "../src/run_af2c_fea.py", line 289, in main
    predict_structure(
  File "../src/run_af2c_fea.py", line 155, in predict_structure
    feature_dict = data_pipeline.process(
  File "/ibex/user/guzmanfj/af2complex/src/alphafold/data/pipeline_multimer.py", line 341, in process
    chain_features = self._process_single_chain(
  File "/ibex/user/guzmanfj/af2complex/src/alphafold/data/pipeline_multimer.py", line 289, in _process_single_chain
    chain_features = self._monomer_data_pipeline.process(
  File "/ibex/user/guzmanfj/af2complex/src/alphafold/data/pipeline.py", line 238, in process
    msa_runner=self.hhblits_bfd_uniref_runner,
AttributeError: 'DataPipeline' object has no attribute 'hhblits_bfd_uniref_runner'

These are the contents of the Q9S3U9-6.fasta input file:

>sp|Q9S3U9|VIOC_CHRVO
MKRAIIVGGGLAGGLTAIYLAKRGYEVHVVEKRGDPLRDLSSYVDVVSSRAIGVSMTVRG
IKSVLAAGIPRAELDACGEPIVAMAFSVGGQYRMRELKPLEDFRPLSLNRAAFQKLLNKY
>sp|Q9S3U9|VIOC_CHRVO
MKRAIIVGGGLAGGLTAIYLAKRGYEVHVVEKRGDPLRDLSSYVDVVSSRAIGVSMTVRG
IKSVLAAGIPRAELDACGEPIVAMAFSVGGQYRMRELKPLEDFRPLSLNRAAFQKLLNKY

FreshAirTonight commented 1 year ago

Thank you for reporting this bug. It was caused by a renamed variable that affects the MSA search against the UniRef30 library. I have pushed a fix; please give it a try.
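
For readers hitting the same traceback, here is a minimal sketch of this class of bug. It is not the actual af2complex code: `ToyPipeline`, `run_and_report`, and the constructor-side attribute name `hhblits_bfd_uniclust_runner` are assumptions made for illustration; only the missing name `hhblits_bfd_uniref_runner` comes from the traceback above. It also shows why the error appears only after the earlier search stages have finished: Python resolves attribute names when the line executes, not when the module is loaded.

```python
class ToyPipeline:
    """Toy stand-in for a data pipeline; every name here is hypothetical."""

    def __init__(self):
        # Assumed constructor-side name: the attribute is stored under one name ...
        self.hhblits_bfd_uniclust_runner = object()

    def process(self, steps_done):
        # Earlier stages run normally and may take a long time in practice.
        steps_done.append("jackhmmer")
        steps_done.append("hmmsearch")
        # ... but this call site still uses a different name, so the lookup
        # fails only once execution reaches this line.
        return self.hhblits_bfd_uniref_runner


def run_and_report():
    """Run the toy pipeline; return the completed steps and the error text."""
    steps = []
    try:
        ToyPipeline().process(steps)
    except AttributeError as err:
        return steps, str(err)
    return steps, None


if __name__ == "__main__":
    steps, message = run_and_report()
    print("completed steps:", steps)
    print("error:", message)
```

Using the same attribute name in the constructor and at the call site removes the error.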

guzmanfj commented 1 year ago

It seems to work now; it produced the features.pkl file. Thank you for your help!
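
A quick way to sanity-check the output is to load the pickle and list what it contains. The sketch below assumes that features.pkl is an uncompressed pickled dictionary mapping feature names to arrays. The helper name `summarize_features` is hypothetical and not part of af2complex. Only unpickle files you generated yourself, because loading a pickle can execute arbitrary code.

```python
import pickle


def summarize_features(pkl_path):
    """Return {feature name: shape tuple, or type name if there is no shape}."""
    with open(pkl_path, "rb") as handle:
        features = pickle.load(handle)
    summary = {}
    for name, value in features.items():
        shape = getattr(value, "shape", None)
        summary[name] = tuple(shape) if shape is not None else type(value).__name__
    return summary


if __name__ == "__main__":
    import sys

    # Usage: python summarize_features.py path/to/features.pkl
    for key, info in sorted(summarize_features(sys.argv[1]).items()):
        print(key, info)
```

An empty dictionary, or a file that fails to load, would suggest that feature generation did not complete.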