gbouras13 / phold

Phage Annotation using Protein Structures
MIT License

ProstT5 issue: iteration over a 0-d array #47

Closed by vdruelle 2 months ago

vdruelle commented 2 months ago

Description

I'm trying to generate annotations for a couple of phages from the BASEL collection (like this one from NCBI: https://www.ncbi.nlm.nih.gov/nuccore/2071745857) to test the performance of the tool. I first generate a GenBank file using Pharokka, which seems to work fine: the tool completes the job and I obtain the pharokka.gbk file in the output folder.

I then try to run phold on this file with the command:

phold run -i output/EM60_pharokka/pharokka.gbk -o output/EM60_phold -t 8 -f --cpu

I'm running the CPU version since my GPU doesn't have enough memory for the GPU version.

The tool starts but eventually fails at the ProstT5 prediction step. I'm pasting the terminal output below. I tried to figure out what the problem was, but my attempts were inconclusive.

Do you have an idea where the issue comes from? Thanks and have a great day!

2024-07-02 14:35:52.978 | INFO     | phold.utils.validation:instantiate_dirs:70 - Checking the output directory output/EM60_phold
2024-07-02 14:35:52.979 | INFO     | phold.utils.validation:instantiate_dirs:76 - --force was specified even though the output directory does not already exist. Continuing

.______    __    __    ______    __       _______  
|   _  \  |  |  |  |  /  __  \  |  |     |       \ 
|  |_)  | |  |__|  | |  |  |  | |  |     |  .--.  |
|   ___/  |   __   | |  |  |  | |  |     |  |  |  |
|  |      |  |  |  | |  `--'  | |  `----.|  '--'  |
| _|      |__|  |__|  \______/  |_______||_______/ 

2024-07-02 14:35:52.988 | INFO     | phold.utils.util:begin_phold:72 - phold: annotating phage genomes with protein structures
2024-07-02 14:35:52.988 | INFO     | phold.utils.util:begin_phold:74 - You are using phold version 0.1.4
2024-07-02 14:35:52.988 | INFO     | phold.utils.util:begin_phold:75 - Repository homepage is https://github.com/gbouras13/phold
2024-07-02 14:35:52.988 | INFO     | phold.utils.util:begin_phold:76 - You are running phold run
2024-07-02 14:35:52.989 | INFO     | phold.utils.util:begin_phold:77 - Listing parameters
2024-07-02 14:35:52.989 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --input output/EM60_pharokka/pharokka.gbk
2024-07-02 14:35:52.989 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --output output/EM60_phold
2024-07-02 14:35:52.989 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --threads 8
2024-07-02 14:35:52.989 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --force True
2024-07-02 14:35:52.989 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --prefix phold
2024-07-02 14:35:52.989 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --evalue 0.001
2024-07-02 14:35:52.989 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --database None
2024-07-02 14:35:52.989 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --batch_size 1
2024-07-02 14:35:52.989 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --sensitivity 9.5
2024-07-02 14:35:52.990 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --keep_tmp_files False
2024-07-02 14:35:52.990 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --cpu True
2024-07-02 14:35:52.990 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --omit_probs False
2024-07-02 14:35:52.990 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --finetune False
2024-07-02 14:35:52.990 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --finetune_path None
2024-07-02 14:35:52.990 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --split False
2024-07-02 14:35:52.990 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --split_threshold 60.0
2024-07-02 14:35:52.990 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --card_vfdb_evalue 1e-10
2024-07-02 14:35:52.990 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --separate False
2024-07-02 14:35:52.990 | INFO     | phold.utils.util:begin_phold:79 - Parameter: --max_seqs 1000
2024-07-02 14:35:52.993 | INFO     | phold.utils.validation:check_dependencies:117 - Foldseek version found is v8.ef4e960
2024-07-02 14:35:52.994 | INFO     | phold.utils.validation:check_dependencies:126 - Foldseek version is ok
2024-07-02 14:35:52.994 | INFO     | phold.databases.db:validate_db:234 - Checking Phold database installation in /home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/phold/database
2024-07-02 14:35:52.994 | INFO     | phold.databases.db:validate_db:237 - All Phold databases files are present
2024-07-02 14:35:52.995 | INFO     | phold.io.handle_genbank:get_genbank:57 - Checking if input output/EM60_pharokka/pharokka.gbk is a Genbank file
2024-07-02 14:35:53.007 | INFO     | phold.utils.validation:validate_input:50 - Successfully parsed input output/EM60_pharokka/pharokka.gbk as a Genbank format file
2024-07-02 14:35:53.008 | INFO     | phold.features.predict_3Di:get_T5_model:121 - Using device: cpu
2024-07-02 14:35:53.008 | INFO     | phold.features.predict_3Di:get_T5_model:127 - Loading T5 from: /home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/phold/database/Rostlab/ProstT5_fp16
2024-07-02 14:35:53.009 | INFO     | phold.features.predict_3Di:get_T5_model:128 - If /home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/phold/database/Rostlab/ProstT5_fp16 is not found, it will be downloaded
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-02 14:35:54.396 | INFO     | phold.features.predict_3Di:get_T5_model:138 - Rostlab/ProstT5_fp16 loaded
2024-07-02 14:35:54.403 | INFO     | phold.features.predict_3Di:get_embeddings:362 - Beginning ProstT5 predictions
2024-07-02 14:35:54.403 | INFO     | phold.features.predict_3Di:get_embeddings:369 - Using models in full-precision
2024-07-02 14:40:08.983 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0058 MZ501093.1 prediction has length 0
2024-07-02 14:43:19.564 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0166 MZ501093.1 prediction has length 0
2024-07-02 14:43:21.652 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0174 MZ501093.1 prediction has length 0
2024-07-02 14:43:28.210 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0192 MZ501093.1 prediction has length 0
2024-07-02 14:43:31.288 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0200 MZ501093.1 prediction has length 0
2024-07-02 14:43:36.685 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0215 MZ501093.1 prediction has length 0
2024-07-02 14:43:39.533 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0221 MZ501093.1 prediction has length 0
Traceback (most recent call last):
  File "/home/valentin/miniconda3/envs/pholdENV/bin/phold", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/phold/__init__.py", line 1355, in main
    main_cli()
  File "/home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/phold/__init__.py", line 281, in run
    subcommand_predict(
  File "/home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/phold/subcommands/predict.py", line 125, in subcommand_predict
    prediction_success = get_embeddings(
                         ^^^^^^^^^^^^^^^
  File "/home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/phold/features/predict_3Di.py", line 526, in get_embeddings
    write_predictions(predictions, output_3di, proteins_flag)
  File "/home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/phold/features/predict_3Di.py", line 206, in write_predictions
    [
  File "/home/valentin/miniconda3/envs/pholdENV/lib/python3.11/site-packages/phold/features/predict_3Di.py", line 210, in <listcomp>
    list(map(lambda yhat: ss_mapping[int(yhat)], yhats))
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: iteration over a 0-d array

gbouras13 commented 2 months ago

Hi @vdruelle,

The error seems to be caused by the warning lines:

2024-07-02 14:40:08.983 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0058 MZ501093.1 prediction has length 0
2024-07-02 14:43:19.564 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0166 MZ501093.1 prediction has length 0
2024-07-02 14:43:21.652 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0174 MZ501093.1 prediction has length 0
2024-07-02 14:43:28.210 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0192 MZ501093.1 prediction has length 0
2024-07-02 14:43:31.288 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0200 MZ501093.1 prediction has length 0
2024-07-02 14:43:36.685 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0215 MZ501093.1 prediction has length 0
2024-07-02 14:43:39.533 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0221 MZ501093.1 prediction has length 0

which I would assume then causes the 0-d array issue.

When Phold tries to write the 3Di sequences, it tries to iterate over the failed (length 0) prediction for each of these proteins, which is what raises the TypeError at the end of your traceback.

I'll put in some fix for the next version.
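
For reference, the exception itself can be reproduced with NumPy alone. The sketch below is only an illustration and not phold's actual code: `ss_mapping` is a made-up stand-in for phold's 3Di lookup table, and `to_3di` and `to_3di_safe` are hypothetical helper names. It mirrors the `list(map(lambda yhat: ss_mapping[int(yhat)], yhats))` call from the traceback and shows that a normal 1-d array and an empty 1-d array both iterate fine, whereas a 0-d (scalar-shaped) array raises the TypeError. It does not attempt to reproduce how phold ends up with such an array.

```python
import numpy as np

# Hypothetical stand-in for phold's 3Di lookup table (index -> letter).
ss_mapping = {i: c for i, c in enumerate("ACDEFGHIKLMNPQRSTVWY")}


def to_3di(yhats):
    """Map class indices to letters, mirroring the call shown in the traceback."""
    return "".join(map(lambda yhat: ss_mapping[int(yhat)], yhats))


def to_3di_safe(yhats):
    """Hypothetical guard: promote 0-d input to 1-d before iterating."""
    arr = np.atleast_1d(np.asarray(yhats))
    return "".join(ss_mapping[int(y)] for y in arr)


normal = np.array([0, 1, 2])     # shape (3,)
empty = np.array([], dtype=int)  # shape (0,): iterable, yields nothing
scalar = np.array(3)             # shape (): a 0-d array

print(to_3di(normal))
print(repr(to_3di(empty)))

try:
    to_3di(scalar)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

print(to_3di_safe(scalar))
```

`np.atleast_1d` only avoids the crash. Whether a protein with a failed prediction should be skipped or written out as a placeholder is a separate design decision for the fix.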

As for a practical workaround: it'll take a while before the next release, and even with the fix you'd miss potential annotations for these proteins. The embedding failure that is the root cause of this error is probably due to your hardware. Therefore, I'd recommend:

  1. Rerunning the command - I feel like this embedding error is a bit random and might disappear next time you run it; or
  2. Getting access to a computer with a bigger GPU (or a beefier CPU); or
  3. Using the Colab notebook (especially if you have a small number of phages), which will give you a GPU with enough memory: https://colab.research.google.com/github/gbouras13/phold/blob/main/run_pharokka_and_phold_and_phynteny.ipynb; or
  4. Sending me the GenBank file(s) so I can run it for you, shouldn't be a massive deal :) george.bouras@adelaide.edu.au
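
Whichever option you pick, it may help to know exactly which proteins failed, so you can check them after a rerun. Here is a minimal sketch, assuming the terminal output has been saved as text; the function name `failed_predictions` and the regular expression are illustrative and are not part of phold. It pulls the CDS ID, contig ID and reported length out of the "prediction has length" warning lines in the format shown above.

```python
import re

# Matches warning lines of the form shown in the log above:
# "... | WARNING  | module:function:line - <CDS id> <contig id> prediction has length <n>"
WARNING_RE = re.compile(
    r"\|\s*WARNING\s*\|.*? - (\S+) (\S+) prediction has length (\d+)"
)


def failed_predictions(log_text):
    """Return (cds_id, contig_id, length) for each matching warning line."""
    hits = []
    for line in log_text.splitlines():
        match = WARNING_RE.search(line)
        if match:
            hits.append((match.group(1), match.group(2), int(match.group(3))))
    return hits


# Two warning lines and one INFO line copied from the log in this issue.
sample_log = """\
2024-07-02 14:35:54.403 | INFO     | phold.features.predict_3Di:get_embeddings:369 - Using models in full-precision
2024-07-02 14:40:08.983 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0058 MZ501093.1 prediction has length 0
2024-07-02 14:43:19.564 | WARNING  | phold.features.predict_3Di:get_embeddings:493 - BUDPYTUS_CDS_0166 MZ501093.1 prediction has length 0
"""

failed = failed_predictions(sample_log)
for cds_id, contig_id, length in failed:
    print(cds_id, contig_id, length)
```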

George

vdruelle commented 2 months ago

Hi @gbouras13,

Thanks a lot for the answer and the suggestions for fixing this problem. I'll give them a try in the coming days.

Have a great day!

Valentin