gbouras13 / phold

Phage Annotation using Protein Structures
MIT License

ValueError during Processing Foldseek output #42

Open acvill opened 1 month ago

acvill commented 1 month ago

Thanks for making and maintaining pharokka and phold!

Description

I'm running pharokka -> phold on a set of 853 complete (single-contig) phage genomes. phold gives a per_cds_predictions.tsv file for all phage except one: Pseudomonas phage PIP. After rerunning a few times, this does not appear to be a memory issue. Perhaps an interesting edge case?
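With hundreds of genomes, a quick way to spot which runs failed is to scan the phold output directories for the missing predictions file. The snippet below is only a sketch: it assumes one sub-directory per genome (as in the commands that follow) and that the predictions file name ends in per_cds_predictions.tsv. The helper name find_missing_outputs is hypothetical and not part of phold.

```python
from pathlib import Path


def find_missing_outputs(phold_root: Path,
                         pattern: str = "*per_cds_predictions.tsv") -> list[str]:
    """List the sample directories under phold_root with no predictions file.

    Assumes one sub-directory per genome; the glob pattern is an assumption
    about how the output file is named.
    """
    missing = []
    for sample_dir in sorted(p for p in phold_root.iterdir() if p.is_dir()):
        # any() stops at the first match, so large directories stay cheap.
        if not any(sample_dir.glob(pattern)):
            missing.append(sample_dir.name)
    return missing
```

Calling find_missing_outputs(Path("annotations/phold")) would return the names of the genome directories that need a closer look.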

What I Did

module load miniconda/24.3.0
conda activate /home/acv38/project/conda_envs/pharokka
mkdir -p /home/acv38/palmer_scratch/psa_promoters/annotations/pharokka/OR687155.1
pharokka.py -i /home/acv38/project/databases/pseudomonas_phage_16Apr2024/fna/OR687155.1.fna -o /home/acv38/palmer_scratch/psa_promoters/annotations/pharokka/OR687155.1 -f -p OR687155.1 -d /home/acv38/project/databases/pharokka/pharokka_v1.4.0_databases -t 8 -g phanotate
conda deactivate
conda activate /home/acv38/project/conda_envs/phold
mkdir -p /home/acv38/palmer_scratch/psa_promoters/annotations/phold/OR687155.1
phold run -i /home/acv38/palmer_scratch/psa_promoters/annotations/pharokka/OR687155.1/OR687155.1.gbk -o /home/acv38/palmer_scratch/psa_promoters/annotations/phold/OR687155.1 -f -p OR687155.1 -d /home/acv38/project/databases/phold/phold_structure_foldseek_db -t 8 --cpu
conda deactivate

Please find the original fasta file, the pharokka gbk file, my conda yml files, and all the relevant log files at this Dropbox link:

https://www.dropbox.com/scl/fo/29tp5my2me718pr6wwb5c/AH8ggBgegenkhEhSvlZGdSs?rlkey=ivmz8w6zkdr7i3v89013ocwth&st=fcppkjqi&dl=0

Error traceback

2024-06-17 11:05:32.017 | INFO     | phold.results.topfunction:get_topfunctions:35 - Processing Foldseek output
Traceback (most recent call last):
  File "/home/acv38/project/conda_envs/phold/bin/phold", line 10, in <module>
    sys.exit(main())
  File "/gpfs/gibbs/project/turner/acv38/conda_envs/phold/lib/python3.10/site-packages/phold/__init__.py", line 1355, in main
    main_cli()
  File "/gpfs/gibbs/project/turner/acv38/conda_envs/phold/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/gpfs/gibbs/project/turner/acv38/conda_envs/phold/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/gpfs/gibbs/project/turner/acv38/conda_envs/phold/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/gpfs/gibbs/project/turner/acv38/conda_envs/phold/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/gpfs/gibbs/project/turner/acv38/conda_envs/phold/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/gpfs/gibbs/project/turner/acv38/conda_envs/phold/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/gpfs/gibbs/project/turner/acv38/conda_envs/phold/lib/python3.10/site-packages/phold/__init__.py", line 298, in run
    subcommand_compare(
  File "/gpfs/gibbs/project/turner/acv38/conda_envs/phold/lib/python3.10/site-packages/phold/subcommands/compare.py", line 372, in subcommand_compare
    filtered_topfunctions_df, weighted_bitscore_df = get_topfunctions(
  File "/gpfs/gibbs/project/turner/acv38/conda_envs/phold/lib/python3.10/site-packages/phold/results/topfunction.py", line 58, in get_topfunctions
    foldseek_df[["contig_id", "cds_id"]] = foldseek_df["query"].str.split(
  File "/gpfs/gibbs/project/turner/acv38/conda_envs/phold/lib/python3.10/site-packages/pandas/core/frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "/gpfs/gibbs/project/turner/acv38/conda_envs/phold/lib/python3.10/site-packages/pandas/core/frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/gpfs/gibbs/project/turner/acv38/conda_envs/phold/lib/python3.10/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

gbouras13 commented 3 weeks ago

Hi @acvill,

This is a very weird phage! Essentially, the error occurs because Phold (via Foldseek) found 0 hits, so the results table is empty when the query column is split into contig_id and cds_id. My suspicion is that this is due to phanotate providing crappy gene calls. There were 179 CDS in your GenBank file, whereas prodigal (run via prokka) found only 90 in the paper (https://journals.asm.org/doi/10.1128/spectrum.03719-23). I will look into this separately for other reasons, because that is bizarre!
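To check a CDS count like the one quoted above, the features in a GenBank file can be tallied with the standard library alone. This is a rough sketch that assumes the usual GenBank feature-table layout (feature keys indented by exactly five spaces inside the FEATURES block). count_cds is a hypothetical helper, not a pharokka or phold function.

```python
def count_cds(genbank_text: str) -> int:
    """Count CDS feature lines in GenBank-formatted text.

    Assumes feature keys are indented by exactly five spaces, while
    qualifier lines are indented further and are therefore skipped.
    """
    in_features = False
    n_cds = 0
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
        elif line.startswith("ORIGIN") or line.startswith("//"):
            in_features = False
        elif in_features and line.startswith("     ") and line[5:6].strip():
            # A feature key line: five spaces, then a non-space character.
            if line.split()[0] == "CDS":
                n_cds += 1
    return n_cds
```

For a file on disk, count_cds(Path("OR687155.1.gbk").read_text()) (with pathlib.Path imported) would give the number to compare against the published annotation.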

In the dev branch, I have added a line of code to warn the user and exit if Foldseek finds 0 hits, an unlikely occurrence but nonetheless possible.
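For illustration, a guard of this kind could look roughly like the sketch below. It is not the code in the dev branch. The function name split_query_ids, the NoFoldseekHitsError exception, and the ":" separator with n=1 are all assumptions. Only the query, contig_id, and cds_id column names and the use of str.split come from the traceback.

```python
import pandas as pd


class NoFoldseekHitsError(RuntimeError):
    """Raised when the Foldseek results table has no rows (hypothetical name)."""


def split_query_ids(foldseek_df: pd.DataFrame, sep: str = ":") -> pd.DataFrame:
    """Return a copy with 'contig_id' and 'cds_id' split out of 'query'.

    The separator is an assumption; adjust it to the real ID format.
    """
    # Guard first: with zero rows there is nothing to split into two columns.
    if foldseek_df.empty:
        raise NoFoldseekHitsError("Foldseek found 0 hits; nothing to process.")

    parts = foldseek_df["query"].str.split(sep, n=1, expand=True)
    # Also catch IDs that lack the separator (only one column would result).
    if parts.shape[1] != 2:
        raise ValueError(f"Could not split 'query' on {sep!r} into two parts")

    out = foldseek_df.copy()
    out["contig_id"] = parts[0]
    out["cds_id"] = parts[1]
    return out
```

A caller could catch NoFoldseekHitsError, log a warning, and exit cleanly instead of surfacing the pandas traceback.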

George

acvill commented 2 weeks ago

Thanks for looking into this, @gbouras13! A weird phage indeed...