pharokka protein crashed after completing mmseqs searches

luisalbertoc95 commented 11 months ago

pharokka version:1.4 & 1.5.1
Python version: Python 3.10.8
Operating System: Rocky Linux 8.7 (Green Obsidian)

Description

Hi @gbouras13, when I run pharokka_proteins.py on a set of 755001 ORFs, I get an error caused by a mismatch between the number of keys and the number of columns in a pandas DataFrame assignment. According to the log file, all mmseqs searches completed.

Thank you!

What I Did

Command run: 

pharokka_proteins.py -i ${WD}/out.CAT.predicted_proteins.faa  \
-o ${WD}/pharokka_prot_out_assembly_1Kb_NoPhablesContigs_PhablesresolvedGenomes \
-d /ref/sahlab/data/viral_analysis_DBs/pharokka1.5_DBs \
-t 24 \
-e 1E-03 \
--force

Traceback: 
2023-10-31 21:26:34.164 | INFO     | post_processing:process_vfdb_results:2134 - Processing VFDB output.
2023-10-31 21:26:35.099 | INFO     | post_processing:process_vfdb_results:2197 - 46 VFDB virulence factors identified.
Traceback (most recent call last):
  File "/ref/sahlab/software/anaconda3/envs/pharokka1.5_env/bin/pharokka_proteins.py", line 213, in <module>
    main()
  File "/ref/sahlab/software/anaconda3/envs/pharokka1.5_env/bin/pharokka_proteins.py", line 172, in main
    pharok.process_dataframes()
  File "/ref/sahlab/software/anaconda3/envs/pharokka1.5_env/bin/proteins.py", line 526, in process_dataframes
    (tophits_df, vfdb_results) = process_vfdb_results(self.out_dir, tophits_df)
  File "/ref/sahlab/software/anaconda3/envs/pharokka1.5_env/bin/post_processing.py", line 2198, in process_vfdb_results
    merged_df[["genbank", "desc_tmp", "vfdb_species"]] = merged_df[
  File "/ref/sahlab/software/anaconda3/envs/pharokka1.5_env/lib/python3.10/site-packages/pandas/core/frame.py", line 4082, in __setitem__
    self._setitem_array(key, value)
  File "/ref/sahlab/software/anaconda3/envs/pharokka1.5_env/lib/python3.10/site-packages/pandas/core/frame.py", line 4124, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/ref/sahlab/software/anaconda3/envs/pharokka1.5_env/lib/python3.10/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
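
For readers hitting the same traceback: pandas raises this ValueError when the right-hand side of a multi-column assignment does not have exactly as many columns as the list of keys on the left. The sketch below reproduces that with made-up strings and a hypothetical ";" separator. It is not pharokka's actual parsing code; the column name "hit" and both helper functions are illustrative assumptions.

```python
import pandas as pd

# The three target columns named in the traceback.
KEYS = ["genbank", "desc_tmp", "vfdb_species"]


def split_strict(df, column="hit", sep=";"):
    """Split a string column and assign the pieces to three new columns.

    Hypothetical helper: raises ValueError if the split does not
    produce exactly len(KEYS) columns.
    """
    out = df.copy()
    out[KEYS] = out[column].str.split(sep, expand=True)
    return out


def split_defensive(df, column="hit", sep=";"):
    """Same idea, but always hand pandas exactly three columns.

    n=2 caps the number of splits (at most three pieces), and reindex
    pads with NaN columns if every row had fewer than three pieces.
    """
    out = df.copy()
    parts = out[column].str.split(sep, n=2, expand=True)
    parts = parts.reindex(columns=range(len(KEYS)))
    out[KEYS] = parts
    return out


good = pd.DataFrame({"hit": ["a;b;c", "d;e;f"]})
bad = pd.DataFrame({"hit": ["a;b;c", "d;e;f;g"]})  # one extra separator

split_strict(good)  # three pieces per row, so the assignment works

try:
    split_strict(bad)  # four columns assigned to three keys
except ValueError as err:
    print("strict split failed:", err)

print(split_defensive(bad))
```

Capping the split keeps any extra separator inside the last field instead of creating a fourth column, which is one way a single unusual record can be tolerated rather than aborting the whole run.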

pharokka_proteins_1698789518.5425682.log

gbouras13 commented 11 months ago

Hi @luisalbertoc95 ,

Thanks for reporting this bug and using Pharokka! I see you're using Phables too :)

I'm pretty sure this has to do with the VFDB naming (it's annoying :) ).

Would you be able to do a few things:

I'd upgrade to 1.5.1 regardless (that log is from v1.4.0).
Re-run this with --hmm_only. It should work to get all the PHROG annotations, but it will skip CARD and VFDB steps. So do that if you're in a hurry.
I'm sure you want the CARD and VFDB steps too, so would you be able to send me the VFDB output? In particular vfdb_results.tsv. George.bouras@adelaide.edu.au (it should be small enough to email or attach here). I'm pretty sure it's because one of the VFDB outputs has a strange character and if so I will implement a fix soon once I can replicate the error.

George
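
Before sending a results file, it can help to look for the kind of odd record suspected above. The following is a minimal sketch, assuming the file is plain tab-separated text; the function name and the two checks (unusual field count, non-ASCII characters) are illustrative assumptions and are not part of pharokka.

```python
from collections import Counter


def find_odd_rows(lines, sep="\t"):
    """Return (row_number, reasons) for rows that look unusual.

    Row numbers count non-blank rows only, starting at 1. A row is
    flagged if its field count differs from the most common field
    count, or if it contains non-ASCII characters.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if not rows:
        return []
    counts = Counter(row.count(sep) + 1 for row in rows)
    expected = counts.most_common(1)[0][0]
    odd = []
    for number, row in enumerate(rows, start=1):
        reasons = []
        if row.count(sep) + 1 != expected:
            reasons.append("field count")
        if not row.isascii():
            reasons.append("non-ASCII")
        if reasons:
            odd.append((number, reasons))
    return odd


example = ["a\tb\tc\n", "d\te\tf\n", "g\th\n", "i\tj\tk\u00e9\n"]
for number, reasons in find_odd_rows(example):
    print(number, ", ".join(reasons))
```

To run it on a real file, pass an open file handle, for example `find_odd_rows(open("vfdb_results.tsv", encoding="utf-8"))`.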

luisalbertoc95 commented 11 months ago

Hi George,

Thanks a lot for your suggestions. Running the code with --hmm_only worked! I'll send the vfdb_results.tsv to you.

Thank you,

Luis

gbouras13 commented 8 months ago

Hi @luisalbertoc95 ,

It took a while but I solved this error - it was a bug in pharokka to do with matching VFDB and other outputs.

If you re-run pharokka now it should work (though it seems you were happy enough with --hmm_only, so maybe you've moved on).

George

ebueren commented 8 months ago

Hello! I'm running pharokka 1.6.1 (fresh env and database install) and still receiving the same error (below). Running in --fast mode fixes the problem, so it seems to be related to the VFDB/CARD databases.

Pharokka version: 1.6.1
Python version: 3.10.8
OS: Linux, 3.10.0

Command: pharokka.py -i file.fna -f -o test.out -d /x/x/x/pharokka_db/ -t 32 -m -g prodigal --skip_mash


2024-01-22 20:59:20.921 | INFO     | __main__:main:379 - Post Processing Output.
2024-01-22 20:59:23.455 | INFO     | post_processing:create_mmseqs_tophits:2104 - Processing MMseqs2 outputs.
2024-01-22 20:59:23.455 | INFO     | post_processing:create_mmseqs_tophits:2105 - Processing PHROGs output.
2024-01-22 20:59:30.113 | INFO     | post_processing:process_vfdb_results:2309 - Processing VFDB output.
2024-01-22 20:59:30.149 | INFO     | post_processing:process_vfdb_results:2368 - 17 VFDB virulence factors identified.
Traceback (most recent call last):
  File "/home/ebueren/miniconda3/envs/pharokka1.6/bin/pharokka.py", line 499, in <module>
    main()
  File "/home/ebueren/miniconda3/envs/pharokka1.6/bin/pharokka.py", line 418, in main
    pharok.process_results()
  File "/home/ebueren/miniconda3/envs/pharokka1.6/bin/post_processing.py", line 356, in process_results
    (merged_df, vfdb_results) = process_vfdb_results(
  File "/home/ebueren/miniconda3/envs/pharokka1.6/bin/post_processing.py", line 2369, in process_vfdb_results
    merged_df[["genbank", "desc_tmp", "vfdb_species"]] = merged_df[
  File "/home/ebueren/miniconda3/envs/pharokka1.6/lib/python3.10/site-packages/pandas/core/frame.py", line 4287, in __setitem__
    self._setitem_array(key, value)
  File "/home/ebueren/miniconda3/envs/pharokka1.6/lib/python3.10/site-packages/pandas/core/frame.py", line 4329, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/home/ebueren/miniconda3/envs/pharokka1.6/lib/python3.10/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

fluhus commented 4 months ago

Hi, I am having this issue as well on a fresh mamba+pharokka (1.7.1) install.

pharokka.py -i vir.fa -o vir.prk -d ~/data/pharokka

Same error. Adding --hmm_only or --fast did not help. Happy to provide additional information that could help debug this!

gbouras13 commented 4 months ago

Hi @fluhus ,

How big is your input? Is it very small? I have a feeling this error may be because MMseqs2 found no hits at all. I'll try to replicate it later this week and put in a fix if so.

george

fluhus commented 4 months ago

Thanks for the quick response!

Here is the input file (111K unzipped):

vir.fa.gz

gbouras13 commented 4 months ago

Hi @fluhus,

I have narrowed down your error to the '#' in the header. If you remove this it will work. I'll put in a bug fix at some point :)

George

fluhus commented 4 months ago

Thanks for looking into this! I removed the # signs from the names and now it runs :)
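
For anyone needing the same workaround before a fix lands, here is a minimal sketch of stripping '#' from FASTA header lines while leaving sequence lines untouched. The function names and file paths are hypothetical, and this is a standalone script, not a pharokka feature.

```python
def sanitize_fasta_headers(lines, bad_char="#", replacement=""):
    """Yield FASTA lines with bad_char removed from header lines only."""
    for line in lines:
        if line.startswith(">"):
            yield line.replace(bad_char, replacement)
        else:
            yield line


def sanitize_fasta_file(in_path, out_path, bad_char="#", replacement=""):
    """Write a cleaned copy of in_path to out_path (paths are up to you)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(sanitize_fasta_headers(src, bad_char, replacement))


demo = [">contig#1 len=8\n", "ACGTACGT\n", ">contig#2\n", "TTGA\n"]
print("".join(sanitize_fasta_headers(demo)), end="")
```

Removing a character can make two headers identical (for example "a#1" and "a1"), so it is worth checking that sequence IDs are still unique afterwards, or passing a replacement such as "_" instead of deleting.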

gbouras13 / pharokka

pharokka protein crashed after completing mmseqs searches #300
