eggnogdb / eggnog-mapper

Fast genome-wide functional annotation through orthology assignment
http://eggnog-mapper.embl.de
GNU Affero General Public License v3.0

Error: annotation went wrong for pfam alignment in parallel #482

Closed zhaoc1 closed 12 months ago

zhaoc1 commented 1 year ago

Hi,

Here is my emapper command:

emapper.py -i centroids.ffn --itype CDS -m diamond --sensmode more-sensitive --data_dir eggnog_data  --cpu 10 --output batch_18 --override --dbmem --pfam_realign realign --temp_dir temp/ --output_dir output

I got an error message like the following:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/czhao/miniconda3/envs/geneannot/lib/python3.11/site-packages/eggnogmapper/annotation/pfam/pfam_scan.py", line 24, in pfam_align_parallel_scan
    for alignments in pool.imap(query_pfam_annotate_scan,
  File "/opt/czhao/miniconda3/envs/geneannot/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
ValueError: Error parsing fasta file. GUT_GENOME287165_01908 has no sequence

I double-checked that the input centroids.ffn is valid FASTA and that GUT_GENOME287165_01908 does have a sequence in it. So the error message seems to indicate that the sequence for GUT_GENOME287165_01908 goes missing at some intermediate step. Any idea what's going on here? Thank you!

Chunyu

Cantalapiedra commented 12 months ago

Hi @zhaoc1 ,

I am not able to reproduce the error. Could you share your FASTA file with me by email or link, if it isn't very large?

Best, Carlos

zhaoc1 commented 12 months ago

Hi Carlos,

Thanks for the reply. The original input centroids.ffn was multiple FASTA files concatenated into one. I tried rerunning eggnog-mapper on the individual FASTA files instead, and I no longer encountered the error 🤔

Anyway, I am closing this issue now.

Chunyu

Cantalapiedra commented 12 months ago

Hi @zhaoc1 ,

Thank you very much for your feedback.

Best, Carlos

zhaoc1 commented 12 months ago

Actually, I located the problematic FASTA input (attached). I also attached the GNU Time log file.

Cantalapiedra commented 12 months ago

My guess is that this sequence is not detected as CDS, because it doesn't start with ATG:

GUT_GENOME287165_01908 TGA...

So the CDS will be empty. You could try using a different translation table, or translating the sequences to proteins yourself and using --itype proteins. I am not sure if UGA is a start codon anywhere...
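To find such records before running emapper.py, a pre-screen along the following lines could help. This is only a sketch under assumptions: the thread does not show how eggnog-mapper actually decides whether a record is a CDS, so the "must start with ATG" rule below simply mirrors the guess above, and `read_fasta` and `find_suspect_cds` are hypothetical helper names, not part of eggnog-mapper.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier, dropping any description after whitespace.
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_suspect_cds(
    lines: Iterable[str], start_codons: Tuple[str, ...] = ("ATG",)
) -> List[Tuple[str, str]]:
    """Return (record_id, reason) for records that are empty or lack a start codon.

    Assumption: a record is "suspect" if its first three bases are not in
    start_codons. This is a heuristic, not eggnog-mapper's own logic.
    """
    suspects = []
    for name, seq in read_fasta(lines):
        seq = seq.upper()
        if not seq:
            suspects.append((name, "empty sequence"))
        elif seq[:3] not in start_codons:
            suspects.append((name, f"starts with {seq[:3]}"))
    return suspects
```

Passing an open file handle for centroids.ffn to `find_suspect_cds` would list the records worth inspecting. Bacterial genes also commonly use alternative start codons such as GTG and TTG, so passing `start_codons=("ATG", "GTG", "TTG")` may reduce false positives.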

Best, Carlos

zhaoc1 commented 12 months ago

That makes sense. Looking back at the Prokka annotation of "GUT_GENOME287165_01908", it is "23S ribosomal RNA (partial)". Thanks, Carlos.

Cantalapiedra commented 12 months ago

Ah, if it is an rRNA, that makes sense. We should add code that checks whether the CDS are non-empty. Sorry for the inconvenience. Best, Carlos
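A non-empty check of the kind mentioned above might look roughly like this. It is a minimal sketch, not eggnog-mapper's code: `drop_empty_cds` is a hypothetical name, and it assumes the extracted CDS are available as (name, sequence) pairs before they are written out for the Pfam realignment step.

```python
import logging
from typing import Iterable, Iterator, Tuple

logger = logging.getLogger(__name__)


def drop_empty_cds(records: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield only records whose sequence is non-empty, logging the rest.

    Hypothetical guard: records with an empty or whitespace-only sequence
    are skipped with a warning instead of failing later during parsing.
    """
    for name, seq in records:
        if seq and seq.strip():
            yield name, seq
        else:
            logger.warning("Skipping %s: no CDS sequence", name)
```

Skipping with a warning keeps the rest of the batch running; raising a `ValueError` that names the offending record and the likely cause would be the stricter alternative.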