MetazoaPhylogenomicsLab / FANTASIA

GNU General Public License v3.0

Input not a valid protein fasta file #8

Open KatharinaHoff opened 2 hours ago

KatharinaHoff commented 2 hours ago

Hi!

Great work with FANTASIA! I have been testing it in various scenarios. When applying it to metagenomic eukaryotic data, I rather often get incomplete gene predictions (tools such as AUGUSTUS may also produce these in single-species genomes). If a gene is incomplete at the 5'-end, the protein may start with an X, because the first 1 or 2 nucleotides at the beginning of the sequence may not translate. In this case, FANTASIA dies with the error message:

ERROR: Input file is not a protein FASTA file.

The input is a protein FASTA file. I suggest changing the error message, because it took me a while to dig out the examples that led to the failure. Others may also get stuck on this.

Here is an example that leads to failure (input sequence):

>seqA
XKLAGIDKKLSKSLDQEVVTMMAESPGALSVSPVGPLTDAGSRKTLIYLILTLNHMYPDY
DFSALRGHHFTKENAVPASANFAGALPHIVRVKNDVDGLLLESGKAYEATVGAGAEPLSS
ELWRAIDAVINLVDCDVYTYKAVAEGDPFCDDGNLWSFNYFFYNRKLKRILYFSCRAVSK
TADEESDFEEEFDADADARAMDDSFLADGMEMDDEMY*

If I remove the very first X, then the pipeline runs.
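Until the check is changed, the leading X could be stripped before running the pipeline. The following is a minimal Python sketch of that workaround, not part of FANTASIA; the function name `strip_leading_x` is hypothetical. It keeps the existing line layout and only trims the start of each record, which leaves the first sequence line slightly shorter than the rest. Whether FANTASIA's format check accepts that in every case is an assumption.

```python
def strip_leading_x(lines):
    """Yield FASTA lines with leading 'X' residues removed from each record.

    Hypothetical helper: only the start of each sequence is trimmed,
    all other lines are passed through unchanged.
    """
    at_start = False  # True until the first non-X residue of a record is seen
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            at_start = True
            yield line
        elif at_start:
            trimmed = line.lstrip("Xx")
            if trimmed:
                at_start = False
                yield trimmed
            # A line consisting only of X is dropped; stripping
            # continues on the next line of the same record.
        else:
            yield line
```

It could be used as `open("fixed.faa", "w").write("\n".join(strip_leading_x(open("input.faa"))) + "\n")`, where both file names are placeholders.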

Best wishes,

Katharina

KatharinaHoff commented 2 hours ago

Additional issue (I ran into this when removing the proteins that start with an X with awk in the first attempt):

If a protein sequence does not have a line break after every 80 characters (it may not be exactly 80, but I assume that is roughly the threshold), the error message also says that the file is not in protein FASTA format.
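If line wrapping is indeed what the check looks at, re-wrapping the sequences before running the pipeline may help. Below is a minimal Python sketch, assuming a fixed line width is all that is needed. The function name `wrap_fasta` is hypothetical, and the default width of 60 is taken from the example above, which is wrapped at 60 characters; the width FANTASIA actually expects is not known here. Sequences shorter than the chosen width will still end up on a single line.

```python
def wrap_fasta(fasta_text, width=60):
    """Return FASTA text with every sequence re-wrapped to `width` characters.

    Hypothetical helper; `width=60` is an assumption, not a documented
    FANTASIA requirement.
    """
    out = []   # output lines
    seq = []   # sequence chunks of the record currently being read

    def flush():
        # Join the collected chunks and emit them in fixed-width slices.
        joined = "".join(seq)
        out.extend(joined[i:i + width] for i in range(0, len(joined), width))
        seq.clear()

    for raw in fasta_text.splitlines():
        line = raw.strip()
        if line.startswith(">"):
            flush()            # finish the previous record, if any
            out.append(line)
        elif line:
            seq.append(line)
    flush()                    # finish the last record
    return "\n".join(out) + "\n"
```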

KatharinaHoff commented 2 hours ago

The same problem occurs with short proteins. If a protein sequence does not stretch across at least 2 lines (i.e. it must be longer than 80 characters to pass), the pipeline also dies.
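To find the affected records in a larger file, the cases reported in this thread can be checked up front. The following Python sketch is only an illustration of that idea and does not reproduce FANTASIA's actual validation; the function name `flag_suspect_records` and the reason strings are hypothetical. It flags sequences that start with X and sequences that occupy a single line, which covers both the unwrapped and the short-protein case without hardcoding the guessed 80-character threshold.

```python
def flag_suspect_records(fasta_text):
    """Map record ID -> list of reasons it matches a failure case from this thread.

    Hypothetical diagnostic; records with duplicate IDs overwrite each other.
    """
    records = {}    # record ID -> list of sequence lines
    current = None
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)

    flagged = {}
    for rec_id, seq_lines in records.items():
        reasons = []
        if seq_lines and seq_lines[0][0] in "Xx":
            reasons.append("starts with X")
        if len(seq_lines) == 1:
            reasons.append("sequence on a single line")
        if reasons:
            flagged[rec_id] = reasons
    return flagged
```

Printing the returned dictionary for an input file gives a list of record IDs to inspect or fix before running the pipeline.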