jianshu93 closed this issue 1 year ago
Hi @jianshu93,
this happens when you have a genome that contains regions with unknown nucleotides (stretches of `N`) and Prodigal/Pyrodigal finds a start codon on one side and a stop codon on the other. In this case the gene sequence cannot be resolved, so the amino acids emitted are just `X`.
If you don't want genes to cross unknown regions, there is an option in Prodigal (`prodigal -m`) and in Pyrodigal (`pyrodigal.OrfFinder(mask=True)`).
Thanks! It is so rare that I never paid attention to this problem. I have only 2 cases out of 300 thousand genomes. I will always mask unknown regions. I feel this should be the default option. Why do we care if it is unknown? Yes, we know there is a gene there, but a run of NNN will not tell us anything else, right?
Thanks,
Jianshu
I agree! I also ended up having issues in downstream HMMER analysis because of this: it could cause a very long gene (>100,000 aa) to be predicted and crash the HMMER pipeline.
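If masking was not enabled when the genes were predicted, one workaround is to screen the protein FASTA before passing it to HMMER. The following is only a minimal sketch using the standard library: the helper names (`read_fasta`, `is_suspicious`, `filter_proteins`) and both thresholds are assumptions made up for illustration, not part of Prodigal, Pyrodigal or HMMER.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_suspicious(seq, max_len=100_000, max_x_fraction=0.5):
    """Flag proteins that are empty, very long, or mostly unknown residues (X).

    Both thresholds are arbitrary assumptions; tune them for your data.
    """
    seq = seq.rstrip("*")  # ignore a trailing stop symbol, if present
    if not seq:
        return True
    x_fraction = seq.upper().count("X") / len(seq)
    return len(seq) > max_len or x_fraction > max_x_fraction


def filter_proteins(lines, **kwargs):
    """Return the (header, sequence) records that are not flagged."""
    return [(h, s) for h, s in read_fasta(lines) if not is_suspicious(s, **kwargs)]
```

For a file on disk, something like `filter_proteins(open("proteins.faa"))` would work, with the file name being a placeholder. Note that this only hides the symptom; masking at prediction time avoids producing these genes in the first place.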
Note also that even in `mask=True` mode, there is a minimum number of bases that must be unknown for Prodigal/Pyrodigal to stop predicting across them, so if your sequence is `AAAAANAAA` you will still get a protein sequence of `KXK`. I really see no reason not to keep region masking enabled at all times :smile:
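To see where the `X` residues come from, here is a small standalone sketch of codon translation. It is not the Prodigal/Pyrodigal implementation; it simply assumes the standard genetic code (NCBI translation table 1) and that any codon containing a non-ACGT base is emitted as `X`.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 0; unresolvable codons become X."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # drop a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # assumption: any ambiguous codon -> X
        for i in range(0, usable, 3)
    )


print(translate("AAAAANAAA"))        # KXK
print(translate("ATGNNNNNNNNNTAA"))  # MXXX*
```

The second call mimics the situation from this issue on a tiny scale: a start codon, a stretch of `N`, then a stop codon, giving a "gene" made almost entirely of `X`.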
Dear Martin,
In some very rare cases, I get the following amino acid output from Prodigal:
null_strange_aa_seq_XXX.zip
I attached the genomes.
Thanks,
Jianshu