Taxonomy assignment - GitHub issues

davidvilanova commented 7 years ago

Hi, I have the following DNA sequence that cannot be properly assigned with Kaiju. NCBI BLAST returns Staphylococcus epidermidis. I used the Kaiju web server with default parameters and the Nr+eu database. I submitted only this sequence.

Thanks,

>k141_447_50 # 42411 # 43247 # -1 # ID=423_50;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.323
ATGGTGAATTATAAAGAGAAGTTTGCAGAAGCCAAGACAATTGCCGTAAATGAGGGGTTTGAATCAACTC
GTGCCGAATGGTTGTTTTTAGATGTTTTTGGTTGGTCGAAAACAGATTATTTAATTCATAAAGATGAGCA
AATGTCTTTGACATCAATTAACAAATTGGATAAAGCGTTGGATAGAATGATCACAGGAGAACCTATTCAA
TACATTGTTGGATTTCAGTCTTTTTATGGTTATCAATATAAAGTGAATCAACACTGTCTTATACCAAGGC
CTGAAACCGAGGAAGTTATGTTGCATTTTTTAGAATTGTGTAAAAAGACTGATACCATAGCAGATATTGG
AACTGGAAGTGGTGCTATAGCAATTACGCTTAAGTTACTGCAACCTGAATTAAATGTTATTGCAACAGAT
TTGTATGAAGATGCTTTAAATGTAGCTAAGCAAAATGCTAGTCATTATCACCAAAATATTCAGTTTTTGC
GTGGAAATGCTTTAAAACCGCTAATTGAAAATGATATAAAATTGGATGGGCTGATATCTAATCCACCATA
CATAGGCCATAGTGAAATAATAGATATGGAGTCAACAGTACTAAATTATGAGCCACATCATGCTCTATTT
GCTGAGAAAAACGGATTTGCTATTTATGAGTCAATATTAGAAGATTTACCATTTGTAATGAAACAAGGTG
GACATGTTGTTTTTGAAATAGGTTATAGTCAAGGAGATATCTTAAAAAGAATGATTCAAGATTTATATCC
TGAAAAAGAAGTAGAGATTTTCAAAGATATCAATGGAAATCAGCGTATTATATCTATTATTTGGTAG

pmenzel commented 7 years ago

Hi, what exactly do you mean by "properly assign"? I just tried the sequence on the web server and it is assigned to the Terrabacteria group.

In case you are wondering why Terrabacteria: If you blastx the sequence to NR on the NCBI server and expand the entry of the first hit (MULTISPECIES: protein-(glutamine-N5) methyltransferase, release factor-specific [Staphylococcus]), then you will see that this sequence is mostly assigned to Staphylococcus but also to Mycobacterium. The LCA of those is Terrabacteria group.
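To make the LCA step concrete, here is a minimal sketch in Python. It is not Kaiju's implementation; the function name and the lineages are assumptions made for illustration, and the lineages are simplified by leaving out the intermediate ranks (class, order, family).

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages.

    Each lineage is a list of taxon names ordered from the root downwards.
    This is an illustrative helper, not part of Kaiju.
    """
    if not lineages:
        raise ValueError("need at least one lineage")
    lca = None
    # Walk all lineages in parallel, one depth at a time.
    for taxa in zip(*lineages):
        if all(taxon == taxa[0] for taxon in taxa):
            lca = taxa[0]  # every lineage still agrees at this depth
        else:
            break  # first disagreement: stop descending
    return lca


# Simplified lineages (intermediate ranks omitted for brevity).
STAPH = ["Bacteria", "Terrabacteria group", "Firmicutes", "Staphylococcus"]
MYCO = ["Bacteria", "Terrabacteria group", "Actinobacteria", "Mycobacterium"]

print(lowest_common_ancestor([STAPH, MYCO]))   # Terrabacteria group
print(lowest_common_ancestor([STAPH, STAPH]))  # Staphylococcus
```

A single Mycobacterium entry is enough to move the result up to the Terrabacteria group, regardless of how many Staphylococcus entries share the same protein.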

davidvilanova commented 7 years ago

Thanks, Dr. Menzel. The first hit of the NCBI blastx result shows 99% Staph and 1% Mycobacteria. I think the LCA is too stringent and therefore leaves too many sequences unresolved at the genus or species level. For this one I would assign Staph, so is there any way to tune the LCA algorithm to be less stringent? My main concern is that I get a lot of unknown or unclassified sequences, and I think it's because the algorithm is too stringent.

pmenzel commented 7 years ago

Hi,

I can understand what you mean, but there is no simple answer to that problem.

In this particular case, the Mycobacterium annotation could just as well be wrong, and the sequence should really be assigned only to Staph.

However, in a general sense, your observation reflects the issue of composition bias in the reference databases, which are heavily skewed towards certain genera, e.g. human pathogens and easily culturable species (see also Suppl. Fig. 4 in the Kaiju paper). Therefore, I would not treat that gene differently just because more copies belonging to Staph are in the database. If tomorrow somebody suddenly added another 100 Mycobacterium genomes containing that gene to NCBI, then what would you do? Assign the sequence again to Terrabacteria or still to Staph?

So by using the LCA, one is at least on the safe side, even though the read is assigned to a higher taxonomic level. If you want to bring it down to lower levels, using blastn will probably often help to find a more specific nucleotide sequence.

Btw, if a sequence is unclassified in Kaiju, then it has no match to the database whatsoever given the chosen thresholds (length / score), independent of the LCA calculation.

davidvilanova commented 7 years ago

Peter, I think the issue is that the databases will grow over time, especially in the metagenomics field. Eventually you could end up assigning reads to the Bacteria group, which is not the intended purpose of the tool. I think the LCA approach is good; however, it should incorporate some sort of fine-tuning parameter (maybe abundance) that would help clean up nodes that might not be relevant (or at least let you know which ones you have removed).
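One way to picture such a tuning parameter is a support threshold on the LCA. The sketch below is hypothetical and is not a Kaiju option: `thresholded_lca`, `min_fraction`, the simplified lineages, and the use of the 99/1 split from the blastx hit as weights are all assumptions made for illustration.

```python
from collections import Counter


def thresholded_lca(hits, min_fraction=1.0):
    """Descend the taxonomy while one taxon keeps >= min_fraction of all hits.

    hits: list of (lineage, count) pairs, lineage ordered from the root down.
    min_fraction=1.0 reproduces the strict LCA; lower values tolerate a
    minority of disagreeing hits. Hypothetical helper, not a Kaiju feature.
    """
    total = sum(count for _, count in hits)
    if total <= 0:
        raise ValueError("need at least one hit")
    assigned = None
    active = list(hits)
    depth = 0
    while True:
        # Count the support for each taxon at the current depth.
        votes = Counter()
        for lineage, count in active:
            if depth < len(lineage):
                votes[lineage[depth]] += count
        if not votes:
            break  # reached the end of the remaining lineages
        taxon, support = votes.most_common(1)[0]
        if support / total < min_fraction:
            break  # no taxon has enough support to go deeper
        assigned = taxon
        # Follow only the lineages that pass through the chosen taxon.
        active = [(lin, c) for lin, c in active
                  if depth < len(lin) and lin[depth] == taxon]
        depth += 1
    return assigned


# Simplified lineages (intermediate ranks omitted for brevity).
STAPH = ["Bacteria", "Terrabacteria group", "Firmicutes", "Staphylococcus"]
MYCO = ["Bacteria", "Terrabacteria group", "Actinobacteria", "Mycobacterium"]

today = [(STAPH, 99), (MYCO, 1)]
print(thresholded_lca(today, min_fraction=1.0))  # Terrabacteria group
print(thresholded_lca(today, min_fraction=0.9))  # Staphylococcus

# The scenario raised earlier in the thread: 100 more Mycobacterium entries.
later = [(STAPH, 99), (MYCO, 101)]
print(thresholded_lca(later, min_fraction=0.9))  # Terrabacteria group
```

The last call shows why such a threshold remains sensitive to database composition: the same read changes its assignment once the reference set changes.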

pmenzel commented 7 years ago

Yes, I agree. Once the reference database is fully resolved, most reads will be assigned to high levels in the tree (if still using short reads) and only a few reads will be species-specific. But in that scenario, you would no longer need protein sequence comparison and could use nucleotide sequences instead.

bioinformatics-centre / kaiju

Taxonomy assignment #46