IDTAXA is not assigning sequences to Kingdom level

Andreas-Bio commented 4 years ago

Okay, so this might be because my database is a bit noisy, but I am a plant person and I have no choice. I am using this database: http://its2.bioapps.biozentrum.uni-wuerzburg.de/

I am getting a lot of assignments like this: [1] "Root [57.2%]; unclassified_Root [57.2%]"

And I am getting a lot of assignments even to kingdom level that do not meet the default confidence level of 60: "Root [52.3%]; Plantae [52.3%]; Magnoliopsida [51.9%]; Rubiaceae [47.9%]; Sherardia [45.8%]; arvensis [45.8%]"
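For readers who want to post-process output strings like the two above, here is a minimal sketch in plain Python. It is not part of DECIPHER; the function names are made up for illustration, and it assumes the `Taxon [xx.x%]; ...` format shown in this comment. It keeps only the leading ranks whose confidence meets a chosen threshold.

```python
import re

def parse_assignment(line):
    """Split 'Root [52.3%]; Plantae [52.3%]' into (taxon, confidence) pairs."""
    pairs = []
    for part in line.split(";"):
        match = re.fullmatch(r"\s*(.+?)\s*\[([\d.]+)%\]\s*", part)
        if match:
            pairs.append((match.group(1), float(match.group(2))))
    return pairs

def ranks_above(line, threshold=60.0):
    """Keep the leading ranks whose confidence is at least `threshold`."""
    kept = []
    for taxon, confidence in parse_assignment(line):
        if confidence < threshold:
            break  # confidence only drops further down the lineage
        kept.append(taxon)
    return kept

example = ("Root [52.3%]; Plantae [52.3%]; Magnoliopsida [51.9%]; "
           "Rubiaceae [47.9%]; Sherardia [45.8%]; arvensis [45.8%]")
print(ranks_above(example))        # no rank reaches 60
print(ranks_above(example, 50.0))  # Root, Plantae, Magnoliopsida
```

With the default threshold of 60 the second example keeps no rank at all, which is the symptom described here.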

When I check the sequences in BLAST I get a clear assignment, at least to family level. My guess is that some sequences are not correctly annotated (e.g. plant sequences assigned to fungi) and are not removed by the IDTAXA learning phase. I tried to debug the function that assigns taxonomy but it's too difficult for me. I followed the tutorial. I tried both options for allowGroupRemoval.

Here is my trained Database (use load() ): https://easyupload.io/pce2tv

> trainingSet
  A training set of class 'Taxa'
   * K-mer size: 7
   * Number of rank levels: 7
   * Total number of sequences: 215043
   * Number of taxonomic groups: 90592
   * Number of problem groups: 1
   * Number of problem sequences: 40

> head(groups)
[1] "Root;Fungi;Eurotiomycetes;Trichocomaceae;Aspergillus;nidulans"       
[2] "Root;Fungi;Sordariomycetes;Glomerellaceae;Colletotrichum;capsici"    
[3] "Root;Fungi;Sordariomycetes;Glomerellaceae;Colletotrichum;caudatum"   
[4] "Root;Fungi;Sordariomycetes;Glomerellaceae;Colletotrichum;capsici"    
[5] "Root;Fungi;Sordariomycetes;Glomerellaceae;Colletotrichum;fuscum"     
[6] "Root;Fungi;Sordariomycetes;Glomerellaceae;Colletotrichum;graminicola"

Here are the two examples from above:

GCAGGATCCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCGAGGGGCATGCCTGTTCGAGCGTCATTACACCACTCAAGCTATGCTTGGTATTGGGCGTCGTCCTTAGTTGGGCGCGCCTTAAAGACCTCGGCGAGGCCACTCCGGCTTTAGGCGTAGTAGAATTTATTCGAACGTCTGTCAAAGGAGAGGAACTCTGCCGACTGAAACCTTTATTTTTCTAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAGCATA

CCGTGAATCATCGAGTTTTTGAACGCAAGTTGCGCCCGAGGCCACCCGGCCGAGGGCACGTCTGCCTGGGCGTCACGCATCCCGTCGTCGCACCAAGTCTTGCTTGGCGCGGCGGAAGTTGGCCTCCCGTTCCCCCCGCGGCGCGGTTGGCCCAAATGCGAGTCCCCGGACAAGGGACGTCACGACTTCAGGTGGTTGAAATCACTTTCATTCTCGCTCCGAGTCTTGACGATCCCCCTGTGGTTATATAACGACCCTAGAGCCTTCACCGCGCTCACTGGGTTGAGTTCAACGAGATCCTCGAACGCGACCCCGGGTCAGGCGGGATCACCCACTAAGTTTAA

Thank you @digitalwright for allowing me to ping you here. Best wishes, Andreas

digitalwright commented 4 years ago

Thanks for your interest in IDTAXA. The link to your training set did not work for me, but I appreciated you providing lots of information.

I don't know what you mean by "in BLAST I get a clear assignment." Using the BLAST tool on the database you provided results in hits but these hits have low coverage. Nevertheless, it is generally unclear how to directly convert a BLAST hit into a taxonomic assignment.

It looks like the issue is that your sequences are longer than the reference sequences. That is, they extend beyond the region included in the ITS2 database. My guess is that if you trimmed your query sequences to the reference region, the confidences would increase. In particular, the final ~40 nucleotides appear to be missing from the reference sequences.

IDTAXA makes the assumption that the reference sequences are generally full-length, but the query sequences do not need to be full-length. That is, the information in the query should be fully overlapping with the information in the training set.
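To make the overlap requirement concrete, here is a deliberately crude sketch in plain Python of trimming a query to the region between two flanking motifs. The function name and the motifs are hypothetical placeholders; exact string matching is an assumption made for brevity, and real flanking regions would need a tolerant search (profile-based tools exist for that).

```python
def trim_to_region(query, left_flank, right_flank):
    """Return the part of `query` between the two flanking motifs.

    If a motif is not found, that side is left untrimmed.
    """
    start = 0
    end = len(query)
    left_pos = query.find(left_flank)
    if left_pos != -1:
        start = left_pos + len(left_flank)
    right_pos = query.rfind(right_flank)
    if right_pos != -1 and right_pos >= start:
        end = right_pos
    return query[start:end]

# Toy sequence only; the motifs are placeholders, not real rDNA boundaries.
toy_query = "AAACCC" + "GATTACAGATTACA" + "TTTGGG"
print(trim_to_region(toy_query, "AAACCC", "TTTGGG"))  # GATTACAGATTACA
```

After trimming, every part of the query lies inside the region the reference sequences cover, which is the assumption stated above.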

I hope that helps.

Andreas-Bio commented 4 years ago

Sorry, updated the link.

Ohhh that makes sense! Thanks, I wasn't aware of that. However, that assumption is a bit inconvenient. The ITS2 database is trimmed exactly to ITS2, but all ITS2 primers always pick up a bit of the flanking sequences. The problem is that the flanking sequences can only be removed with ITSx by Bengtsson-Palme, which only runs on Linux. I was trying to build a pipeline that runs on all operating systems. I am on a Windows machine myself and was hoping to get away with it.

I will try to trim and post an update. I am wondering why this hasn't been posted before, as it affects almost all plant and fungi people alike.

Andreas-Bio commented 4 years ago

Of course that was correct. I have to trim the ITS2 sequences. I am very confused why I didn't read that the query must not be longer than the reference sequences in the database. Maybe I missed it? I did go through the paper and the tutorial PDF.

I also tried SINTAX and it is not affected as much by the flanking sequences (ITS2 flanking sequences are 5.8S rDNA and 26S rDNA), but its identification scores take a hit nonetheless (especially at species level).

One thing I am missing in SINTAX and IDTAXA is a way to show the next best hits. There are many cases in plants where barcodes are shared between species, and it would be helpful to get, say, the top 20 hits. That would make it possible to at least identify the species complex instead of falling back to genus level automatically. (All species at the top of the list that share the same probability must have the same set of k-mers, plus or minus the bootstrap variation.)
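A rough illustration of the "next best hits" idea in plain Python. This is not how SINTAX or IDTAXA score sequences; it simply ranks references by the fraction of query k-mers they contain, so that references tied at the top become visible. All names and the toy data are made up.

```python
def kmers(seq, k=7):
    """Set of all overlapping k-mers in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def top_hits(query, references, k=7, n=20):
    """Rank reference names by the fraction of query k-mers they contain."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return []  # query shorter than k
    scored = []
    for name, seq in references.items():
        shared = len(query_kmers & kmers(seq, k))
        scored.append((shared / len(query_kmers), name))
    # Highest score first; ties are ordered by name so they sit together.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:n]

# Toy references: sp_A and sp_B share an identical barcode.
toy_refs = {"sp_A": "ACGTACGTAA", "sp_B": "ACGTACGTAA", "sp_C": "TTTTTTTTTT"}
for score, name in top_hits("ACGTACGT", toy_refs, k=3):
    print(f"{score:.2f}\t{name}")
```

In the toy data the two species with identical barcodes tie at the top, which is the species-complex case described above.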

SINTAX (without trimming):

spec.n	spec.s	gen.n	gen.s	fam.n	fam.s	class.n	class.s	king.n	king.s	seq_id	seq
Arabis_verna	0.8464	Arabis	0.92	Brassicaceae	1	Magnoliopsida	1	Plantae	1	76	CCGTGAACCATCGAGTTTTTGAACGCAAGTTGCGCCTGAAGCCATTAGGCAGAGGGCACGTCTGCCTGGGTGTCACACATCGTTGCCCCAACACCAAATGCCTCGTGCTGCTTGGTGTGTCCGGCGAATGATGACATCCCGTGAGCCCCGCCTCACGGTTTGTTGAAAATTGAGTCCATGGCAGGGTATTCCATGGTGGATGGTGGTTGGGCAATGCTCGAGACCATTCGTGGAAGCTTTATCGTGGCTGGGCTCTGGTATCCCACGTGCGTCGAAATACGCTCACAATGAGACCTCAGGTCAGGCGGGGCTACCCGCTGAGTTTAA
Anthyllis_circinnata	1	Anthyllis	1	Fabaceae	1	Magnoliopsida	1	Plantae	1	76	CCGTGAACCATCGAGTTTTTGAACGCAAGTTGCGCCTGAAGCCATTAGGCAGAGGGCACGTCTGCCTGGGTGTCACACATCGTTGCCCCAACACCAAATGCCTCGTGCTGCTTGGTGTGTCCGGCGAATGATGACATCCCGTGAGCCCCGCCTCACGGTTTGTTGAAAATTGAGTCCATGGCAGGGTATTCCATGGTGGATGGTGGTTGGGCAATGCTCGAGACCATTCGTGGAAGCTTTATCGTGGCTGGGCTCTGGTATCCCACGTGCGTCGAAATACGCTCACAATGAGACCTCAGGTCAGGCGGGGCTACCCGCTGAGTTTAA
Tordylium_apulum	0.7396	Tordylium	0.86	Apiaceae	1	Magnoliopsida	1	Plantae	1	52757	CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCTGAGGGCACGTCTGCCTGGGTGTCACGCTTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGCCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTCGATTGTGACCCCAGGTCAGGCGGGACTACCCGCTGAGTTTAA
Tordylium_apulum	0.6724	Tordylium	0.82	Apiaceae	1	Magnoliopsida	1	Plantae	1	929	CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCTGAGGGCACGTCTGCCTGGGTGTCACGCTTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGCCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTGCACCACATTGACTGCTCTTCGATTGTGACCCCAGGTCAGGCGGGACTACCCGCTGAGTTTAA
Tordylium_apulum	0.6724	Tordylium	0.82	Apiaceae	1	Magnoliopsida	1	Plantae	1	855	CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCTGAGGGCACGTCTGCCTGGGTGTCACGCTTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGTCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTCGATTGTGACCCCAGGTCAGGCGGGACTACCCGCTGAGTTTAA
Alternaria_tenuissima	0.1932	Alternaria	0.5855	Pleosporaceae	0.6889	Dothideomycetes	0.801	Fungi	0.9	299	GATCCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCTTTGGTATTCCAAAGGGCATGCCTGTTCGAGCGTCATTTGTACCCTCAAGCTTTGCTTGGTGTTGGGCGTCTTGTCTCTAGCTTTGCTGGAGACTCGCCTTAAAGTAATTGGCAGCCGGCCTACTGGTTTCGGAGCGCAGCACAAGTCGCACTCTCTATCAGCAAAGGTCTAGCATCCATTAAGCCTTTTTTTCAACTTTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAG
Schoenus_nigricans	1	Schoenus	1	Cyperaceae	1	Liliopsida	1	Plantae	1	1538	CCGCGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAGGGATCCGCCCGAGGGCACGCCTGCCTCATGGGCGTTAGAAGCCCATCCACGCTCGGGAGCCTAGCTACTTGGCCAGCCCCGATGCGGATCGTGGCCCTCCGAGCCCTAGGGCGCGGTGGGCCCAAGTGCGCGGCCGTCCGAAGGAGCCGGGAGCGGCGAGTGGTGGAATGCTGCGCGCGCCGTCCCGGGACCCCTGCCGGCATATGGCTTTGTCCGACCCTCGACGAGGAGCCGCGTCGCCTTCGAAAGGAGTGCGGCATTCTCAGATCGATACCCCAGGTCAGGCGGGGCTACCCGCTGAGTTTAA
Solanum_citrullifolium	0.1263	Solanum	0.308	Solanaceae	0.56	Magnoliopsida	1	Plantae	1	79	CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCGTCAGGCCGAGGGCACGTCTGCCTGGGCGTCACGCATCGCGTCGCCCCCCGCACGCCGCTCGGCGTCGCGGGGGCGGATACTGGCCCCCCGTGCGCCCCCCGCGCGCGGCCGGCCTAAATGCGAGCCCGCGCCGACGGACGTCGCGGCGATTGGTGGTTGTATCTCAACTCTCTTCGCGCCGCGGCCGCAGCCCGTCGTGCGTGCGCGCTCCCCGACCCTCAAAGCGCCTCGCGCGCTCCGACCGCGACCCCAGGTCAGGCGGGATTACCCGCTGAGTTTAA
Nemania_serpens	0.6069	Nemania	0.7225	Xylariaceae	0.85	Sordariomycetes	1	Fungi	1	70	CAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCACTAGTATTCTGGTGGGCATGCCTGTTCGAGCGTCATTTCAACCCTTAAGCCCCTGTTGCTTAGCGTTAGGAGCCTACCGGAACTCTCTGGTAGCTCCCCAAAGTCAGTGGCGGAGCCGGTTCGCACTCCAGACGTAGTAGCTTTTACACGTCGCCTGTAGCGCGGGCCGGTCCCCTGCCGTAAAACACCCCAATTTTTATAGGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAA
Rosa_hybrid	0.71	Rosa	1	Rosaceae	1	Magnoliopsida	1	Plantae	1	366	CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCCGAGGGCACGTCTGCCTGGGCGTCACACGTCGTTGCCCCCCCCCAACCCCCTCGGGAGTTGGATGGGACGGATGATGGCCTCCCGTGTGCTCAGTCACGCGGTTGGCATAAATACCAAGTCCTCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTTTCCTGTCGTGCGCGCGTGTTGATCGAGTGCTTTCTTAAACAATGCGTGTCGATTCGTCGATGCTTTCAACGCGACCCCAGGTCAGGCGGGGTTACCCGCTGAATTTAA
Rosa_hybrid	0.78	Rosa	1	Rosaceae	1	Magnoliopsida	1	Plantae	1	85	CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCCGAGGGCACGTCTGCCTGGGCGTCACACGTCGTTGCCCCCCCCCAACCCCCTCGGGAGTTGGATGGGACGGATGATGGCCTCCCGTGTGCTCAGTCACGCGGTTGGCATAAGTACCAAGTCCTCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTTTCCTGTCGTGCGCGCGTGTTGATCGAGTGCTTTCTTAAACAATGCGTGTCGATCCGTCGATGCTTACAACGCGACCCCAGGTCAGGCGGGGTTACCCGCTGAATTTAA

SINTAX (with trimming):

spec.n	spec.s	gen.n	gen.s	fam.n	fam.s	class.n	class.s	king.n	king.s	seq_id	seq
Arabis_verna	1	Arabis	1	Brassicaceae	1	Magnoliopsida	1	Plantae	1	76	AACGTCGTCCCCATCCTTTTCGGAGAAGGGACGGAAGCTGGTCTCCCGTGTGTTACCGCATGCGGTTGGCTAAAATCCGAGCTGAGGATGCCTTGAGCGTCTCGACATGCGGTGGTGAAATAAAGCCTCGTAATACTGTCGGTCGCTTTTGTCTGAATGCTCTTGATGACCCAACATCCTTAACGCGACCCCAGGTCAGGCGGGATCAC
Schoenus_nigricans	1	Schoenus	1	Cyperaceae	1	Liliopsida	1	Plantae	1	1538	GCCCATCCACGCTCGGGAGCCTAGCTACTTGGCCAGCCCCGATGCGGATCGTGGCCCTCCGAGCCCTAGGGCGCGGTGGGCCCAAGTGCGCGGCCGTCCGAAGGAGCCGGGAGCGGCGAGTGGTGGAATGCTGCGCGCGCCGTCCCGGGACCCCTGCCGGCATATGGCTTTGTCCGACCCTCGACGAGGAGCCGCGTCGCCTTCGAAAGGAGTGCGGCATTCTCAGA
Tordylium_apulum	1	Tordylium	1	Apiaceae	1	Magnoliopsida	1	Plantae	1	855	TTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGTCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTCGA
Tordylium_apulum	1	Tordylium	1	Apiaceae	1	Magnoliopsida	1	Plantae	1	929	TTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGCCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTGCACCACATTGACTGCTCTTCGA
Tordylium_apulum	1	Tordylium	1	Apiaceae	1	Magnoliopsida	1	Plantae	1	52757	TTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGCCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTCGA
Alternaria_tenuissima	0.2137	Alternaria	0.7122	Pleosporaceae	0.8186	Dothideomycetes	0.9409	Fungi	0.97	299	GTACCCTCAAGCTTTGCTTGGTGTTGGGCGTCTTGTCTCTAGCTTTGCTGGAGACTCGCCTTAAAGTAATTGGCAGCCGGCCTACTGGTTTCGGAGCGCAGCACAAGTCGCACTCTCTATCAGCAAAGGTCTAGCATCCATTAAGCCTTTTTTTCAAC
Solanum_citrullifolium	0.87	Solanum	1	Solanaceae	1	Magnoliopsida	1	Plantae	1	79	ATCGCGTCGCCCCCCGCACGCCGCTCGGCGTCGCGGGGGCGGATACTGGCCCCCCGTGCGCCCCCCGCGCGCGGCCGGCCTAAATGCGAGCCCGCGCCGACGGACGTCGCGGCGATTGGTGGTTGTATCTCAACTCTCTTCGCGCCGCGGCCGCAGCCCGTCGTGCGTGCGCGCTCCCCGACCCTCAAAGCGCCTCGCGCGCTCCGA
Rosa_hybrid	0.2	Rosa	1	Rosaceae	1	Magnoliopsida	1	Plantae	1	366	GTCGTTGCCCCCCCCCAACCCCCTCGGGAGTTGGATGGGACGGATGATGGCCTCCCGTGTGCTCAGTCACGCGGTTGGCATAAATACCAAGTCCTCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTTTCCTGTCGTGCGCGCGTGTTGATCGAGTGCTTTCTTAAACAATGCGTGTCGATTCGTCGATGCTTTCA
Nemania_serpens	0.512	Nemania	0.64	Xylariaceae	0.8	Sordariomycetes	1	Fungi	1	70	CAACCCTTAAGCCCCTGTTGCTTAGCGTTAGGAGCCTACCGGAACTCTCTGGTAGCTCCCCAAAGTCAGTGGCGGAGCCGGTTCGCACTCCAGACGTAGTAGCTTTTACACGTCGCCTGTAGCGCGGGCCGGTCCCCTGCCGTAAAACACCCCAATTTTTATA
Rosa_hybrid	0.4	Rosa	1	Rosaceae	1	Magnoliopsida	1	Plantae	1	85	GTCGTTGCCCCCCCCCAACCCCCTCGGGAGTTGGATGGGACGGATGATGGCCTCCCGTGTGCTCAGTCACGCGGTTGGCATAAGTACCAAGTCCTCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTTTCCTGTCGTGCGCGCGTGTTGATCGAGTGCTTTCTTAAACAATGCGTGTCGATCCGTCGATGCTTACA


IDTAXA (with trimming):

100.0%	Plantae	Magnoliopsida	Apiaceae	Tordylium	apulum
86.6%	Plantae	Liliopsida	Cyperaceae	Schoenus	nigricans
76.0%	Plantae	Magnoliopsida	Brassicaceae	Arabis	verna
61.5%	Plantae	Magnoliopsida	Solanaceae	Solanum	citrullifolium
95.0%	Plantae	Magnoliopsida	Apiaceae	Tordylium	apulum
100.0%	Plantae	Magnoliopsida	Apiaceae	Tordylium	apulum
86.1%	Fungi	Dothideomycetes	Pleosporaceae	Alternaria	unclassified_Alternaria
97.0%	Plantae	Magnoliopsida	Rosaceae	Rosa	unclassified_Rosa
93.3%	Plantae	Magnoliopsida	Rosaceae	Rosa	unclassified_Rosa
54.7%	Fungi	Sordariomycetes	Xylariaceae	Nemania	serpens

PS: I don't know how SINTAX does it, but it is possible to do a leave-one-out cross-validation with some undocumented commands: sintax_its2_bootstrap_distr_fam_per_spec_new.pdf. Just ignore the triangles. The red dots are incorrectly assigned sequences, and each dot is one sequence (drawn with alpha because of overlap). The y-axis shows the confidence level that gets reported to the user. So basically every red dot above 0.8 is bad. That makes it easy to check how messy your analysis gets if you start to accept lower confidence scores as "truth". It would be great to have that for IDTAXA.
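The "red dots above 0.8" check can be expressed as a small count. This is a minimal sketch in plain Python with made-up numbers, assuming the validation results are available as pairs of reported confidence and whether the assignment was correct.

```python
def errors_above(results, threshold=0.8):
    """Count wrong assignments among those reported at or above `threshold`.

    `results` is an iterable of (confidence, is_correct) pairs.
    Returns (number_wrong, number_accepted).
    """
    accepted = [ok for conf, ok in results if conf >= threshold]
    wrong = sum(1 for ok in accepted if not ok)
    return wrong, len(accepted)

# Made-up validation results, for illustration only.
toy = [(0.95, True), (0.85, False), (0.60, False), (0.90, True)]
wrong, total = errors_above(toy)
print(f"{wrong} of {total} accepted assignments are wrong")
```

Lowering the threshold accepts more sequences and shows directly how many additional errors come with them.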

digitalwright commented 4 years ago

The way SINTAX and the RDP Classifier work does not require trimming of the sequences. IDTAXA does because it incorporates distance into its confidence. I will add that information to the documentation in the next release. Thank you for the suggestion.

It is not possible to do quick leave-one-out validation with IDTAXA because of the way the algorithm is constructed. Again, this is different than SINTAX and the RDP Classifier. With IDTAXA you have to remove each sequence, retrain with LearnTaxa(), and test with IdTaxa(). This is because the learning step is non-trivial with IDTAXA and there is no way to simply remove a reference sequence after the fact. However, you could set up a straightforward pipeline to perform relatively quick k-fold cross-validation with all algorithms. This is arguably more realistic than leave-one-out cross-validation anyhow.
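A k-fold pipeline of this kind could be outlined as follows. This is a language-neutral sketch in plain Python, not DECIPHER code: `train_fn` and `classify_fn` are hypothetical placeholders for whatever performs the retraining and classification steps (for IDTAXA, the LearnTaxa() and IdTaxa() calls mentioned above).

```python
import random

def k_fold_indices(n_sequences, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    order = list(range(n_sequences))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(sequences, labels, train_fn, classify_fn, k=5):
    """Retrain once per fold and return the fraction classified correctly.

    train_fn(seqs, labels) -> model and classify_fn(model, seq) -> label
    are supplied by the caller; they are placeholders, not a real API.
    """
    correct = 0
    for train, test in k_fold_indices(len(sequences), k):
        model = train_fn([sequences[i] for i in train],
                         [labels[i] for i in train])
        for i in test:
            if classify_fn(model, sequences[i]) == labels[i]:
                correct += 1
    return correct / len(sequences)
```

Each sequence is tested exactly once, against a model that never saw it, at the cost of k retraining runs rather than one per sequence.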

Andreas-Bio commented 4 years ago

Thank you for your feedback! I cannot put my finger on it but now that I trimmed all sequences down to ITS2 there is still some unwanted behaviour.

tldr: ITS has pretty big genetic distances. If the species of the query sequence is missing from the database, the next best match within the genus sometimes has up to 15% genetic distance. This has biological reasons. It seems to confuse IDTAXA into concluding that nothing in the database closely resembles the query. As a result, the confidence at kingdom level drops below 50 in quite a lot of cases, making the classification worthless for downstream analysis. Is it possible that IDTAXA was built with 16S in mind, which has much lower genetic distances? In that case it would benefit from an exploratory phase in which the algorithm samples some genetic distances to estimate at which distance one would expect a sequence to belong to another family. I understand you need to make some assumptions for this, but otherwise I cannot see it performing well in plants or with other markers that have high genetic variation.
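For clarity on what "raw genetic distance" means here: a minimal sketch in plain Python, assuming the two sequences are already aligned to the same length. The function name and the toy sequences are made up; a gap opposite a base is counted as a difference, which is one convention among several.

```python
def raw_distance(aligned_a, aligned_b):
    """Fraction of differing positions between two aligned sequences.

    Columns where both sequences have a gap ('-') are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0

# Toy pair with 3 differences in 20 positions, i.e. 15% raw distance.
print(raw_distance("ACGTACGTACGTACGTACGT", "ACCTACGTACGAACGTACGA"))
```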

its_0.5_raw_distances.pdf


This sequence, for example, gets only a confidence score of 67 for "Fungi". It is very different from the next best plant sequence (at least 15% raw genetic distance). I would really like to help you refine the filtering step that removes falsely classified sequences from the database. I suspect there might still be some falsely classified sequences in my database (a plant labeled as fungi or vice versa) causing this problem. What kind of debugging would you recommend to check where the problem is coming from? How can I trace which sequences are causing this effect?

`CACCACTCAAGCTATGCTTGGTATTGGGCGTCGTCCTTAGTTGGGCGCGCCTTAAAGACCTCGGCGAGGCCACTCCGGCTTTAGGCGTAGTAGAATTTATTCGAACGTCTGTCAAAGGAGAGGAACTCTGCCGACTGAAACCTTTATTTTTCTA`

Here are the sequences from my database I would expect it to match (for the query sequence, see above):

![grafik](https://user-images.githubusercontent.com/19622117/83752844-78ff5e00-a669-11ea-8dc0-c4604336fcee.png)

This sequence is another example with a weak assignment to Plantae (score: 49); it should classify as Selaginella:

`GATCTAAACAGCCCACGGCCCGCACTTCACTGTGGGGGCTGGGGCTCATCTGGCTGTCCGAGGTCCTTGTGACCCGGTCGGCTCTAATTCATGGGAGGGTGCTATGTTGTGCTTCTTGGTTTCACCGCCGCTTGGTTTCAACTGTAGCTCCTCGCCCTTCGGTGTTATCTCTAA`

In this case it cannot be contamination, because no fungal sequence in my database even closely resembles this sequence. My suspicion is that the algorithm may not be able to find diagnostic k-mers in hyperdiverse groups (there are not really many diagnostic k-mers for Selaginellaceae).

Here is another example that has 15% distance to the next plant match but >50% distance to the next fungal match, and it still gets a score of 49 for Plantae:

`CCGACGCCTTCTCCCCCCGCCCCGCGCTGCGGCGCGTCGCCGGGGGCGAGCAGTTGGCCGTCCCTGCCCCCTGGGGCGAGGTCGGCCTAAATCCGAGGCCCCTCGGGCTCGCGGCGCGACGATTGGTGGCTCCAAGCTCCCCGGCCTCTTGCCAGGCTCGAAGTCGTGCCCGCGACCCCCCTTGGGAGGCTGCGAGGACCCCTGCCGGCTGCCGCGCCGTCCGTAAGGACCGGAGGCGCGCCGCACGGA`

The intrageneric distances of ITS can be pretty big. So if the exact species is missing from the database for some reason, IDTAXA has trouble assigning the sequence to the next higher level. I understand this is to prevent over-classification, but sometimes it seems a little over the top.

Here is my trained Database (use load() ): (contains fungi and plant ITS2 sequences) https://easyupload.io/pce2tv

benjjneb / dada2

IDTAXA is not assigning sequences to Kingdom level #1029