benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

IDTAXA is not assigning sequences to Kingdom level #1029

Closed Andreas-Bio closed 4 years ago

Andreas-Bio commented 4 years ago

Okay, so this might be because my database is a bit noisy, but I am a plant person and I have no choice. I am using this database: http://its2.bioapps.biozentrum.uni-wuerzburg.de/

I am getting a lot of assignments like this: [1] "Root [57.2%]; unclassified_Root [57.2%]"

And I am getting a lot of assignments even to kingdom level that do not meet the default confidence level of 60: "Root [52.3%]; Plantae [52.3%]; Magnoliopsida [51.9%]; Rubiaceae [47.9%]; Sherardia [45.8%]; arvensis [45.8%]"
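To make the effect of the confidence threshold concrete, here is a minimal Python sketch that parses an assignment string of the form shown above and keeps only the ranks that reach a given threshold. The function names are hypothetical and this is not how IDTAXA itself applies the threshold; it only illustrates the cut-off on the printed output.

```python
import re

def parse_assignment(text):
    """Parse 'Taxon [conf%]; Taxon [conf%]; ...' into (taxon, confidence) pairs."""
    pairs = []
    for part in text.split(";"):
        match = re.fullmatch(r"\s*(.+?)\s*\[([\d.]+)%\]\s*", part)
        if match:
            pairs.append((match.group(1), float(match.group(2))))
    return pairs

def truncate_at_threshold(pairs, threshold=60.0):
    """Keep ranks from the root downwards until confidence first drops below threshold."""
    kept = []
    for taxon, conf in pairs:
        if conf < threshold:
            break
        kept.append(taxon)
    return kept

example = ("Root [52.3%]; Plantae [52.3%]; Magnoliopsida [51.9%]; "
           "Rubiaceae [47.9%]; Sherardia [45.8%]; arvensis [45.8%]")
lineage = parse_assignment(example)
print(truncate_at_threshold(lineage, threshold=60.0))  # [] : nothing reaches 60
print(truncate_at_threshold(lineage, threshold=50.0))  # ['Root', 'Plantae', 'Magnoliopsida']
```

At a threshold of 60 nothing survives for this sequence, which is why the low root-level confidence matters so much downstream.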

When I check the sequences in BLAST I get a clear assignment, at least to family level. My guess is that some sequences are incorrectly annotated (e.g. plant sequences labeled as fungi) and are not removed by the IDTAXA learning phase. I tried to debug the function that assigns taxonomy, but it is too difficult for me. I followed the tutorial and tried both options for allowGroupRemoval.

Here is my trained database (use load()): https://easyupload.io/pce2tv

> trainingSet
  A training set of class 'Taxa'
   * K-mer size: 7
   * Number of rank levels: 7
   * Total number of sequences: 215043
   * Number of taxonomic groups: 90592
   * Number of problem groups: 1
   * Number of problem sequences: 40
> head(groups)
[1] "Root;Fungi;Eurotiomycetes;Trichocomaceae;Aspergillus;nidulans"       
[2] "Root;Fungi;Sordariomycetes;Glomerellaceae;Colletotrichum;capsici"    
[3] "Root;Fungi;Sordariomycetes;Glomerellaceae;Colletotrichum;caudatum"   
[4] "Root;Fungi;Sordariomycetes;Glomerellaceae;Colletotrichum;capsici"    
[5] "Root;Fungi;Sordariomycetes;Glomerellaceae;Colletotrichum;fuscum"     
[6] "Root;Fungi;Sordariomycetes;Glomerellaceae;Colletotrichum;graminicola"

Here are the two examples from above: GCAGGATCCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCGAGGGGCATGCCTGTTCGAGCGTCATTACACCACTCAAGCTATGCTTGGTATTGGGCGTCGTCCTTAGTTGGGCGCGCCTTAAAGACCTCGGCGAGGCCACTCCGGCTTTAGGCGTAGTAGAATTTATTCGAACGTCTGTCAAAGGAGAGGAACTCTGCCGACTGAAACCTTTATTTTTCTAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAGCATA

CCGTGAATCATCGAGTTTTTGAACGCAAGTTGCGCCCGAGGCCACCCGGCCGAGGGCACGTCTGCCTGGGCGTCACGCATCCCGTCGTCGCACCAAGTCTTGCTTGGCGCGGCGGAAGTTGGCCTCCCGTTCCCCCCGCGGCGCGGTTGGCCCAAATGCGAGTCCCCGGACAAGGGACGTCACGACTTCAGGTGGTTGAAATCACTTTCATTCTCGCTCCGAGTCTTGACGATCCCCCTGTGGTTATATAACGACCCTAGAGCCTTCACCGCGCTCACTGGGTTGAGTTCAACGAGATCCTCGAACGCGACCCCGGGTCAGGCGGGATCACCCACTAAGTTTAA

Thank you @digitalwright for allowing me to ping you here. Best wishes, Andreas

digitalwright commented 4 years ago

Thanks for your interest in IDTAXA. The link to your training set did not work for me, but I appreciated you providing lots of information.

I don't know what you mean by "in BLAST I get a clear assignment." Using the BLAST tool on the database you provided results in hits but these hits have low coverage. Nevertheless, it is generally unclear how to directly convert a BLAST hit into a taxonomic assignment.

It looks like the issue is that your sequences are longer than the reference sequences. That is, they extend beyond the region included in the ITS2 database. My guess is that if you trimmed your query sequences to the reference region then the confidences will increase. In particular, the final ~40 nucleotides appear to be missing from the reference sequences.

IDTAXA makes the assumption that the reference sequences are generally full-length, but the query sequences do not need to be full-length. That is, the information in the query should be fully overlapping with the information in the training set.
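One rough way to approximate this trimming without an external tool is to cut the query down to the span that shares k-mers with a reference sequence. The Python sketch below is only an illustration under stated assumptions: `trim_to_reference` is a hypothetical helper, it compares against a single reference, and a chance k-mer match inside a flank would leave that flank partly untrimmed. It is not part of IDTAXA or DECIPHER.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def trim_to_reference(query, reference, k=7):
    """Return the part of query between its first and last k-mer shared with reference.

    Returns an empty string if no k-mer is shared.
    """
    ref_kmers = kmers(reference, k)
    hits = [i for i in range(len(query) - k + 1) if query[i:i + k] in ref_kmers]
    if not hits:
        return ""
    return query[hits[0]:hits[-1] + k]

# Toy data (invented): a reference region with artificial flanks added to the query.
reference = "ACGTTGCAAGGCTTAACCGGTA"
query = "TTTTTTTTTT" + reference + "GGGGGGGGGG"
print(trim_to_reference(query, reference) == reference)  # True
```

For real ITS2 data the flanks are conserved 5.8S and 26S rDNA, so a dedicated tool or known flank positions would likely be more reliable than this heuristic.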

I hope that helps.

Andreas-Bio commented 4 years ago

Sorry, updated the link.

Ohhh that makes sense! Thanks, I wasn't aware of that. However, that assumption is a bit inconvenient. The ITS database is trimmed exactly to ITS2, but ITS2 primers always pick up a bit of the flanking sequences. The problem is that the flanking sequences can only be removed with ITSx by Bengtsson-Palme, which only runs on Linux. I was trying to build a pipeline that runs on all operating systems. I am on a Windows machine myself and was hoping to get away with it.

I will try to trim and post an update. I am wondering why this hasn't been posted before, as it affects almost all plant and fungi people alike.

Andreas-Bio commented 4 years ago

Of course that was correct. I have to cut the ITS2 sequences. I am very confused about why I didn't read that the query must not be longer than the database. Maybe I missed it? I did go through the paper and the tutorial PDF.

I also tried SINTAX, and it is not affected as much by the flanking sequences (the ITS2 flanking sequences are 5.8S rDNA and 26S rDNA), but its identification scores take a hit nonetheless (especially at species level).

One thing I am missing in SINTAX and IDTAXA is a way to show the next best hits. In plants there are many cases where barcodes are shared between species, and it would be helpful to get something like the top 20 hits. That would make it possible to at least identify the species complex instead of automatically falling back to genus level. (All species at the top of the list that share the same probability must have the same set of k-mers, give or take the bootstrap variation.)
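As an illustration of what such a report could look like, here is a small Python sketch that ranks per-taxon scores and lists the taxa tied with the best hit. It assumes that a score per candidate species is available, which neither tool is stated to expose in this thread. The function name and the scores are invented for the example.

```python
def top_hits(scores, n=20, tolerance=0.0):
    """Rank taxa by score and report which ones tie with the best hit.

    scores: dict mapping taxon name to score.
    Returns (ranked, tied): up to n (taxon, score) pairs, best first,
    and the taxa whose score is within `tolerance` of the best score.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    if not ranked:
        return [], []
    best = ranked[0][1]
    tied = [taxon for taxon, score in ranked if best - score <= tolerance]
    return ranked[:n], tied

# Invented scores, for illustration only.
scores = {
    "Rosa canina": 0.40,
    "Rosa rugosa": 0.40,
    "Rosa gallica": 0.38,
    "Malus domestica": 0.05,
}
ranked, tied = top_hits(scores, n=3)
print(tied)  # ['Rosa canina', 'Rosa rugosa']
```

A non-zero `tolerance` would allow for the bootstrap variation mentioned above, so that near-ties are also treated as one candidate species complex.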

SINTAX (without trimming):                      
                       
spec.n spec.s gen.n gen.s fam.n fam.s class.n class.s king.n king.s seq_id seq
Arabis_verna 0.8464 Arabis 0.92 Brassicaceae 1 Magnoliopsida 1 Plantae 1 76 CCGTGAACCATCGAGTTTTTGAACGCAAGTTGCGCCTGAAGCCATTAGGCAGAGGGCACGTCTGCCTGGGTGTCACACATCGTTGCCCCAACACCAAATGCCTCGTGCTGCTTGGTGTGTCCGGCGAATGATGACATCCCGTGAGCCCCGCCTCACGGTTTGTTGAAAATTGAGTCCATGGCAGGGTATTCCATGGTGGATGGTGGTTGGGCAATGCTCGAGACCATTCGTGGAAGCTTTATCGTGGCTGGGCTCTGGTATCCCACGTGCGTCGAAATACGCTCACAATGAGACCTCAGGTCAGGCGGGGCTACCCGCTGAGTTTAA
Anthyllis_circinnata 1 Anthyllis 1 Fabaceae 1 Magnoliopsida 1 Plantae 1 76 CCGTGAACCATCGAGTTTTTGAACGCAAGTTGCGCCTGAAGCCATTAGGCAGAGGGCACGTCTGCCTGGGTGTCACACATCGTTGCCCCAACACCAAATGCCTCGTGCTGCTTGGTGTGTCCGGCGAATGATGACATCCCGTGAGCCCCGCCTCACGGTTTGTTGAAAATTGAGTCCATGGCAGGGTATTCCATGGTGGATGGTGGTTGGGCAATGCTCGAGACCATTCGTGGAAGCTTTATCGTGGCTGGGCTCTGGTATCCCACGTGCGTCGAAATACGCTCACAATGAGACCTCAGGTCAGGCGGGGCTACCCGCTGAGTTTAA
Tordylium_apulum 0.7396 Tordylium 0.86 Apiaceae 1 Magnoliopsida 1 Plantae 1 52757 CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCTGAGGGCACGTCTGCCTGGGTGTCACGCTTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGCCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTCGATTGTGACCCCAGGTCAGGCGGGACTACCCGCTGAGTTTAA
Tordylium_apulum 0.6724 Tordylium 0.82 Apiaceae 1 Magnoliopsida 1 Plantae 1 929 CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCTGAGGGCACGTCTGCCTGGGTGTCACGCTTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGCCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTGCACCACATTGACTGCTCTTCGATTGTGACCCCAGGTCAGGCGGGACTACCCGCTGAGTTTAA
Tordylium_apulum 0.6724 Tordylium 0.82 Apiaceae 1 Magnoliopsida 1 Plantae 1 855 CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCTGAGGGCACGTCTGCCTGGGTGTCACGCTTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGTCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTCGATTGTGACCCCAGGTCAGGCGGGACTACCCGCTGAGTTTAA
Alternaria_tenuissima 0.1932 Alternaria 0.5855 Pleosporaceae 0.6889 Dothideomycetes 0.801 Fungi 0.9 299 GATCCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCTTTGGTATTCCAAAGGGCATGCCTGTTCGAGCGTCATTTGTACCCTCAAGCTTTGCTTGGTGTTGGGCGTCTTGTCTCTAGCTTTGCTGGAGACTCGCCTTAAAGTAATTGGCAGCCGGCCTACTGGTTTCGGAGCGCAGCACAAGTCGCACTCTCTATCAGCAAAGGTCTAGCATCCATTAAGCCTTTTTTTCAACTTTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAG
Schoenus_nigricans 1 Schoenus 1 Cyperaceae 1 Liliopsida 1 Plantae 1 1538 CCGCGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAGGGATCCGCCCGAGGGCACGCCTGCCTCATGGGCGTTAGAAGCCCATCCACGCTCGGGAGCCTAGCTACTTGGCCAGCCCCGATGCGGATCGTGGCCCTCCGAGCCCTAGGGCGCGGTGGGCCCAAGTGCGCGGCCGTCCGAAGGAGCCGGGAGCGGCGAGTGGTGGAATGCTGCGCGCGCCGTCCCGGGACCCCTGCCGGCATATGGCTTTGTCCGACCCTCGACGAGGAGCCGCGTCGCCTTCGAAAGGAGTGCGGCATTCTCAGATCGATACCCCAGGTCAGGCGGGGCTACCCGCTGAGTTTAA
Solanum_citrullifolium 0.1263 Solanum 0.308 Solanaceae 0.56 Magnoliopsida 1 Plantae 1 79 CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCGTCAGGCCGAGGGCACGTCTGCCTGGGCGTCACGCATCGCGTCGCCCCCCGCACGCCGCTCGGCGTCGCGGGGGCGGATACTGGCCCCCCGTGCGCCCCCCGCGCGCGGCCGGCCTAAATGCGAGCCCGCGCCGACGGACGTCGCGGCGATTGGTGGTTGTATCTCAACTCTCTTCGCGCCGCGGCCGCAGCCCGTCGTGCGTGCGCGCTCCCCGACCCTCAAAGCGCCTCGCGCGCTCCGACCGCGACCCCAGGTCAGGCGGGATTACCCGCTGAGTTTAA
Nemania_serpens 0.6069 Nemania 0.7225 Xylariaceae 0.85 Sordariomycetes 1 Fungi 1 70 CAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCACTAGTATTCTGGTGGGCATGCCTGTTCGAGCGTCATTTCAACCCTTAAGCCCCTGTTGCTTAGCGTTAGGAGCCTACCGGAACTCTCTGGTAGCTCCCCAAAGTCAGTGGCGGAGCCGGTTCGCACTCCAGACGTAGTAGCTTTTACACGTCGCCTGTAGCGCGGGCCGGTCCCCTGCCGTAAAACACCCCAATTTTTATAGGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAA
Rosa_hybrid 0.71 Rosa 1 Rosaceae 1 Magnoliopsida 1 Plantae 1 366 CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCCGAGGGCACGTCTGCCTGGGCGTCACACGTCGTTGCCCCCCCCCAACCCCCTCGGGAGTTGGATGGGACGGATGATGGCCTCCCGTGTGCTCAGTCACGCGGTTGGCATAAATACCAAGTCCTCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTTTCCTGTCGTGCGCGCGTGTTGATCGAGTGCTTTCTTAAACAATGCGTGTCGATTCGTCGATGCTTTCAACGCGACCCCAGGTCAGGCGGGGTTACCCGCTGAATTTAA
Rosa_hybrid 0.78 Rosa 1 Rosaceae 1 Magnoliopsida 1 Plantae 1 85 CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCCGAGGGCACGTCTGCCTGGGCGTCACACGTCGTTGCCCCCCCCCAACCCCCTCGGGAGTTGGATGGGACGGATGATGGCCTCCCGTGTGCTCAGTCACGCGGTTGGCATAAGTACCAAGTCCTCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTTTCCTGTCGTGCGCGCGTGTTGATCGAGTGCTTTCTTAAACAATGCGTGTCGATCCGTCGATGCTTACAACGCGACCCCAGGTCAGGCGGGGTTACCCGCTGAATTTAA
                       
SINTAX (with trimming):                      
                       
spec.n spec.s gen.n gen.s fam.n fam.s class.n class.s king.n king.s seq_id seq
Arabis_verna 1 Arabis 1 Brassicaceae 1 Magnoliopsida 1 Plantae 1 76 AACGTCGTCCCCATCCTTTTCGGAGAAGGGACGGAAGCTGGTCTCCCGTGTGTTACCGCATGCGGTTGGCTAAAATCCGAGCTGAGGATGCCTTGAGCGTCTCGACATGCGGTGGTGAAATAAAGCCTCGTAATACTGTCGGTCGCTTTTGTCTGAATGCTCTTGATGACCCAACATCCTTAACGCGACCCCAGGTCAGGCGGGATCAC
Schoenus_nigricans 1 Schoenus 1 Cyperaceae 1 Liliopsida 1 Plantae 1 1538 GCCCATCCACGCTCGGGAGCCTAGCTACTTGGCCAGCCCCGATGCGGATCGTGGCCCTCCGAGCCCTAGGGCGCGGTGGGCCCAAGTGCGCGGCCGTCCGAAGGAGCCGGGAGCGGCGAGTGGTGGAATGCTGCGCGCGCCGTCCCGGGACCCCTGCCGGCATATGGCTTTGTCCGACCCTCGACGAGGAGCCGCGTCGCCTTCGAAAGGAGTGCGGCATTCTCAGA
Tordylium_apulum 1 Tordylium 1 Apiaceae 1 Magnoliopsida 1 Plantae 1 855 TTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGTCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTCGA
Tordylium_apulum 1 Tordylium 1 Apiaceae 1 Magnoliopsida 1 Plantae 1 929 TTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGCCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTGCACCACATTGACTGCTCTTCGA
Tordylium_apulum 1 Tordylium 1 Apiaceae 1 Magnoliopsida 1 Plantae 1 52757 TTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGCCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTCGA
Alternaria_tenuissima 0.2137 Alternaria 0.7122 Pleosporaceae 0.8186 Dothideomycetes 0.9409 Fungi 0.97 299 GTACCCTCAAGCTTTGCTTGGTGTTGGGCGTCTTGTCTCTAGCTTTGCTGGAGACTCGCCTTAAAGTAATTGGCAGCCGGCCTACTGGTTTCGGAGCGCAGCACAAGTCGCACTCTCTATCAGCAAAGGTCTAGCATCCATTAAGCCTTTTTTTCAAC
Solanum_citrullifolium 0.87 Solanum 1 Solanaceae 1 Magnoliopsida 1 Plantae 1 79 ATCGCGTCGCCCCCCGCACGCCGCTCGGCGTCGCGGGGGCGGATACTGGCCCCCCGTGCGCCCCCCGCGCGCGGCCGGCCTAAATGCGAGCCCGCGCCGACGGACGTCGCGGCGATTGGTGGTTGTATCTCAACTCTCTTCGCGCCGCGGCCGCAGCCCGTCGTGCGTGCGCGCTCCCCGACCCTCAAAGCGCCTCGCGCGCTCCGA
Rosa_hybrid 0.2 Rosa 1 Rosaceae 1 Magnoliopsida 1 Plantae 1 366 GTCGTTGCCCCCCCCCAACCCCCTCGGGAGTTGGATGGGACGGATGATGGCCTCCCGTGTGCTCAGTCACGCGGTTGGCATAAATACCAAGTCCTCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTTTCCTGTCGTGCGCGCGTGTTGATCGAGTGCTTTCTTAAACAATGCGTGTCGATTCGTCGATGCTTTCA
Nemania_serpens 0.512 Nemania 0.64 Xylariaceae 0.8 Sordariomycetes 1 Fungi 1 70 CAACCCTTAAGCCCCTGTTGCTTAGCGTTAGGAGCCTACCGGAACTCTCTGGTAGCTCCCCAAAGTCAGTGGCGGAGCCGGTTCGCACTCCAGACGTAGTAGCTTTTACACGTCGCCTGTAGCGCGGGCCGGTCCCCTGCCGTAAAACACCCCAATTTTTATA
Rosa_hybrid 0.4 Rosa 1 Rosaceae 1 Magnoliopsida 1 Plantae 1 85 GTCGTTGCCCCCCCCCAACCCCCTCGGGAGTTGGATGGGACGGATGATGGCCTCCCGTGTGCTCAGTCACGCGGTTGGCATAAGTACCAAGTCCTCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTTTCCTGTCGTGCGCGCGTGTTGATCGAGTGCTTTCTTAAACAATGCGTGTCGATCCGTCGATGCTTACA
                       
                       
IDTAXA (with trimming):

| Confidence | Kingdom | Class | Family | Genus | Species |
|---|---|---|---|---|---|
| 100.0% | Plantae | Magnoliopsida | Apiaceae | Tordylium | apulum |
| 86.6% | Plantae | Liliopsida | Cyperaceae | Schoenus | nigricans |
| 76.0% | Plantae | Magnoliopsida | Brassicaceae | Arabis | verna |
| 61.5% | Plantae | Magnoliopsida | Solanaceae | Solanum | citrullifolium |
| 95.0% | Plantae | Magnoliopsida | Apiaceae | Tordylium | apulum |
| 100.0% | Plantae | Magnoliopsida | Apiaceae | Tordylium | apulum |
| 86.1% | Fungi | Dothideomycetes | Pleosporaceae | Alternaria | unclassified_Alternaria |
| 97.0% | Plantae | Magnoliopsida | Rosaceae | Rosa | unclassified_Rosa |
| 93.3% | Plantae | Magnoliopsida | Rosaceae | Rosa | unclassified_Rosa |
| 54.7% | Fungi | Sordariomycetes | Xylariaceae | Nemania | serpens |

PS: I don't know how SINTAX does it, but it is possible to do a leave-one-out cross-validation with some undocumented commands: sintax_its2_bootstrap_distr_fam_per_spec_new.pdf. Just ignore the triangles. Each dot is one sequence (drawn with transparency because of overlap), and the red dots are sequences that were not assigned correctly. The y-axis shows the confidence level that gets reported to the user, so basically every red dot above 0.8 is bad. That makes it easy to check how messy your analysis gets if you start to accept lower confidence scores as "truth". It would be great to have that for IDTAXA.
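The same check can be expressed as a number instead of a plot. The Python sketch below counts how many assignments at or above a confidence cut-off are wrong, given validation results as (confidence, is_correct) pairs. The function name and the sample values are invented for illustration and do not come from the PDF.

```python
def errors_above(results, threshold=0.8):
    """Count wrong assignments among those accepted at the given threshold.

    results: iterable of (confidence, is_correct) pairs.
    Returns (n_wrong, n_accepted) for entries with confidence >= threshold.
    """
    accepted = [bool(is_correct) for confidence, is_correct in results
                if confidence >= threshold]
    return accepted.count(False), len(accepted)

# Invented validation results, for illustration only.
results = [(0.95, True), (0.85, False), (0.60, False), (0.82, True)]
print(errors_above(results, threshold=0.8))  # (1, 3)
print(errors_above(results, threshold=0.5))  # (2, 4)
```

Running this over a range of thresholds shows how quickly the number of accepted errors grows as the cut-off is lowered.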

digitalwright commented 4 years ago

The way SINTAX and the RDP Classifier work does not require trimming of the sequences. IDTAXA does because it incorporates distance into its confidence. I will add that information to the documentation in the next release. Thank you for the suggestion.

It is not possible to do quick leave-one-out validation with IDTAXA because of the way the algorithm is constructed. Again, this is different from SINTAX and the RDP Classifier. With IDTAXA you have to remove each sequence, retrain with LearnTaxa(), and test with IdTaxa(). This is because the learning step is non-trivial with IDTAXA and there is no way to simply remove a reference sequence after the fact. However, you could set up a straightforward pipeline to perform relatively quick k-fold cross-validation with all algorithms. This is arguably more realistic than leave-one-out cross-validation anyhow.
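The general shape of such a k-fold pipeline can be sketched in a few lines of Python. In the sketch below, `train_fn` and `predict_fn` are hypothetical stand-ins for whatever classifier is being tested (for IDTAXA they would wrap the retraining and classification steps named above); the toy classifier at the end exists only to make the example runnable.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Shuffle item indices and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(sequences, labels, train_fn, predict_fn, k=5, seed=0):
    """Retrain on k-1 folds, test on the held-out fold, and return overall accuracy."""
    correct = 0
    for held_out in k_fold_indices(len(sequences), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(sequences)) if i not in held]
        model = train_fn([sequences[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, sequences[i]) == labels[i]
                       for i in held_out)
    return correct / len(sequences)

# Toy classifier (invented): predicts the label seen for the same first base.
def toy_train(seqs, labs):
    return {s[0]: l for s, l in zip(seqs, labs)}

def toy_predict(model, seq):
    return model.get(seq[0])

seqs = ["AAA", "AAC", "AAG", "AAT", "ACA", "CCA", "CCC", "CCG", "CCT", "CAC"]
labs = ["x" if s[0] == "A" else "y" for s in seqs]
print(cross_validate(seqs, labs, toy_train, toy_predict, k=5))  # 1.0
```

Because each fold requires only one retraining, k retraining runs replace the one-per-sequence retraining that leave-one-out would need.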

Andreas-Bio commented 4 years ago

Thank you for your feedback! I cannot put my finger on it but now that I trimmed all sequences down to ITS2 there is still some unwanted behaviour.

TL;DR: ITS has pretty large genetic distances. If the species of the query sequence is missing from the database, the next best match within the genus sometimes has up to 15% genetic distance. This has biological reasons. It seems to confuse IDTAXA into believing that there is nothing closely resembling the query in the database. As a result, the assignment score at kingdom level drops below 50 in quite a lot of cases, making the sequence classification worthless for downstream analysis. Is it possible that IDTAXA was built with 16S in mind, which has much lower genetic distances? In that case it would benefit from an exploratory phase in which the algorithm samples some genetic distances to estimate at which distance one would expect a sequence to belong to another family. I understand you need to make some assumptions for this, but otherwise I cannot see it performing well in plants or with other markers that have high genetic variation.
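For reference, the raw genetic distance used in this discussion is the proportion of differing sites between two aligned sequences. A minimal Python sketch follows, assuming the two sequences are already aligned to the same length and that gap columns are skipped. The function name is hypothetical and the sequences are toy data.

```python
def p_distance(a, b, gap="-"):
    """Proportion of differing sites between two aligned, equal-length sequences.

    Columns where either sequence has a gap are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a.upper(), b.upper())
                if x != gap and y != gap]
    if not compared:
        return float("nan")
    return sum(x != y for x, y in compared) / len(compared)

# Toy aligned sequences (invented): 2 mismatches over 10 sites.
print(p_distance("ACGTACGTAC", "ACGTTCGTAA"))  # 0.2
```

Sampling such distances within genera and between families in the reference set would give a rough picture of how much divergence to expect at each rank for a given marker.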

its_0.5_raw_distances.pdf

This sequence, for example, gets a confidence score of only 67 for "Fungi". It is very different from the next best plant sequence (at least 15% raw genetic distance):

`CACCACTCAAGCTATGCTTGGTATTGGGCGTCGTCCTTAGTTGGGCGCGCCTTAAAGACCTCGGCGAGGCCACTCCGGCTTTAGGCGTAGTAGAATTTATTCGAACGTCTGTCAAAGGAGAGGAACTCTGCCGACTGAAACCTTTATTTTTCTA`

Here are the sequences from my database that I would expect it to match (query sequence above):

![grafik](https://user-images.githubusercontent.com/19622117/83752844-78ff5e00-a669-11ea-8dc0-c4604336fcee.png)

I would really like to help you refine the filtering step that removes falsely classified sequences from the database. I suspect there may still be some falsely classified sequences in my database (a plant labeled as fungi or vice versa) causing this problem. What kind of debugging would you recommend to find out where the problem is coming from? For example, how can I trace which sequences are causing this effect?

This sequence is another example with a weak assignment to Plantae (score: 49); it should classify as Selaginella:

`GATCTAAACAGCCCACGGCCCGCACTTCACTGTGGGGGCTGGGGCTCATCTGGCTGTCCGAGGTCCTTGTGACCCGGTCGGCTCTAATTCATGGGAGGGTGCTATGTTGTGCTTCTTGGTTTCACCGCCGCTTGGTTTCAACTGTAGCTCCTCGCCCTTCGGTGTTATCTCTAA`

In this case it cannot be contamination, because there is no fungal sequence in my database that even closely resembles this sequence. My suspicion is that the algorithm may not be able to find diagnostic k-mers in hyperdiverse groups (there are not really many diagnostic k-mers for Selaginellaceae).

Here is another example that has 15% distance to the next plant match but >50% distance to the next fungal match, and it still gets a score of 49 for Plantae:

`CCGACGCCTTCTCCCCCCGCCCCGCGCTGCGGCGCGTCGCCGGGGGCGAGCAGTTGGCCGTCCCTGCCCCCTGGGGCGAGGTCGGCCTAAATCCGAGGCCCCTCGGGCTCGCGGCGCGACGATTGGTGGCTCCAAGCTCCCCGGCCTCTTGCCAGGCTCGAAGTCGTGCCCGCGACCCCCCTTGGGAGGCTGCGAGGACCCCTGCCGGCTGCCGCGCCGTCCGTAAGGACCGGAGGCGCGCCGCACGGA`

The intrageneric distances of ITS can be pretty big. So if the exact species is missing from the database for some reason, IDTAXA has trouble assigning the sequence to the next higher level. I understand this is to prevent over-classification, but sometimes it seems to be a little over the top.

Here is my trained database (use load()); it contains fungal and plant ITS2 sequences: https://easyupload.io/pce2tv