Andreas-Bio closed this issue 4 years ago
Thanks for your interest in IDTAXA. The link to your training set did not work for me, but I appreciated you providing lots of information.
I don't know what you mean by "in BLAST I get a clear assignment." Using the BLAST tool on the database you provided results in hits but these hits have low coverage. Nevertheless, it is generally unclear how to directly convert a BLAST hit into a taxonomic assignment.
It looks like the issue is that your sequences are longer than the reference sequences. That is, they extend beyond the region included in the ITS2 database. My guess is that if you trimmed your query sequences to the reference region, the confidences would increase. In particular, the final ~40 nucleotides of your queries appear to be missing from the reference sequences.
IDTAXA makes the assumption that the reference sequences are generally full-length, but the query sequences do not need to be full-length. That is, the information in the query should be fully overlapping with the information in the training set.
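The effect of trimming can be sketched in a few lines of Python. This is only an illustrative heuristic, not what DECIPHER or ITSx actually does: it keeps the part of a query that is spanned by k-mers shared with a single reference sequence. The function names and the default `k=8` are assumptions made for the example.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def trim_to_reference(query, reference, k=8):
    """Keep the part of `query` spanned by k-mers it shares with `reference`.

    Hypothetical helper for illustration: flanking sequence that has no
    k-mer in common with the reference is cut off at both ends.
    """
    ref_kmers = kmers(reference, k)
    hits = [i for i in range(len(query) - k + 1) if query[i:i + k] in ref_kmers]
    if not hits:
        return ""  # no shared k-mer, so nothing overlaps the reference
    return query[hits[0]:hits[-1] + k]


# Toy data (made up): a reference region with extra flanks on the query.
reference = "ACGTTGCAAGGCTTACCGATGGCA"
query = "T" * 10 + reference + "G" * 12
trimmed = trim_to_reference(query, reference)
```

A real pipeline would trim against the whole training set (or use a profile-based tool), since a single reference may itself be incomplete.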
I hope that helps.
Sorry, updated the link.
Ohhh that makes sense! Thanks, I wasn't aware of that. However, that assumption is a bit inconvenient. The ITS database is trimmed exactly to ITS2, but ITS2 primers always pick up a bit of the flanking sequences. The problem is that the flanking sequences can only be removed with ITSx (by Bengtsson-Palme), which only runs on Linux. I was trying to build a pipeline that runs on every OS. I am on a Windows machine myself and was hoping to get away with it.
I will try to trim and post an update. I am wondering why this hasn't been posted before, as it affects almost all plant and fungi people alike.
Of course that was correct. I have to cut the ITS2 sequences. I am very confused why I didn't read that the query must not be longer than the database. Maybe I missed it? I did go through the paper and the tutorial PDF.
I also tried SINTAX and it is not affected as much by the flanking sequences (ITS2 flanking sequences are 5.8S rDNA and 26S rDNA), but it takes a hit in identification scores nonetheless (especially at species level).
One thing I am missing in SINTAX and IDTAXA is a way to show the next best hits. There are many cases where barcodes are shared between species in plants, and it would be helpful to get, say, the top 20 hits. That would make it possible to at least identify the species complex instead of automatically falling back to genus level. (All species at the top of the list that share the same probability must have the same set of k-mers, ± the bootstrap variation.)
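As a rough sketch of what such an output could look like: given a per-species support score for one query, list every species whose score is within a small tolerance of the best one. Neither SINTAX nor IDTAXA exposes this as far as this thread establishes. The function name, the `tolerance` and `top_n` parameters, and the scores below are all made up for illustration.

```python
def species_complex(scores, tolerance=0.05, top_n=20):
    """Return species whose score is within `tolerance` of the best score.

    scores: dict mapping species name -> support (e.g. bootstrap fraction).
    Only the `top_n` highest-scoring species are considered.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]
    if not ranked:
        return []
    best = ranked[0][1]
    return [name for name, score in ranked if best - score <= tolerance]


# Hypothetical bootstrap support for a single query sequence.
example_scores = {
    "Rosa_canina": 0.34,
    "Rosa_hybrid": 0.33,
    "Rosa_rugosa": 0.31,
    "Malus_domestica": 0.02,
}
candidates = species_complex(example_scores)
```

Here the three Rosa entries are returned together, which is the "species complex" answer rather than a silent fallback to genus level.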
SINTAX (without trimming):

spec.n | spec.s | gen.n | gen.s | fam.n | fam.s | class.n | class.s | king.n | king.s | seq_id | seq |
---|---|---|---|---|---|---|---|---|---|---|---|
Arabis_verna | 0.8464 | Arabis | 0.92 | Brassicaceae | 1 | Magnoliopsida | 1 | Plantae | 1 | 76 | CCGTGAACCATCGAGTTTTTGAACGCAAGTTGCGCCTGAAGCCATTAGGCAGAGGGCACGTCTGCCTGGGTGTCACACATCGTTGCCCCAACACCAAATGCCTCGTGCTGCTTGGTGTGTCCGGCGAATGATGACATCCCGTGAGCCCCGCCTCACGGTTTGTTGAAAATTGAGTCCATGGCAGGGTATTCCATGGTGGATGGTGGTTGGGCAATGCTCGAGACCATTCGTGGAAGCTTTATCGTGGCTGGGCTCTGGTATCCCACGTGCGTCGAAATACGCTCACAATGAGACCTCAGGTCAGGCGGGGCTACCCGCTGAGTTTAA |
Anthyllis_circinnata | 1 | Anthyllis | 1 | Fabaceae | 1 | Magnoliopsida | 1 | Plantae | 1 | 76 | CCGTGAACCATCGAGTTTTTGAACGCAAGTTGCGCCTGAAGCCATTAGGCAGAGGGCACGTCTGCCTGGGTGTCACACATCGTTGCCCCAACACCAAATGCCTCGTGCTGCTTGGTGTGTCCGGCGAATGATGACATCCCGTGAGCCCCGCCTCACGGTTTGTTGAAAATTGAGTCCATGGCAGGGTATTCCATGGTGGATGGTGGTTGGGCAATGCTCGAGACCATTCGTGGAAGCTTTATCGTGGCTGGGCTCTGGTATCCCACGTGCGTCGAAATACGCTCACAATGAGACCTCAGGTCAGGCGGGGCTACCCGCTGAGTTTAA |
Tordylium_apulum | 0.7396 | Tordylium | 0.86 | Apiaceae | 1 | Magnoliopsida | 1 | Plantae | 1 | 52757 | CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCTGAGGGCACGTCTGCCTGGGTGTCACGCTTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGCCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTCGATTGTGACCCCAGGTCAGGCGGGACTACCCGCTGAGTTTAA |
Tordylium_apulum | 0.6724 | Tordylium | 0.82 | Apiaceae | 1 | Magnoliopsida | 1 | Plantae | 1 | 929 | CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCTGAGGGCACGTCTGCCTGGGTGTCACGCTTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGCCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTGCACCACATTGACTGCTCTTCGATTGTGACCCCAGGTCAGGCGGGACTACCCGCTGAGTTTAA |
Tordylium_apulum | 0.6724 | Tordylium | 0.82 | Apiaceae | 1 | Magnoliopsida | 1 | Plantae | 1 | 855 | CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCTGAGGGCACGTCTGCCTGGGTGTCACGCTTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGTCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTCGATTGTGACCCCAGGTCAGGCGGGACTACCCGCTGAGTTTAA |
Alternaria_tenuissima | 0.1932 | Alternaria | 0.5855 | Pleosporaceae | 0.6889 | Dothideomycetes | 0.801 | Fungi | 0.9 | 299 | GATCCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCTTTGGTATTCCAAAGGGCATGCCTGTTCGAGCGTCATTTGTACCCTCAAGCTTTGCTTGGTGTTGGGCGTCTTGTCTCTAGCTTTGCTGGAGACTCGCCTTAAAGTAATTGGCAGCCGGCCTACTGGTTTCGGAGCGCAGCACAAGTCGCACTCTCTATCAGCAAAGGTCTAGCATCCATTAAGCCTTTTTTTCAACTTTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAG |
Schoenus_nigricans | 1 | Schoenus | 1 | Cyperaceae | 1 | Liliopsida | 1 | Plantae | 1 | 1538 | CCGCGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAGGGATCCGCCCGAGGGCACGCCTGCCTCATGGGCGTTAGAAGCCCATCCACGCTCGGGAGCCTAGCTACTTGGCCAGCCCCGATGCGGATCGTGGCCCTCCGAGCCCTAGGGCGCGGTGGGCCCAAGTGCGCGGCCGTCCGAAGGAGCCGGGAGCGGCGAGTGGTGGAATGCTGCGCGCGCCGTCCCGGGACCCCTGCCGGCATATGGCTTTGTCCGACCCTCGACGAGGAGCCGCGTCGCCTTCGAAAGGAGTGCGGCATTCTCAGATCGATACCCCAGGTCAGGCGGGGCTACCCGCTGAGTTTAA |
Solanum_citrullifolium | 0.1263 | Solanum | 0.308 | Solanaceae | 0.56 | Magnoliopsida | 1 | Plantae | 1 | 79 | CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCGTCAGGCCGAGGGCACGTCTGCCTGGGCGTCACGCATCGCGTCGCCCCCCGCACGCCGCTCGGCGTCGCGGGGGCGGATACTGGCCCCCCGTGCGCCCCCCGCGCGCGGCCGGCCTAAATGCGAGCCCGCGCCGACGGACGTCGCGGCGATTGGTGGTTGTATCTCAACTCTCTTCGCGCCGCGGCCGCAGCCCGTCGTGCGTGCGCGCTCCCCGACCCTCAAAGCGCCTCGCGCGCTCCGACCGCGACCCCAGGTCAGGCGGGATTACCCGCTGAGTTTAA |
Nemania_serpens | 0.6069 | Nemania | 0.7225 | Xylariaceae | 0.85 | Sordariomycetes | 1 | Fungi | 1 | 70 | CAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCACTAGTATTCTGGTGGGCATGCCTGTTCGAGCGTCATTTCAACCCTTAAGCCCCTGTTGCTTAGCGTTAGGAGCCTACCGGAACTCTCTGGTAGCTCCCCAAAGTCAGTGGCGGAGCCGGTTCGCACTCCAGACGTAGTAGCTTTTACACGTCGCCTGTAGCGCGGGCCGGTCCCCTGCCGTAAAACACCCCAATTTTTATAGGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAA |
Rosa_hybrid | 0.71 | Rosa | 1 | Rosaceae | 1 | Magnoliopsida | 1 | Plantae | 1 | 366 | CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCCGAGGGCACGTCTGCCTGGGCGTCACACGTCGTTGCCCCCCCCCAACCCCCTCGGGAGTTGGATGGGACGGATGATGGCCTCCCGTGTGCTCAGTCACGCGGTTGGCATAAATACCAAGTCCTCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTTTCCTGTCGTGCGCGCGTGTTGATCGAGTGCTTTCTTAAACAATGCGTGTCGATTCGTCGATGCTTTCAACGCGACCCCAGGTCAGGCGGGGTTACCCGCTGAATTTAA |
Rosa_hybrid | 0.78 | Rosa | 1 | Rosaceae | 1 | Magnoliopsida | 1 | Plantae | 1 | 85 | CCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAAGCCATTAGGCCGAGGGCACGTCTGCCTGGGCGTCACACGTCGTTGCCCCCCCCCAACCCCCTCGGGAGTTGGATGGGACGGATGATGGCCTCCCGTGTGCTCAGTCACGCGGTTGGCATAAGTACCAAGTCCTCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTTTCCTGTCGTGCGCGCGTGTTGATCGAGTGCTTTCTTAAACAATGCGTGTCGATCCGTCGATGCTTACAACGCGACCCCAGGTCAGGCGGGGTTACCCGCTGAATTTAA |

SINTAX (with trimming):

spec.n | spec.s | gen.n | gen.s | fam.n | fam.s | class.n | class.s | king.n | king.s | seq_id | seq |
---|---|---|---|---|---|---|---|---|---|---|---|
Arabis_verna | 1 | Arabis | 1 | Brassicaceae | 1 | Magnoliopsida | 1 | Plantae | 1 | 76 | AACGTCGTCCCCATCCTTTTCGGAGAAGGGACGGAAGCTGGTCTCCCGTGTGTTACCGCATGCGGTTGGCTAAAATCCGAGCTGAGGATGCCTTGAGCGTCTCGACATGCGGTGGTGAAATAAAGCCTCGTAATACTGTCGGTCGCTTTTGTCTGAATGCTCTTGATGACCCAACATCCTTAACGCGACCCCAGGTCAGGCGGGATCAC |
Schoenus_nigricans | 1 | Schoenus | 1 | Cyperaceae | 1 | Liliopsida | 1 | Plantae | 1 | 1538 | GCCCATCCACGCTCGGGAGCCTAGCTACTTGGCCAGCCCCGATGCGGATCGTGGCCCTCCGAGCCCTAGGGCGCGGTGGGCCCAAGTGCGCGGCCGTCCGAAGGAGCCGGGAGCGGCGAGTGGTGGAATGCTGCGCGCGCCGTCCCGGGACCCCTGCCGGCATATGGCTTTGTCCGACCCTCGACGAGGAGCCGCGTCGCCTTCGAAAGGAGTGCGGCATTCTCAGA |
Tordylium_apulum | 1 | Tordylium | 1 | Apiaceae | 1 | Magnoliopsida | 1 | Plantae | 1 | 855 | TTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGTCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTCGA |
Tordylium_apulum | 1 | Tordylium | 1 | Apiaceae | 1 | Magnoliopsida | 1 | Plantae | 1 | 929 | TTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGCCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTGCACCACATTGACTGCTCTTCGA |
Tordylium_apulum | 1 | Tordylium | 1 | Apiaceae | 1 | Magnoliopsida | 1 | Plantae | 1 | 52757 | TTTGACTTGCCCCCAACTACACACTCCTTGAGGAGCTGTGCTTGTTTGGGGGCGGAAACTGGCCTCCCGTGCTTCTTGCGCGGTTGGCAAAAAAGCGAGTCTCCGGCTACGGACGCCGTGACATTGGTGGTTGTAAAGACCTTCTTGTATTGTCGGGCGTATCCGGGCCATCCTAGCGAGCTCCAGGACCCTTAGGTGCAGCCACATTGACTGCTCTTCGA |
Alternaria_tenuissima | 0.2137 | Alternaria | 0.7122 | Pleosporaceae | 0.8186 | Dothideomycetes | 0.9409 | Fungi | 0.97 | 299 | GTACCCTCAAGCTTTGCTTGGTGTTGGGCGTCTTGTCTCTAGCTTTGCTGGAGACTCGCCTTAAAGTAATTGGCAGCCGGCCTACTGGTTTCGGAGCGCAGCACAAGTCGCACTCTCTATCAGCAAAGGTCTAGCATCCATTAAGCCTTTTTTTCAAC |
Solanum_citrullifolium | 0.87 | Solanum | 1 | Solanaceae | 1 | Magnoliopsida | 1 | Plantae | 1 | 79 | ATCGCGTCGCCCCCCGCACGCCGCTCGGCGTCGCGGGGGCGGATACTGGCCCCCCGTGCGCCCCCCGCGCGCGGCCGGCCTAAATGCGAGCCCGCGCCGACGGACGTCGCGGCGATTGGTGGTTGTATCTCAACTCTCTTCGCGCCGCGGCCGCAGCCCGTCGTGCGTGCGCGCTCCCCGACCCTCAAAGCGCCTCGCGCGCTCCGA |
Rosa_hybrid | 0.2 | Rosa | 1 | Rosaceae | 1 | Magnoliopsida | 1 | Plantae | 1 | 366 | GTCGTTGCCCCCCCCCAACCCCCTCGGGAGTTGGATGGGACGGATGATGGCCTCCCGTGTGCTCAGTCACGCGGTTGGCATAAATACCAAGTCCTCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTTTCCTGTCGTGCGCGCGTGTTGATCGAGTGCTTTCTTAAACAATGCGTGTCGATTCGTCGATGCTTTCA |
Nemania_serpens | 0.512 | Nemania | 0.64 | Xylariaceae | 0.8 | Sordariomycetes | 1 | Fungi | 1 | 70 | CAACCCTTAAGCCCCTGTTGCTTAGCGTTAGGAGCCTACCGGAACTCTCTGGTAGCTCCCCAAAGTCAGTGGCGGAGCCGGTTCGCACTCCAGACGTAGTAGCTTTTACACGTCGCCTGTAGCGCGGGCCGGTCCCCTGCCGTAAAACACCCCAATTTTTATA |
Rosa_hybrid | 0.4 | Rosa | 1 | Rosaceae | 1 | Magnoliopsida | 1 | Plantae | 1 | 85 | GTCGTTGCCCCCCCCCAACCCCCTCGGGAGTTGGATGGGACGGATGATGGCCTCCCGTGTGCTCAGTCACGCGGTTGGCATAAGTACCAAGTCCTCGGCGACCAACGCCACGACAATCGGTGGTTGTCAAACCTCGGTTTCCTGTCGTGCGCGCGTGTTGATCGAGTGCTTTCTTAAACAATGCGTGTCGATCCGTCGATGCTTACA |

IDTAXA (with trimming):

confidence | kingdom | class | family | genus | species |
---|---|---|---|---|---|
100.0% | Plantae | Magnoliopsida | Apiaceae | Tordylium | apulum |
86.6% | Plantae | Liliopsida | Cyperaceae | Schoenus | nigricans |
76.0% | Plantae | Magnoliopsida | Brassicaceae | Arabis | verna |
61.5% | Plantae | Magnoliopsida | Solanaceae | Solanum | citrullifolium |
95.0% | Plantae | Magnoliopsida | Apiaceae | Tordylium | apulum |
100.0% | Plantae | Magnoliopsida | Apiaceae | Tordylium | apulum |
86.1% | Fungi | Dothideomycetes | Pleosporaceae | Alternaria | unclassified_Alternaria |
97.0% | Plantae | Magnoliopsida | Rosaceae | Rosa | unclassified_Rosa |
93.3% | Plantae | Magnoliopsida | Rosaceae | Rosa | unclassified_Rosa |
54.7% | Fungi | Sordariomycetes | Xylariaceae | Nemania | serpens |

PS: I don't know how SINTAX does it, but it is possible to do a leave-one-out cross-validation with some undocumented commands: sintax_its2_bootstrap_distr_fam_per_spec_new.pdf Just ignore the triangles. The red dots are incorrectly assigned sequences, and each dot is one sequence (drawn with alpha because of overlap). The y-axis shows the confidence level that gets reported to the user. So basically every red dot above 0.8 is bad. That makes it easy to check how messy your analysis gets if you start to accept lower confidence scores as "truth". It would be great to have that for IDTAXA.
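The "red dots above 0.8" check can be expressed numerically. The sketch below assumes you already have validation results as `(true_label, predicted_label, confidence)` tuples from whichever classifier you are testing. The function name and the example values are hypothetical.

```python
def errors_above(results, threshold=0.8):
    """Count assignments accepted at `threshold` and how many of them are wrong.

    results: iterable of (true_label, predicted_label, confidence) tuples.
    Returns (number_accepted, number_wrong_among_accepted).
    """
    accepted = [(true, pred) for true, pred, conf in results if conf >= threshold]
    wrong = sum(1 for true, pred in accepted if true != pred)
    return len(accepted), wrong


# Made-up validation results, purely for illustration.
validation = [
    ("Apiaceae", "Apiaceae", 0.95),
    ("Apiaceae", "Rosaceae", 0.90),   # a confident but wrong call
    ("Rosaceae", "Rosaceae", 0.50),
    ("Fabaceae", "Apiaceae", 0.30),
]
strict = errors_above(validation)                 # default threshold 0.8
lenient = errors_above(validation, threshold=0.2)
```

Comparing the two tuples shows how the number of wrong calls you accept grows as the threshold is lowered.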
The way SINTAX and the RDP Classifier work does not require trimming of the sequences. IDTAXA does because it incorporates distance into its confidence. I will add that information to the documentation in the next release. Thank you for the suggestion.
It is not possible to do quick leave-one-out validation with IDTAXA because of the way the algorithm is constructed. Again, this is different from SINTAX and the RDP Classifier. With IDTAXA you have to remove each sequence, retrain with `LearnTaxa()`, and test with `IdTaxa()`. This is because the learning step is non-trivial with IDTAXA and there is no way to simply remove a reference sequence after the fact. However, you could set up a straightforward pipeline to perform relatively quick k-fold cross-validation with all algorithms. This is arguably more realistic than leave-one-out cross-validation anyhow.
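A minimal sketch of such a k-fold pipeline, written in Python for brevity. The `train` and `classify` callables are placeholders that you would replace with wrappers around the real training and classification steps (for IDTAXA, the retrain-then-test cycle described above). The toy versions below only exist so that the example runs, and all names are assumptions.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(seqs, labels, train, classify, k=5, seed=0):
    """Hold out each fold in turn, train on the rest, classify the held-out part.

    Returns a list of (index, true_label, predicted_label, confidence).
    """
    results = []
    for fold in k_fold_indices(len(seqs), k, seed):
        held_out = set(fold)
        keep = [i for i in range(len(seqs)) if i not in held_out]
        model = train([seqs[i] for i in keep], [labels[i] for i in keep])
        for i in fold:
            predicted, confidence = classify(model, seqs[i])
            results.append((i, labels[i], predicted, confidence))
    return results


# Toy stand-ins (NOT a real classifier): exact-match lookup.
def toy_train(seqs, labels):
    return dict(zip(seqs, labels))


def toy_classify(model, seq):
    label = model.get(seq)
    return (label, 1.0) if label is not None else ("unclassified", 0.0)
```

Because every algorithm sees the same folds for a given `seed`, the resulting confidence-versus-correctness tables are directly comparable across classifiers.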
Thank you for your feedback! I cannot put my finger on it but now that I trimmed all sequences down to ITS2 there is still some unwanted behaviour.
TL;DR: ITS has pretty large genetic distances. If the species of the query sequence is missing from the database, the next best match within the genus sometimes has up to 15% genetic distance. This has biological reasons. It seems to confuse IDTAXA into believing that there is nothing closely resembling the query in the database. As a result, the assignment score at kingdom level drops below 50 in quite a lot of cases, making the sequence classification worthless for downstream analysis. Is it possible that IDTAXA was built with 16S in mind, which has much lower genetic distances? In that case it would benefit from an exploratory phase in which the algorithm samples some genetic distances to estimate at which distance a sequence should be expected to belong to another family. I understand you need to make some assumptions for this, but otherwise I cannot see it performing well in plants or with other markers with high genetic variation.
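To make the "exploratory phase" idea concrete, here is a rough sketch that samples sequence pairs from a labelled reference set and compares typical distances within and between families. It is not part of IDTAXA. It uses a k-mer Jaccard distance as a crude, alignment-free proxy for genetic distance, and every name and default value is an assumption.

```python
import random
from itertools import combinations
from statistics import median


def kmer_distance(a, b, k=5):
    """Jaccard distance between the k-mer sets of two sequences (0 = identical sets)."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    if not union:
        return 0.0
    return 1 - len(ka & kb) / len(union)


def distance_profile(records, k=5, max_pairs=1000, seed=0):
    """Median k-mer distance for same-family and different-family pairs.

    records: list of (family, sequence) tuples. At most `max_pairs`
    randomly sampled pairs are compared to keep the survey cheap.
    """
    pairs = list(combinations(range(len(records)), 2))
    if len(pairs) > max_pairs:
        pairs = random.Random(seed).sample(pairs, max_pairs)
    within, between = [], []
    for i, j in pairs:
        d = kmer_distance(records[i][1], records[j][1], k)
        (within if records[i][0] == records[j][0] else between).append(d)
    return {
        "within_family": median(within) if within else None,
        "between_family": median(between) if between else None,
    }
```

If the two medians are close together for a marker, a distance-based confidence will have little room to separate "new species in a known family" from "unknown family", which is the situation described above.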
Here is my trained database (use `load()`); it contains fungal and plant ITS2 sequences: https://easyupload.io/pce2tv
Okay, so this might be because my database is a bit noisy, but I am a plant person and I have no choice. I am using this database: http://its2.bioapps.biozentrum.uni-wuerzburg.de/
I am getting a lot of assignments like this: [1] "Root [57.2%]; unclassified_Root [57.2%]"
And I am getting a lot of assignments even to kingdom level that do not meet the default confidence level of 60: "Root [52.3%]; Plantae [52.3%]; Magnoliopsida [51.9%]; Rubiaceae [47.9%]; Sherardia [45.8%]; arvensis [45.8%]"
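For readers who want to process such strings downstream, the following sketch parses an assignment line of the form shown above and keeps the ranks that meet a threshold, stopping at the first rank that falls below it. The parsing is based only on the two example strings in this post. The function names are made up.

```python
import re

# One "Name [12.3%]" element of a semicolon-separated assignment string.
RANK_PATTERN = re.compile(r"([^;\[]+)\[([\d.]+)%\]")


def parse_assignment(text):
    """Split an assignment string into (taxon, confidence) pairs."""
    return [(m.group(1).strip(), float(m.group(2))) for m in RANK_PATTERN.finditer(text)]


def ranks_above(text, threshold=60.0):
    """Return the leading taxa whose confidence is at least `threshold`."""
    kept = []
    for taxon, confidence in parse_assignment(text):
        if confidence < threshold:
            break  # deeper ranks cannot be trusted once a rank fails
        kept.append(taxon)
    return kept


example = ("Root [52.3%]; Plantae [52.3%]; Magnoliopsida [51.9%]; "
           "Rubiaceae [47.9%]; Sherardia [45.8%]; arvensis [45.8%]")
at_default = ranks_above(example)        # threshold 60
at_fifty = ranks_above(example, 50.0)
```

With the threshold of 60 nothing survives for this query, while a threshold of 50 keeps the assignment down to Magnoliopsida.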
When I check the sequences in BLAST I get a clear assignment, at least to family level. My guess is that some sequences are incorrectly annotated and not removed by the IDTAXA learning phase (i.e. plant sequences assigned to fungi). I tried to debug the function that assigns taxonomy, but it's too difficult for me. I followed the tutorial. I tried both options for `allowGroupRemoval`.

Here is my trained database (use `load()`): https://easyupload.io/pce2tv

Here are the two examples from above:
GCAGGATCCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCGAGGGGCATGCCTGTTCGAGCGTCATTACACCACTCAAGCTATGCTTGGTATTGGGCGTCGTCCTTAGTTGGGCGCGCCTTAAAGACCTCGGCGAGGCCACTCCGGCTTTAGGCGTAGTAGAATTTATTCGAACGTCTGTCAAAGGAGAGGAACTCTGCCGACTGAAACCTTTATTTTTCTAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAGCATA
CCGTGAATCATCGAGTTTTTGAACGCAAGTTGCGCCCGAGGCCACCCGGCCGAGGGCACGTCTGCCTGGGCGTCACGCATCCCGTCGTCGCACCAAGTCTTGCTTGGCGCGGCGGAAGTTGGCCTCCCGTTCCCCCCGCGGCGCGGTTGGCCCAAATGCGAGTCCCCGGACAAGGGACGTCACGACTTCAGGTGGTTGAAATCACTTTCATTCTCGCTCCGAGTCTTGACGATCCCCCTGTGGTTATATAACGACCCTAGAGCCTTCACCGCGCTCACTGGGTTGAGTTCAACGAGATCCTCGAACGCGACCCCGGGTCAGGCGGGATCACCCACTAAGTTTAA
Thank you @digitalwright for allowing me to ping you here. Best wishes, Andreas