JensUweUlrich / Taxor

Fast and space-efficient taxonomic classification of long reads
BSD 3-Clause "New" or "Revised" License

Visualization of the results file #11

Closed. DuttaAnik closed this issue 3 weeks ago

DuttaAnik commented 1 month ago

Hi, thanks for developing the tool. It looks great. I was wondering whether there is a way to visualize the results with Pavian or Krona, as is possible for Kraken reports.

JensUweUlrich commented 1 month ago

Hi @DuttaAnik

There is a way to visualize the results with Krona plots by simply counting the number of reads assigned to each taxID, which is the input for KronaTools. This works fine for estimating species abundance in short-read Illumina datasets, where all reads have the same length. For long nanopore reads, which span a broad range of read lengths, it will bias your species abundance analysis. Taxor's profiling module does not count reads to get the sequence abundance; it counts the number of base pairs, and this metric is not displayed correctly in Krona or Pavian. The only workaround, at least for Krona, would be to take the species-level percentage values from the sequence abundance file and multiply them by 100. This would give you normalized fake read numbers, which should then visualize the abundances correctly in Krona. But I have not tried that out yet.
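The workaround described above could be sketched roughly as follows in Python. This is an untested illustration, not part of Taxor: the function names, the `(lineage, percent)` input shape, the `;` lineage separator, and the example values are all assumptions, and the actual columns of Taxor's sequence abundance file would have to be mapped onto them by hand. The output follows the tab-separated text layout that KronaTools' `ktImportText` reads (a quantity followed by the lineage levels).

```python
from typing import Iterable, List, Tuple


def pseudo_read_counts(
    abundances: Iterable[Tuple[str, float]], scale: int = 100
) -> List[Tuple[int, str]]:
    """Turn species-level percentages into normalized fake read counts.

    `abundances` holds hypothetical (lineage, percent) pairs; multiplying
    by 100 follows the suggestion in the comment above.
    """
    rows = []
    for lineage, percent in abundances:
        count = round(percent * scale)
        if count > 0:  # skip species that would get zero fake reads
            rows.append((count, lineage))
    return rows


def to_krona_text(rows: Iterable[Tuple[int, str]], sep: str = ";") -> str:
    """Format rows as tab-separated lines: quantity, then one field per lineage level."""
    lines = []
    for count, lineage in rows:
        levels = [level.strip() for level in lineage.split(sep) if level.strip()]
        lines.append("\t".join([str(count)] + levels))
    return "\n".join(lines) + "\n"


# Made-up example values, not real Taxor output.
example = [
    ("Bacteria;Proteobacteria;Escherichia;Escherichia coli", 62.5),
    ("Bacteria;Firmicutes;Bacillus;Bacillus subtilis", 37.25),
    ("Bacteria;Firmicutes;Listeria;Listeria monocytogenes", 0.0),
]
krona_text = to_krona_text(pseudo_read_counts(example))
print(krona_text, end="")
```

Writing `krona_text` to a file and passing that file to `ktImportText` should then produce a Krona chart whose wedge sizes reflect base-pair abundance rather than raw read counts, assuming the percentages were copied correctly from the abundance file.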

DuttaAnik commented 1 month ago

Hi @JensUweUlrich, thanks for the explanation. I have a related question. After running taxor search on long-read sequence data, I can see the desired organisms (at species level) in the report file. But when I check the abundance file after running taxor profile, I do not see all of the desired organisms. Is that because of the broad range of read lengths?

JensUweUlrich commented 3 weeks ago

@DuttaAnik This is a problem you will encounter with almost every taxonomic classifier that works directly with reads as input. When you use a database containing genomes that are very similar to each other, some reads can be classified incorrectly. Many tools, such as KMCP, MetaMaps and Taxor, try to get around this by using an EM algorithm to re-assign those reads. But sometimes that does not resolve the issue, and in Taxor's case all reads that belong to the correct species end up assigned to a very similar species. That is also the reason why I would never use k-mer based taxonomic classification for subspecies or strain-level identification. In the cases where I have seen this behavior, the genome of the species reported by Taxor was almost identical to that of the desired species. As a workaround, you can try adjusting the error-rate or percentage thresholds in taxor search, or using larger k-mer sizes in taxor build, which could help increase the specificity of the results.