epi2me-labs / wf-metagenomics

Metagenomic classification of long-read sequencing data

A table matching sequences with the taxonomic assignment #100

Open irc47 opened 1 month ago

irc47 commented 1 month ago

Is your feature related to a problem?

I often want to find the sequence that was used for a particular assignment, especially when using the workflow with students: having them look at sequence similarities helps "demystify" the process. I'd like to be able to get this information without using the BAM files.

Describe the solution you'd like

Would it be possible to add an option that outputs a simplified TSV with the NCBI accession, the query sequence, and maybe the mapping quality, along with a reference table that gives the taxonomy associated with each NCBI accession?

Describe alternatives you've considered

I can do this myself by having the workflow save the BAM files and then pulling the NCBI accession and query sequence from them, but I'm looking for an option for people not yet comfortable with the command line. A further challenge is the additional step of matching the NCBI accessions with their taxonomies, so it would be great if that were available from this output, either in the same table or as a reference table.
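For readers who want to try the manual route described above, here is a minimal Python sketch of pulling the read ID, reference accession, mapping quality and query sequence out of alignments. It is not part of the workflow. It assumes the BAM file has first been converted to SAM text (for example with `samtools view -h`), and the function names and example records are made up for illustration.

```python
import csv
import io


def sam_to_rows(sam_lines):
    """Yield (read_id, accession, mapq, sequence) for primary mapped alignments.

    Columns follow the SAM text format: QNAME, FLAG, RNAME, POS, MAPQ,
    CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL.
    """
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        # Note: for reverse-strand alignments (flag 0x10) SEQ is stored
        # reverse-complemented relative to the original read.
        yield fields[0], fields[2], int(fields[4]), fields[9]


def rows_to_tsv(rows):
    """Render the rows as a TSV string with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["read_id", "accession", "mapq", "sequence"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up SAM records; REF_A and REF_B stand in for NCBI accessions.
example_sam = [
    "@SQ\tSN:REF_A\tLN:1000",
    "read1\t0\tREF_A\t10\t60\t8M\t*\t0\t0\tACGTACGT\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCC\t*",
    "read3\t256\tREF_B\t5\t0\t8M\t*\t0\t0\t*\t*",
]
rows = list(sam_to_rows(example_sam))
tsv = rows_to_tsv(rows)
print(tsv)
```

Only the primary alignment of each mapped read is kept here, so each read appears at most once. That is a design choice for this sketch, not something the workflow is known to do.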

Additional context

No response

nggvs commented 1 month ago

Hi @irc47, thank you for your suggestion! We'll consider it for future releases. I'm afraid that writing all the query sequences to a separate file would produce a massive file. Which approach would you like it for: minimap2 or kraken2? In the kraken2 approach you can output a table with the reads and the taxonomy using --include_kraken2_assignments. Or do you mean a TSV with query ID, mapping quality and taxonomy (not the taxID, but the whole name)? Could you add an example of what you'd like so that I can understand it better?

Thank you very much!

irc47 commented 3 weeks ago

I have been using minimap2, and I hadn't realized that the kraken2 approach provides this option. I just gave it a try, but I would still prefer to use minimap2 in most cases. However, I think the kraken2-assignments output is pretty much exactly what I'm looking for: it has the read ID, the TaxID and the taxonomic assignment. So the request would be for a similar option in minimap2.

What I find very challenging about using the BAM outputs from minimap2 is that they seem to provide only the NCBI accession numbers, and I haven't figured out how to do a bulk pull of the corresponding taxonomies. A table similar to the kraken2 table would be a great solution from my perspective, but I also wonder whether the easiest solution would be to capture, as an output, the step where these accessions are translated to taxonomic assignments, so that it could be used to decode the BAM files?

--Ilana
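
The "decoding" step discussed in this thread can be sketched in Python as a simple lookup join. This is an illustration only, not the workflow's implementation. It assumes a hypothetical three-column, tab-separated reference table (accession, taxID, taxon name); the format of any file the workflow actually uses may differ, and all names and values below are made up.

```python
import csv


def load_accession_table(lines):
    """Parse 'accession<TAB>taxid<TAB>name' lines into a dict.

    The three-column layout is an assumption made for this sketch.
    """
    table = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        accession, taxid, name = row[0], row[1], row[2]
        table[accession] = (taxid, name)
    return table


def annotate(assignments, table):
    """Attach taxid and name to each (read_id, accession) pair.

    Accessions missing from the table are reported as "NA" rather than
    dropped, so every read still appears in the output.
    """
    annotated = []
    for read_id, accession in assignments:
        taxid, name = table.get(accession, ("NA", "NA"))
        annotated.append((read_id, accession, taxid, name))
    return annotated


# Made-up reference table and read assignments for illustration.
reference_lines = [
    "REF_A\t1001\tExample species A",
    "REF_B\t1002\tExample species B",
]
assignments = [("read1", "REF_A"), ("read2", "REF_C")]

table = load_accession_table(reference_lines)
annotated = annotate(assignments, table)
for record in annotated:
    print("\t".join(record))
```

The result has the same shape as the table requested above: read ID, accession, TaxID and taxon name, one row per read.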