A table matching sequences with the taxonomic assignment

irc47 commented 5 months ago

Is your feature related to a problem?

I often want to find the sequence that was used for a particular assignment. I especially want to be able to do this when using the workflow with students, to help "demystify" the process and to have them look at sequence similarities. I'd like to be able to get this information without using the BAM files.

Describe the solution you'd like

Is it possible to add an option that outputs a simplified TSV with NCBI accession, query sequence (and maybe mapping quality), in addition to a reference table that provides the taxonomy associated with each NCBI accession?

Describe alternatives you've considered

It is possible to do this myself and have this workflow save the bam files and then pull the NCBI accession and query sequence from there, but I'm looking for an option for people not yet comfortable with the command line. Also, one challenge with this is the additional step of needing to match the NCBI accessions with the taxonomies, so it would be great if that were available from this output either in the same table or as a reference table.
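For readers following the manual route described above, a minimal sketch of pulling read ID, reference accession, mapping quality, and query sequence out of SAM text (for example, what `samtools view -h` prints for a BAM file) might look like the following. The function names, the output column headers, and the accession `ACC_0001.1` are illustrative assumptions and are not part of the workflow.

```python
import csv
import io


def sam_to_rows(sam_text):
    """Yield (read_id, accession, mapq, sequence) for mapped primary alignments."""
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        # 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary
        if flag & (0x4 | 0x100 | 0x800):
            continue
        yield fields[0], fields[2], int(fields[4]), fields[9]


def rows_to_tsv(rows):
    """Serialise the rows as a tab-separated table with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["read_id", "accession", "mapq", "sequence"])
    writer.writerows(rows)
    return out.getvalue()


# Made-up records; ACC_0001.1 is a placeholder, not a real accession.
example_sam = "\n".join([
    "@SQ\tSN:ACC_0001.1\tLN:1500",
    "read1\t0\tACC_0001.1\t1\t60\t8M\t*\t0\t0\tACGTACGT\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTGGGG\t*",
    "read3\t256\tACC_0001.1\t5\t0\t8M\t*\t0\t0\t*\t*",
])
table = rows_to_tsv(sam_to_rows(example_sam))
print(table)
```

Only primary alignments are kept here so that each read appears at most once; whether that matches the workflow's own filtering is not stated in this thread.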

Additional context

No response

nggvs commented 5 months ago

Hi @irc47, thank you for your suggestion! We'll consider it for future releases. I'm afraid that writing all the query sequences to a separate file would produce a massive file. Which approach would you like it for: minimap2 or kraken2? In the kraken2 approach, you can output a table with the reads and their taxonomy using --include_kraken2_assignments. Or do you mean a TSV with query ID, mapping quality, and taxonomy (not the taxID, but the full name)? Could you add an example of what you'd like so that I can understand it better?

Thank you very much!

irc47 commented 5 months ago

I have been using minimap2; I hadn't realized that kraken2 provides this option. I just gave that a try, but I would still prefer to use minimap2 in most cases. However, I think the kraken2-assignments output is pretty much exactly what I'm looking for: it has read ID, TaxID, and the taxonomic assignment. So the request would be for a similar option in minimap2.

What I find very challenging about using the bam outputs from minimap2 is that it seems to provide only the NCBI accession numbers and I haven't figured out how to do a bulk pull of the correlated taxonomies. A table similar to the Kraken2 table would be a great solution from my perspective, but I also wonder if the easiest solution would be to capture in an output the step where these accessions are translated to taxonomic assignment so that it could be used to decode the bam files?

--Ilana
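The "decoding" step asked about above amounts to a lookup from accession to taxonomy. A minimal sketch, assuming a hypothetical reference TSV with `accession`, `taxid`, and `lineage` columns (this layout, the function names, and the `ACC_…` accessions are assumptions, not the workflow's actual files):

```python
import csv
import io


def load_reference(tsv_text):
    """Map each accession to its (taxid, lineage) pair."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["accession"]: (row["taxid"], row["lineage"]) for row in reader}


def annotate(assignments, reference):
    """Attach taxid and lineage to (read_id, accession) pairs."""
    annotated = []
    for read_id, accession in assignments:
        # Accessions missing from the reference are kept and flagged, not dropped.
        taxid, lineage = reference.get(accession, ("NA", "unclassified"))
        annotated.append((read_id, accession, taxid, lineage))
    return annotated


# Placeholder accessions; the layout of this table is an assumption.
reference_tsv = (
    "accession\ttaxid\tlineage\n"
    "ACC_0001.1\t562\tEscherichia;Escherichia coli\n"
    "ACC_0002.1\t1423\tBacillus;Bacillus subtilis\n"
)
reference = load_reference(reference_tsv)
result = annotate([("read1", "ACC_0001.1"), ("read2", "ACC_9999.1")], reference)
for row in result:
    print("\t".join(row))
```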

nggvs commented 4 months ago

Hi @irc47, I have added this to our list of possible additional features, so I'll close the issue in the meantime. Thank you for using the workflow!

rocherbpb commented 3 months ago

Hi @nggvs, can you tell me what the status is of the read ID/TaxID table option for the minimap2 classifier? Also, are you considering including a consensus taxonomy option like LCA for the minimap2 classifications?
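To illustrate what a lowest common ancestor (LCA) consensus means when a read has several candidate hits, here is a minimal sketch that takes the longest shared prefix of semicolon-separated lineages. The function name and lineage format are assumptions for illustration; nothing in this thread indicates how, or whether, the workflow computes an LCA for minimap2 results.

```python
def lowest_common_ancestor(lineages):
    """Return the shared leading ranks of several ';'-separated lineages."""
    if not lineages:
        return []
    split = [lineage.split(";") for lineage in lineages]
    common = []
    # zip stops at the shortest lineage, so ranks are compared position by position.
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break  # the hits disagree from this rank downward
        common.append(ranks[0])
    return common


hits = [
    "Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;"
    "Enterobacteriaceae;Escherichia;Escherichia coli",
    "Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;"
    "Enterobacteriaceae;Salmonella;Salmonella enterica",
]
lca = lowest_common_ancestor(hits)
print(";".join(lca))
```

With these two hits the read would be reported at family level (Enterobacteriaceae) instead of being assigned to either species.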

nggvs commented 2 months ago

Hi @rocherbpb and @irc47! The latest version, 2.11.0, outputs these tables when using the --include_read_assignments option.

irc47 commented 1 month ago

Thank you, this is great!

irc47 commented 1 month ago

Did you consider also including a column for read quality score and one for mapping quality (or, if not an overall mapping score, then one for the mapping coverage and one for percent identity)? That would be a further enhancement that would make this table even more useful.
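For context on what those two columns could contain, here is a minimal sketch that derives percent identity and query coverage from a SAM CIGAR string and the `NM` (edit distance) tag. The definitions used are one common convention and are an assumption, not the workflow's: identity is (alignment columns minus NM) over alignment columns, and coverage is aligned read bases over full read length including clipped bases. The function name is hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_stats(cigar, edit_distance):
    """Return (percent_identity, percent_query_coverage) for one alignment."""
    counts = {}
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] = counts.get(op, 0) + int(length)
    matched = sum(counts.get(op, 0) for op in "M=X")  # aligned base pairs
    inserted = counts.get("I", 0)
    deleted = counts.get("D", 0)
    columns = matched + inserted + deleted  # alignment columns, gaps included
    query_aligned = matched + inserted      # read bases inside the alignment
    read_length = query_aligned + counts.get("S", 0) + counts.get("H", 0)
    if columns == 0 or read_length == 0:
        return 0.0, 0.0
    identity = 100.0 * (columns - edit_distance) / columns
    coverage = 100.0 * query_aligned / read_length
    return identity, coverage


# 100-base read: 5 soft-clipped bases at each end, 100 alignment columns, NM = 20.
identity, coverage = alignment_stats("5S85M5I10D5S", 20)
print(f"identity={identity:.1f}% coverage={coverage:.1f}%")
```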

nggvs commented 1 month ago

Hi @irc47! Glad you like the feature! Please open a new issue for new feature requests so that I can track them. Thank you very much for using the workflow!

epi2me-labs / wf-metagenomics