[note: this is an experimental/draft PR and should not be merged as-is]
ref #72
In sourmash taxonomy, we're adding utils to use the LIN taxonomic framework, which allows for greater flexibility and specificity compared with standard taxonomic ranks. For example, if only certain strains of a microbe are pathogenic, the LIN framework may be useful for identifying/grouping pathogenic vs non-pathogenic strains. Is this something you're interested in allowing for viz?
We had a question about whether sourmashconsumr would work with LIN lineages for e.g. sankey plots, so I decided to experiment a little to see how easy/hard it would be to allow LIN functionality.
This PR has LINs partially working for:
tax_glom_taxonomy_annotate
plot_taxonomy_annotate_sankey
plot using sourmash test data from the lins PR (tests/test-data/tax/test1.gather.csv annotated with tests/test-data/tax/test.LIN-taxonomy.csv):
Challenges and thoughts
LIN positions are not always a set length, though I believe 20 positions is currently the LINbase standard. Selecting a default to summarize if the user doesn't provide one is currently hacky and would need some thought.
LIN position "names" (numbers) aren't terribly helpful for visualizations.
"LINgroups" are defined LIN prefixes that have some useful meaning (e.g. we may have named groups for the 0;0;1 prefix vs the 0;1;1 prefix). These groups are given names, which would be a bit nicer for plotting. However, using LINgroup names would currently require reading in a separate lingroup file or using the lingroup report from tax metagenome.
Thinking about it more, the sankey doesn't actually make sense as it is. Once a lineage diverges, it should never come back together! We might need to work with the full prefix at each rank/position (e.g. 0;1;1) rather than the individual value at that position.
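To make the last point concrete, here is a rough sketch of building sankey links from cumulative LIN prefixes instead of per-position values. It is written in Python purely for illustration (sourmashconsumr itself is R), the function names `lin_prefixes` and `sankey_links` are hypothetical and not part of sourmash or sourmashconsumr, and the LINs and weights are made up. It assumes LINs are `;`-separated strings.

```python
from collections import Counter


def lin_prefixes(lin, max_pos=None):
    """Return the cumulative prefixes of a ';'-separated LIN string.

    Hypothetical helper for illustration only.
    """
    positions = lin.split(";")
    if max_pos is not None:
        positions = positions[:max_pos]
    return [";".join(positions[: i + 1]) for i in range(len(positions))]


def sankey_links(lin_weights, max_pos=None):
    """Sum weights over (parent prefix, child prefix) pairs.

    Because each node is a full prefix, two lineages that diverge at an
    early position can never share a node at a later position.
    """
    links = Counter()
    for lin, weight in lin_weights.items():
        prefixes = lin_prefixes(lin, max_pos)
        for source, target in zip(prefixes, prefixes[1:]):
            links[(source, target)] += weight
    return links


# Made-up LINs and fractions, not real gather results.
example = {"0;0;1": 0.5, "0;1;1": 0.3, "0;1;0": 0.2}
links = sankey_links(example)
for (source, target), weight in sorted(links.items()):
    print(f"{source} -> {target}: {weight:.2f}")
```

With per-position values, the third-position value `1` would be a single node shared by `0;0;1` and `0;1;1`, which is how diverged lineages end up merging in the current plot. With prefixes as node IDs, `0;0;1` and `0;1;1` stay separate nodes under their own parents.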
I am happy to work on this further or drop it, if this isn't something you want to allow!