Add a function for an alluvial plot when a time series metagenome was sequenced

Arcadia-Science / sourmashconsumr

Working with the outputs of sourmash in R

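library(dplyr)            # provides %>% and select()
library(sourmashconsumr)

# Read the sourmash taxonomy annotate CSVs for every SRR run
taxonomy_annotate_df <- read_taxonomy_annotate(Sys.glob("~/github/2022-prjna853785-sourmash/outputs/sourmash_taxonomy/SRR*lineages*csv"))

# Build the time data frame; its columns must be named query_name and time
tmp <- readr::read_csv("https://raw.githubusercontent.com/Arcadia-Science/2022-prjna853785-sourmash/main/inputs/metadata.csv") %>%
  select(query_name = run_accession, time = age_months)

# Draw the alluvial plot with taxa summarized at the genus level
plot_taxonomy_annotate_ts_alluvial(taxonomy_annotate_df, time_df = tmp, tax_glom_level = "genus")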

Addresses one piece of #35

Some things that could be improved. I'll open issues for them because I don't see the point in tackling them yet:

- I changed the tax glom function so that it can return either n_unique_kmers or f_unique_to_query. An if statement controls which variable is summarized and returned. If I add another variable later, I'll make this more sophisticated so the code chunks aren't copied and pasted, but I think it's good enough for now and is simpler to read.
- I made it so that time_df must have two hard-coded column names, query_name and time. I could make this more flexible, but I documented the behavior and provide hints to the user at runtime, so I think that's good enough for now; this strategy dramatically simplifies the code.
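
To illustrate the two points above, here is a minimal sketch in base R. It is not the code in this PR: `check_time_df()` and `glom_sketch()` are hypothetical names, and the sketch assumes the `lineage` column has already been truncated to the level being summarized.

```r
# Hypothetical helper: fail early, with a hint, if time_df lacks the
# two hard-coded column names.
check_time_df <- function(time_df) {
  required <- c("query_name", "time")
  missing_cols <- setdiff(required, colnames(time_df))
  if (length(missing_cols) > 0) {
    stop("time_df must contain columns named 'query_name' and 'time'; missing: ",
         paste(missing_cols, collapse = ", "),
         ". Hint: rename your columns before plotting.",
         call. = FALSE)
  }
  invisible(time_df)
}

# Hypothetical aggregation: an if statement picks which abundance column
# is summed per sample and lineage (assumed column names).
glom_sketch <- function(df, glom_var = c("n_unique_kmers", "f_unique_to_query")) {
  glom_var <- match.arg(glom_var)
  if (glom_var == "n_unique_kmers") {
    out <- aggregate(n_unique_kmers ~ query_name + lineage, data = df, FUN = sum)
  } else {
    out <- aggregate(f_unique_to_query ~ query_name + lineage, data = df, FUN = sum)
  }
  out
}
```

The two branches are near duplicates, which is the copy-and-paste trade-off described above; a third variable would be a natural point to build the formula from `glom_var` instead.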

One high-level question: the example plot includes a color for "other" genera. Was this defined manually somewhere, or does the function assign genera that fall below some percent abundance to "other"? A related suggestion: let the user show only the top X genera, species, etc., as the ampvis2 package does with tax_show (https://kasperskytte.github.io/ampvis2/articles/ampvis2.html), so that plots of complex communities don't become a mess.

Oooooh I had never seen ampvis2, I'll be using that as inspiration!

Right now it uses a fraction_threshold (0.01, or 1%, by default): if a microbe is present at 1% or greater at any point in the time series, it gets an alluvial ribbon in the plot. The user can set fraction_threshold to any value they want. Anything that does not get an alluvial ribbon is automatically lumped into "other" inside the function.
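
As a rough illustration of the thresholding described above, a base R sketch might look like the following. This is not the function's actual implementation: `lump_other_sketch()` and the `taxon` and `fraction` column names are assumptions made for the example.

```r
# Hypothetical sketch: taxa that never reach fraction_threshold in any
# sample are relabeled "other" and their fractions are summed per sample.
lump_other_sketch <- function(df, fraction_threshold = 0.01) {
  df$taxon <- as.character(df$taxon)
  # Largest fraction each taxon reaches in any sample
  max_frac <- tapply(df$fraction, df$taxon, max)
  keep <- names(max_frac)[max_frac >= fraction_threshold]
  # Everything below the threshold in every sample becomes "other"
  df$taxon <- ifelse(df$taxon %in% keep, df$taxon, "other")
  # Re-sum so each sample ends up with a single "other" row
  aggregate(fraction ~ query_name + time + taxon, data = df, FUN = sum)
}

# Toy input: B reaches 2% in one sample so it keeps its ribbon in both;
# C and D never reach 1% and are lumped together.
toy <- data.frame(
  query_name = rep(c("s1", "s2"), times = 4),
  time       = rep(c(1, 2), times = 4),
  taxon      = rep(c("A", "B", "C", "D"), each = 2),
  fraction   = c(0.5, 0.6, 0.005, 0.02, 0.004, 0.003, 0.002, 0.001)
)
lumped <- lump_other_sketch(toy)
```

Note that a taxon is kept for every sample once it passes the threshold in any one of them, so its ribbon stays continuous across the time series.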

I like the idea of tax_show. I'll make an issue and add this as an enhancement -- that way, users can either provide a list of taxa to tax_show or use fraction_threshold.
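
One possible shape for that enhancement, sketched in base R. Nothing here exists in the package yet: `select_taxa_sketch()`, its `tax_show` argument, and the `taxon` and `fraction` column names are all hypothetical. The sketch accepts either a number (the top X taxa) or a vector of taxon names, and falls back to fraction_threshold when tax_show is not given.

```r
# Hypothetical sketch: decide which taxa get their own ribbon.
select_taxa_sketch <- function(df, tax_show = NULL, fraction_threshold = 0.01) {
  taxa <- as.character(df$taxon)
  if (is.null(tax_show)) {
    # Default behavior: keep taxa reaching the threshold in any sample
    max_frac <- tapply(df$fraction, taxa, max)
    names(max_frac)[max_frac >= fraction_threshold]
  } else if (is.numeric(tax_show)) {
    # Top X taxa, ranked by total fraction across all samples
    totals <- tapply(df$fraction, taxa, sum)
    totals <- totals[order(totals, decreasing = TRUE)]
    head(names(totals), tax_show)
  } else {
    # Explicit vector of names; silently drop names absent from the data
    intersect(as.character(tax_show), unique(taxa))
  }
}
```

Taxa not returned by such a helper would then be lumped into "other", the same way the threshold route works today.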

Add a function for an alluvial plot when a time series metagenome was sequenced #37