Coloring strain plots and why I decided not to implement it for now

Over in #50, I implemented a function that works with the sourmash taxonomy annotate output to detect whether multiple strains of a given species in a metagenome sample have multiple strains present or not. I toyed with the idea of trying to count the number of strains likely present mostly by clustering the abundances of the matched genomes. I would then color each matched genome by the strain that I guessed it belonged to (strain1, strain2, strain3, etc.). I've decided to punt on this for now because I don't think the gather/taxonomy output have enough information to do this well -- While different strains may sometimes cluster by abundance, I think it's likely that the first genome match will scoop in k-mers from multiple strains, and because we mostly report average k-mer abundance in the gather output, deconvolving these abundances is basically impossible. I think the right thing to do here would be to take a genome-grist esque approach where for a given set of genome matches within a species, we download all of them and iteratively map k-mers or reads to those genomes. Then we could use an expectation maximization algorithm to assign k-mers/reads or a genome. Alternatively, we could align to everything at once and then still use an EM algorithm that takes advantage of all the read mapping info to do the assignation. This would be a big lift for relatively little payoff -- the perk of this sourmash approach is that it's fast, and the idea is that you could use it to detect strain variation and then used heavier tools to dig in. Do a big mapping and then EM would be a big separate endeavor.

Dumping some code i ripped out of the function that dealt with abundances and trying to guess how many strains were present

abundances

  # ABUNDANCE -- I don't actually know what to do here, so to start,
  # I'm just coding to flag species where average kmer abundances for genomes deviate by more than 2.
  # average_abund <- taxonomy_annotate_df %>%
  #   dplyr::filter(.data$species %in% more_than_one_genome_observed_for_species$species) %>% # filter to species with more than one genome observed
  #   dplyr::group_by(query_name, species) %>%
  #   dplyr::summarise(min_average_abund = min(average_abund),
  #                    max_average_abund = max(average_abund),
  #                    sd_average_abund = sd(average_abund)) %>%
  #   dplyr::mutate(range_average_abund = max_average_abund - min_average_abund)

  #average_abund_filtered <- average_abund %>%
  #  dplyr::filter(range_average_abund >= 10)

guessing strains present

  # the below code won't work, but I think this logic could be used to draw delineations to count the number of strains.
  # I'm not totally sure yet how to get this logic to work with facet_wrap() to show colors --
  # probably something like calculating it in a different data frame and then joining it to the df that's plotted.
  # I would probably make a column like "strain" where I would label each dot "strain1", "strain2", etc. based on which intervals in the sd that the abundance falls in.
  # seq(average_abund$min_average_abund, average_abund$max_average_abund, by = average_abund$sd_average_abund)

  # I think I could also use logic to label potential prophages -- something like less than 3% of the genome with >100 more abundant than any other match for that species.
  # I need to validate this first though, potentially using SRR492184 Enterococcus faecalis.
  # Genome-grist on this sample would probably be the easiest thing to do.

  # plot_df <- taxonomy_annotate_df %>%
  #   dplyr::mutate(query_name_species = paste0(.data$query_name, "-", .data$species)) %>%
  #   dplyr::filter(.data$query_name_species %in% f_match_filtered$query_name_species)
  #
  # # label with strain count guesses before plotting
  # for(query_name_species in unique(plot_df$query_name_species)){
  #   print(query_name_species)
  # }

Arcadia-Science / sourmashconsumr

Coloring strain plots and why I decided not to implement it for now #51