Weighted Multi-Filtration

AABoyles commented 2 years ago

Description

The current landscape of SARS-CoV-2 Omicron/BA.2 is such that genetic diversity is very, very low. Accordingly, it would be desirable for MicrobeTrace to gracefully "identify" clusters based on additional, not-necessarily-genetic criteria. For example, suppose I have an epidemiological dataset that includes a location along with my genomic data. Two locations may have clusters composed entirely of samples with 0 genetic diversity, so they are clustered together at any distance threshold. We can mitigate this in part by drawing polygons around groups according to the location variable, but this may be unsatisfying in some cases.

Initiative / goal

I propose that MicrobeTrace add functionality to filter based upon a weighted sum of multiple transformed criteria simultaneously. This is a substantial departure from the existing model of univariate, distance-threshold-based filtering. I envision a UI somewhat similar to the controls in the Flow Diagram view (only not broken, ideally), where you can add an arbitrary number of fields on which to filter. For the numeric fields, the workflow should roughly match that of distance. For nominal fields (e.g. location, as in the example above), we should simply compute a new distance matrix corresponding to that node variable, where each entry is coded 0 if the values match and 1 if the values do not match (note that this violates intuition!).
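To make the nominal-field coding concrete, here is a minimal sketch in R (the language used for the demo later in this thread, not MicrobeTrace's own code). The node table, its column names, and the helper `nominal_distance` are all hypothetical.

```r
# Hypothetical node table; `id` and `facility` are illustrative names only.
nodes <- data.frame(
  id = c("A", "B", "C", "D"),
  facility = c("Prison", "Prison", "LTCF", "LTCF"),
  stringsAsFactors = FALSE
)

# Pairwise mismatch matrix: 0 where the two values match, 1 where they differ.
nominal_distance <- function(values, ids) {
  d <- outer(values, values, FUN = "!=") * 1  # logical matrix -> 0/1 numeric
  dimnames(d) <- list(ids, ids)
  d
}

facility_distance <- nominal_distance(nodes$facility, nodes$id)
print(facility_distance)
```

The result is symmetric with a zero diagonal, so it can be treated like any other distance matrix. Missing values would produce `NA` entries here and would need their own rule.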

Each field should also have a slider indicating its relative importance in the distance computation. For example, if location should dominate genetic diversity (again, as in the above case), then the location slider should be increased (or the distance slider decreased) until the two locations can be clearly discerned as distinct clusters.

The computations to make this effective and intuitive are, annoyingly, unintuitive and non-trivial. What you'll need to do is normalize all numeric values to a common 0-1 scale. Then, the slider should represent an exponential weight, which varies in strength based on the number of variables used (e.g. something like the max weight is 10 for 2 variables, 100 for 3, 1000 for 4, etc). This will make the weight sliders "orderable" (in the sense that you can space them uniformly across the available ranges and get intuitive results, such as clustering based on location and link inclusion within clusters based on genetic distance).
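One way to read the paragraph above, sketched in R. This is an interpretation, not a spec: the helper names, the min-max normalization, and the mapping from a slider position in [0, 1] to a weight of `10^(slider * (n_fields - 1))` are all assumptions chosen so that the maximum weight is 10 for 2 fields and 100 for 3. The link values are made up.

```r
# Rescale a numeric vector to the 0-1 range (min-max normalization).
normalize01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant input: no spread
  (x - rng[1]) / (rng[2] - rng[1])
}

# Map a slider position in [0, 1] to an exponential weight.
# Assumption: the maximum weight is 10^(n_fields - 1).
slider_weight <- function(slider, n_fields) {
  stopifnot(all(slider >= 0 & slider <= 1), n_fields >= 2)
  10^(slider * (n_fields - 1))
}

# Weighted sum of already-normalized distance columns, one value per link.
multi_distance <- function(distances, sliders) {
  d <- as.matrix(distances)
  w <- slider_weight(sliders, ncol(d))
  as.vector(d %*% w)
}

# Made-up links: the first two are within a facility, the last two between.
links <- data.frame(
  genetic = normalize01(c(0, 0.004, 0.002, 0.010)),
  facility = c(0, 0, 1, 1)
)

# Facility slider at the top of its range, genetic slider at the bottom.
facility_first <- multi_distance(links, sliders = c(genetic = 0, facility = 1))
# Sliders reversed.
genetic_first <- multi_distance(links, sliders = c(genetic = 1, facility = 0))
```

With the facility slider at its maximum, every between-facility link scores higher than every within-facility link, so a single threshold separates the two groups. With the sliders reversed, the ordering follows genetic distance and the facility no longer separates the links cleanly.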

Acceptance criteria and must have scope

This will require a large number of modifications to MicrobeTrace as it exists presently. First and foremost, it requires a fundamental UI redesign of the distance threshold controls, to accommodate both the variable number of included variables and the weight sliders that manage their relative importance. Second, it will require a substantial extension to the filtration logic. I recommend a parallel filter process for Multi-filtration, rather than designing it to subsume the current univariate filtration as a special case: keeping it a distinct process will prevent it from slowing down Distance Matrix computations in which it isn't used. Finally, it's a powerful addition to MicrobeTrace's repertoire, but quite complex in comparison to MicrobeTrace's other features at this level of abstraction (it's much harder to describe or teach than changing colors to match a nominal variable, for example), so the documentation and training materials will need to be updated to accommodate this change.

ells commented 2 years ago

Let's think this through from the user's perspective...

Can you add an example use case and perhaps a small mock-up of what you're proposing, with a handful of nodes and edges?

I'd also like to see an example with the slider bar on either end.

AABoyles commented 2 years ago

Suppose I'm doing a COVID-19 analysis. I have the following network:

network1

Generated from these input files: MultiFilter-Links.csv, MultiFilter-Nodes.csv

All distances for shown links are 0. This makes sense, since Omicron has spread so much more quickly than it can mutate into new strains.

But I want to determine "clusters" using some additional data at my disposal. For example, say I have location data (Prison/Long-term Care Facility). I could color the nodes according to this...

network2

But it's not very informative, because (again) genetic variance within this outbreak is so ridiculously low. So, as a user I want a tool that will allow me to say "draw the cluster based on the Facility first, and the distance second." When I have that, I can render my network like this:

multicriteria

This works because the importance of the matching facility dominates the genetic distance, clearly separating the links into intra-cluster links (the bars on the left) and inter-cluster links (the bars on the right):

(To render this network for this demo, I had to self-join the nodes table and left-join the links table to determine if a link was between two nodes of a common facility type and set the distance accordingly. In MicrobeTrace this can be done much more efficiently, since the graph isn't stored as relational tables. The R code to accomplish this follows:

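library(magrittr)
library(readr)
library(dplyr)

# Read the links and collapse every distance below 0.015 to 0.
MultiFilter_Links <-
  read_csv("Downloads/MultiFilter-Links.csv") %>%
  mutate(distance = ifelse(distance < 0.015, 0, distance))

MultiFilter_Nodes <-
  read_csv("Downloads/MultiFilter-Nodes.csv")

network <-
  MultiFilter_Nodes %>%
  # Self-join: every ordered pair of nodes. cross_join() needs dplyr >= 1.1.0;
  # older versions can use full_join(MultiFilter_Nodes, by = character()).
  cross_join(MultiFilter_Nodes) %>%
  rename(
    source = `_id.x`,
    target = `_id.y`
  ) %>%
  # Drop pairs of a node with itself.
  filter(source != target) %>%
  # Nominal distance: 0 if both nodes share a facility, 1 otherwise.
  mutate(facility_distance = ifelse(facility.x == facility.y, 0, 1)) %>%
  # Keep only the pairs that appear in the links table (no `by` is given,
  # so dplyr joins on the columns the two tables share).
  right_join(MultiFilter_Links) %>%
  # Combined distance: facility mismatch plus genetic distance.
  mutate(multiDistance = facility_distance + distance)

# Inspect interactively (RStudio), then write the result out.
View(network)

network %>%
  write_csv("Downloads/network.csv")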

...which outputs this file: network.csv)
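For illustration, a minimal sketch of how a single threshold on the combined distance could then separate intra-cluster from inter-cluster links. The table below is a hypothetical stand-in with made-up values, not the contents of network.csv, and `filter_links` and the 0.5 threshold are assumptions for this example.

```r
# Hypothetical stand-in for the joined links table; values are made up.
network <- data.frame(
  source = c("A", "A", "B", "C"),
  target = c("B", "C", "D", "D"),
  facility_distance = c(0, 1, 1, 0),
  distance = c(0, 0, 0.02, 0)
)
network$multiDistance <- network$facility_distance + network$distance

# Keep a link only if its combined distance is at or below the threshold.
filter_links <- function(links, threshold) {
  links[links$multiDistance <= threshold, , drop = FALSE]
}

# Assumed threshold: any value below 1 excludes links between different
# facilities, provided every genetic distance is itself below 1.
visible <- filter_links(network, threshold = 0.5)
print(visible[, c("source", "target", "multiDistance")])
```

Only the links whose endpoints share a facility survive the filter, which is the separation described above.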

CDCgov / MicrobeTrace