chanzuckerberg / galago

Interpretation aids for genomic epidemiology
https://chanzuckerberg.github.io/galago/
MIT License
8 stars 3 forks source link

Partial port of `matutils introduce` #94

Closed sidneymbell closed 2 years ago

sidneymbell commented 2 years ago

Calcuting the "mcbroome heuristic" --> assign internal nodes

This PR enables us to assign attribute values to internal nodes using the heuristic developed by McBroome et al in matutils. While (I believe?) matutils introduce uses this method just for categorical geographical data, this implementation generalizes to any arbitrary categorical metadata variable.

Sidenote: I've also ported the two other statistics they utilize (association index and biggest monophyletic subclade) for good measure / we may want them downstream, although they aren't currently being used.

Internal node states --> clusters

To turn internal node assignments -> clusters, I just plugged into the same getAttrChanges method implemented previously to produce "clusters" from nextstrain phylogeography values (i.e., walks the tree and identifies branches where attribute values change from one value to another). I'm sure this method could be overhauled for computational efficiency, but is fine for prototyping.

Clustering results --> user interface for browsing and selecting a set of samples

Finally, I've wired all of this up to the frontend so that users can (1) select the attribute they want to analyze (2) generate clustering results on the fly and (3) see which clusters contain at least one of their input "samples of interest."

Feedback requested.

Initial testing gives reasonable-looking results, but I'd love an extra set of eyes on:

  1. [Most important] The port / implementation of the heuristic calculation. See src/utils/matutilsIntroducePort.ts > calcHeuristic. I've tried to comment verbosely, but lmk if anything is confusing!
  2. [Helpful, less critical] The rest of the src/utils/matutilsIntroducePort.ts file
  3. [Always appreciated!] General feedback on the implementation, UI, direction, etc etc etc

You can also see a live demo here.

Note: why is this PR "merged"?

I've gone ahead and merged the branch since this is strictly a prototype / not in production -- I'm still very much looking for feedback, though! 🙏 😸

jmcbroome commented 2 years ago

Hi Sidney- sorry it took me a little longer to look at this code. I like that it's presented alongside the Nextstrain calculations, as a method for when users want to try to explore the data in a more customized if less informed way.

The calculation seems to have been ported over straightforwardly; the extension of it to other kinds of categorical metadata is interesting, and theoretically valid. Though, using it with pangolin lineages etc is never valid, and probably shouldn't be possible in Galago- the lineage of any node is very simply the most recent ancestor annotated with a specific lineage by the Pangolin team, by definition, and is always known for a given tree (assuming that tree has been annotated appropriately of course). I would also be nervous about its performance with many small regions with high levels of inter-regional transmission; the heuristic performs best between larger areas with relatively well defined borders that significantly reduce transmission, predictably enough. You may want to consider limiting what it can be applied to by Galago.

I also have some questions about the implementation around the core heuristic. I'm not familiar with typescript, so apologies if any of these questions are just misunderstandings. It looks like you're storing only a single combination of confidence and attribute at a time? Attribute in this case being a regional tag, usually? The concern being that if the user queries over, say, three regions, you only want to record the highest confidence region for each node? For clustertracker, if only one region is going to be reported and they have tie values, we usually just report the first one in memory and move on; the main issue here is that the confidence value can become misleading if there are multiple regions with confidence 1 because there are multiple identical children across different regions in the query (users think that the region reported is definitely the origin of a transmission chain, when its more ambiguous because that particular strain was all over the board, including definitely in that region at that time). This is a weakness of the method and clustertracker that seems to be shared by Galago here. Ideally, you would report multiple regions in the case of a tie (you could concatenate the attribute strings with a "/" delimiter, perhaps). In the specific case of samples with multiple regional values and identical genomes leading to a bunch of regions with confidence 1 for their parent, you may want to explicitly identify it as a putative superspreader event, or at least a major regional transmission event.

Additionally, you seem to arbitrarily filter on between 2 and 100 unique values of an attribute. 2 is a fine floor, as 1 is not meaningful, but why the 100 cap? Did you have performance issues? I know when applied to the whole tree of millions of samples, a computation with more than about 70 distinct regions can start to run into memory use issues, but I'm surprised Galago is being applied to trees large enough to encounter those limits.

I'm glad to see my work getting some use and being part of such a useful platform- Jakob M