PoonLab / clustuneR

Implementing clustering algorithms on genetic data and finding optimal parameters through the performance of predictive growth models.
GNU General Public License v3.0

Switch to logistic regression for additional terms #24

Open ArtPoon opened 10 months ago

ArtPoon commented 10 months ago

By additional terms, I mean relaxing the assumption that every known case has the same probability of being connected to a new case. We currently do this for component-based clustering, where the null model is Growth ~ Size and the alternative model is Growth ~ Weight. Size assumes that every known case has the same probability of connecting to a new case. Weight adjusts for variation in this probability by fitting a binomial regression to bipartite graphs built at varying differences between timepoints, and then applying that model to each known case to calculate its weight. Finally, we sum the weights of the known cases in each cluster. This is quite different from the approach we take for subtree clustering, where we simply add mean cluster age as a term in the Poisson regression model, i.e., Growth ~ Size + Age.
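To make the contrast concrete, here is a minimal sketch of the two comparisons, assuming Poisson GLMs for both clustering methods and using placeholder column names (`growth`, `size`, `weight`, `mean.age`) that are not necessarily what clustuneR uses:

```r
# Hypothetical cluster-level data frame: one row per cluster of known cases.
# growth   = number of new cases joining the cluster
# size     = number of known cases in the cluster
# weight   = sum of per-case weights from the binomial (bipartite graph) model
# mean.age = mean sampling date of known cases (used for subtree clustering)
clusters <- data.frame(
  growth   = c(0, 2, 1, 0, 3),
  size     = c(3, 10, 4, 2, 12),
  weight   = c(1.1, 6.4, 2.0, 0.7, 8.9),
  mean.age = c(2015.2, 2017.8, 2016.5, 2014.9, 2018.1)
)

# Component-based clustering: null vs. alternative cluster growth models
null.fit <- glm(growth ~ size, data = clusters, family = poisson)
alt.fit  <- glm(growth ~ weight, data = clusters, family = poisson)

# Subtree clustering: mean cluster age enters as an additional term instead
age.fit <- glm(growth ~ size + mean.age, data = clusters, family = poisson)

# Compare fits by AIC (or whatever criterion we settle on)
AIC(null.fit, alt.fit, age.fit)
```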

If we switch to a logistic regression framework, then we don't have to worry about discretizing sampling times into years and lining up known cases into bipartite graphs. Working with the set of known cases, we iterate over each one and determine its retrospective edge (i.e., allowing only one "in-edge" from an older case, selected by minimum distance, for example); the candidate parent chosen by that edge is the positive outcome. Every other known case in the past is a possible "parent", so those pairs are all negative outcomes. Switching to a case-level, outcome-driven model means we can incorporate differences in continuous time, as well as discordance in risk factors. (We might not have to fit the logistic regression to all known cases, just those within a certain time frame, but then we would need some rationale for choosing a particular time period. Perhaps a time interval of the same duration as that used to define "new cases".)
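A rough sketch of how the pair-level data and the logistic regression could look for distance-based clustering. The column names, the toy distance matrix, and the choice of predictors (time difference and risk-factor discordance) are placeholders for illustration, not a proposed interface:

```r
# Hypothetical per-case data: continuous sampling time and a risk factor,
# plus a matrix of pairwise genetic distances among known cases.
cases <- data.frame(
  id   = 1:6,
  time = c(2014.1, 2015.3, 2015.9, 2016.4, 2017.0, 2017.8),
  risk = c("MSM", "MSM", "PWID", "MSM", "PWID", "MSM")
)
set.seed(1)
dists <- as.matrix(dist(rnorm(6)))  # stand-in for a genetic distance matrix

# Build one row per (focal case, candidate parent) pair, where candidate
# parents are older known cases; the minimum-distance parent gets outcome 1.
rows <- list()
for (i in seq_len(nrow(cases))) {
  parents <- which(cases$time < cases$time[i])
  if (length(parents) == 0) next
  best <- parents[which.min(dists[i, parents])]  # retrospective edge
  rows[[length(rows) + 1]] <- data.frame(
    child   = i,
    parent  = parents,
    outcome = as.integer(parents == best),
    dt      = cases$time[i] - cases$time[parents],              # difference in continuous time
    discord = as.integer(cases$risk[i] != cases$risk[parents])  # risk-factor discordance
  )
}
pairs <- do.call(rbind, rows)

# Logistic regression on the pair-level outcomes
fit <- glm(outcome ~ dt + discord, data = pairs, family = binomial)
summary(fit)
```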

This approach becomes more difficult for subtree clustering. With distance clustering, we can simply compare pairs of known cases. However, a phylogeny induces a set of latent ancestral nodes that determine whether or not a case is a member of a cluster. We are currently using pplacer to determine retrospective edges for new cases by grafting the respective sequences onto the tree, which is a computationally expensive step. To fit a logistic regression model, we would have to apply the same method to all known cases, which would mean rebuilding and regrafting a series of trees.

This leads me to wonder whether pplacer is really necessary. Why can't we just work off the complete tree? We capture the uncertainty in tip placement from pplacer, but then we just go with the most probable placement anyhow.
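For example, if we worked directly from the complete tree, a crude version of a retrospective edge could just be the closest older tip by patristic distance. A sketch using ape with a random tree and made-up tip dates; this ignores the latent ancestral node issue above, so it only illustrates the idea:

```r
library(ape)

# Toy complete tree with all known cases as tips; tip dates are made up.
set.seed(1)
phy <- rtree(8)
tip.dates <- setNames(2010 + runif(8, 0, 8), phy$tip.label)

# Patristic (path-length) distances between all tips of the complete tree
pat <- cophenetic(phy)

# For each tip, the retrospective edge points to the closest older tip.
retro <- sapply(phy$tip.label, function(tip) {
  older <- names(tip.dates)[tip.dates < tip.dates[tip]]
  if (length(older) == 0) return(NA)  # earliest case has no parent candidate
  older[which.min(pat[tip, older])]
})
retro
```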