Things Tried
Bizarrely, this lower-fit model still makes predictions that outperform our collection-dated model.
Measuring just edge frequency (edge count / vertex count) across the span of diagnostic years reveals some of the trends that might be causing this: two periods of unusually high transmission in the mid-90s, followed by a general decline in transmission after 1999 (possibly associated with ART?).
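A minimal sketch of that per-year measurement, assuming an igraph object `g` whose vertices carry a `year` attribute (the diagnostic year); the object and attribute names are illustrative, not the repo's actual code:

```r
library(igraph)

## Per-year edge frequency = (# edges incident on year Y) / (# vertices in year Y)
ev <- ends(g, E(g), names = FALSE)              # 2-column matrix of edge endpoints
edge_years <- matrix(V(g)$year[ev], ncol = 2)   # map each endpoint to its year

edge_freq <- sapply(sort(unique(V(g)$year)), function(y) {
  n_edges <- sum(edge_years[, 1] == y | edge_years[, 2] == y)
  n_verts <- sum(V(g)$year == y)
  n_edges / n_verts
})
```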
A cool result here is that the GAIC-optimized threshold is robust to switching predictor variables: it doesn't matter whether you use sample collection date or diagnosis date, the best threshold is around 0.015.
Right, I guess that's because the graphs are most informative at that point. Whatever model/params we throw at those graphs should do best there, because that threshold gives the best balance of meaningful variation and growth/size.
The other important thing is that even though the threshold does not change, the diagnosis-date model does outperform the other model, and so does a better job of predicting new cases.
Without Edge Density Taken into Account
With Edge Density Taken into Account: `ageDi$Positive/(ageDi$vTotal*ageDi$oeDens)`
The ridge is still a bit visible, but I think it does help a bit.
Fixed in an upcoming push. I added "oeDens" as a covariate when analyzing a given sub-graph (for a bipartite graph from an old year to a new year, oeDens represents the edge density of the old year). This effectively normalizes for the different mean genetic distances of edges incident on different years. Typically this has a minimal effect on GAIC; however, the glm() fit becomes less sensitive to an outbreak. Tracking edge density also allows us to remove the very extreme outliers seen in the Tennessee diagnostic decay plots for a few years in the mid-90s (years with >280 edges above 96% similarity). The result is a more stable-looking plot of bipartite edge density and a more robust dataset.
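A minimal sketch of what adding the covariate could look like, assuming the data frame structure implied above (`ageDi` with counts `Positive`, vertex totals `vTotal`, edge densities `oeDens`, and a time-gap column here called `age`); the family and offset are assumptions for illustration, not the repo's actual model:

```r
## Assumed columns of ageDi (one row per old-year/new-year sub-graph):
##   Positive - count of new cases linked back to the old year
##   vTotal   - number of vertices in the sub-graph
##   oeDens   - edge density of the old year's graph
##   age      - gap in years between the old and new year
fit <- glm(Positive ~ age + oeDens,   # oeDens enters as a covariate
           offset = log(vTotal),      # counts scaled to sub-graph size
           family = poisson,
           data   = ageDi)
summary(fit)
```

Without the oeDens term, years whose edges sit at systematically different genetic distances get absorbed into the case counts; with it, that variation is modelled explicitly.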
The subtree clustering also reflects this problem. When finding each tip's closest neighbour tip from a past year, tips from 1992 come up much more often than expected.
I calculated this as a per-year frequency:

Frequency(Y) = (# of closest tips from year Y) / (total # of tips from year Y)

Then normalized by the average frequency to get an apparent bias:

Bias(Y) = Frequency(Y) / (mean frequency of the other years)
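A hypothetical sketch of that calculation (the vector names are illustrative, not the repo's code): `closest_year` gives, for each tip, the year of its closest past-year neighbour, and `tip_year` gives the year of every tip in the tree:

```r
## Frequency(Y) = (# closest tips from year Y) / (total # tips from year Y)
n_closest <- table(closest_year)
n_total   <- table(tip_year)[names(n_closest)]
freq      <- as.numeric(n_closest) / as.numeric(n_total)
names(freq) <- names(n_closest)

## Bias(Y) = Frequency(Y) / (mean frequency of the other years)
bias <- sapply(seq_along(freq), function(i) freq[[i]] / mean(freq[-i]))
names(bias) <- names(freq)
bias  # a value well above 1 (e.g. for 1992) flags over-representation
```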
Taking these biases into account has a dramatic effect on predictive model performance (min GAIC of -15 to min GAIC of -350), but I'm worried that I'm really just measuring the effect of a sampling problem here. That may still be fine, but I thought I'd run this by you because it's a fairly extreme result.
Wow, that's huge! Well, I think some sort of normalization is necessary for TN93 and subtree clustering to be compared fairly, but we should really give ourselves time to figure this out properly. I think the "preliminary results" phrasing of the abstract gives us a way out.
Currently Fixed
With Meta-Data: decay_tn_met.pdf
Without Meta-Data: decay_tn_noMet.pdf