PoonLab / tn

Optimization of genetic clustering methods by predictive modeling
GNU General Public License v3.0

High Error for Likelihood Data Based on Diagnostic Date #19

Closed ConnorChato closed 4 years ago

ConnorChato commented 5 years ago

With Meta-Data decay_tn_met.pdf

Without Meta-Data decay_tn_noMet.pdf

ConnorChato commented 5 years ago

Things Tried

ConnorChato commented 5 years ago

Bizarrely, this lower-fit model still makes predictions that outperform our collection-dated model.

Screenshot from 2019-09-16 22-14-35

ConnorChato commented 5 years ago

Screenshot from 2019-09-17 10-41-55

Measuring just the edge frequency (edge count / vertex count) over the span of diagnostic years reveals some of the trends that might be causing this: two periods of unusually high transmission in the mid-90s, followed by a general decline in transmission after 1999 (possibly associated with ART?).
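For what it's worth, here is a minimal sketch of how a per-year edge frequency like this could be computed. All vertex IDs, years, and counts below are made up for illustration, and the actual tn code is in R; this Python version just shows the bookkeeping (here each edge is attributed to the later of its two endpoint years, which is an assumption on my part):

```python
from collections import Counter

# Hypothetical inputs: each vertex is tagged with a diagnostic year, and
# each edge joins two vertex IDs (names are illustrative, not from tn).
vertex_years = {"a": 1994, "b": 1994, "c": 1995, "d": 1999, "e": 1999, "f": 2001}
edges = [("a", "b"), ("a", "c"), ("d", "e")]

# Number of vertices diagnosed in each year
v_count = Counter(vertex_years.values())

# Attribute each edge to the later of its two endpoint years
e_count = Counter(max(vertex_years[u], vertex_years[v]) for u, v in edges)

# Edge frequency = edge count / vertex count, per diagnostic year
edge_freq = {y: e_count[y] / v_count[y] for y in sorted(v_count)}
print(edge_freq)
```

Plotting `edge_freq` against year is the kind of summary that would surface the mid-90s spikes and the post-1999 decline described above.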

ArtPoon commented 5 years ago

A cool result here is that the GAIC-optimized threshold is robust to switching predictor variables: it doesn't matter whether you use sample collection date or diagnosis date, the best threshold is around 0.015.

ConnorChato commented 5 years ago

Right, I guess that's because the graphs are most informative at that point. Whatever model/parameters we throw at those graphs should do best, because they have the best balance of meaningful variation and growth/size.

ArtPoon commented 5 years ago

The other important thing is that even though the threshold does not change, the model with diagnosis date does outperform the other model and so does a better job of predicting new cases.

ConnorChato commented 5 years ago

Without edge density taken into account: image

With edge density taken into account, `ageDi$Positive/(ageDi$vTotal*ageDi$oeDens)`: image

The ridge is still a bit visible, but I think the correction does help.

ConnorChato commented 5 years ago

Fixed in upcoming push. I added "oeDens" as a covariate when analyzing a given sub-graph (for a bipartite graph from an old year to a new year, oeDens represents the edge density of the old year). This effectively normalizes for the different mean genetic distances of edges incident on different years. Typically this has a minimal effect on GAIC; however, it makes the glm() fit less sensitive to an outbreak. Tracking edge density also allows us to remove the very extreme outliers seen in the Tennessee diagnostic decay plots for a few years in the mid-90s (years with >280 edges above 96% similarity). The result is a more stable-looking plot of bipartite edge density and a more robust data set.
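As a rough illustration of the normalization described above: the exact definition of oeDens in tn isn't shown here, so the density formula and all numbers below are my assumptions, and the sketch is in Python rather than the project's R:

```python
def bipartite_edge_density(n_old, n_new, n_edges):
    """Assumed definition: observed old->new edges divided by the
    maximum possible number of old->new edges."""
    return n_edges / (n_old * n_new)

# Toy numbers: 40 old-year tips, 25 new-year tips, 50 old->new edges
oe_dens = bipartite_edge_density(40, 25, 50)

# Normalize the per-subgraph case rate by the old-year edge density,
# mirroring the comment's ageDi$Positive / (ageDi$vTotal * ageDi$oeDens)
positive, v_total = 4, 20
corrected = positive / (v_total * oe_dens)
print(oe_dens, corrected)
```

In the actual pipeline, oeDens would instead enter the glm() fit as a covariate, so the model itself absorbs the year-to-year density differences.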

ConnorChato commented 4 years ago

Subtree clustering is also affected by this problem. When finding each tip's closest neighbour tip from a past year, tips from 1992 come up much more often than expected.

I calculated this as a frequency: Frequency(Y) = (# of closest tips from year Y) / (total # of tips from year Y).

Then I normalized by the average frequency to get an apparent bias: Bias(Y) = Frequency(Y) / (mean frequency of the other years).
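The two formulas above can be sketched as follows; the years and counts are invented for illustration, and this is a Python sketch rather than the project's R:

```python
from collections import Counter

# Hypothetical data: for each tip, the diagnostic year of its closest
# past-year neighbour tip
closest_year = [1992, 1992, 1992, 1995, 1998, 1992]
tips_per_year = {1992: 10, 1995: 20, 1998: 15}

closest_count = Counter(closest_year)

# Frequency(Y) = (# of closest tips from year Y) / (total # tips from year Y)
freq = {y: closest_count[y] / n for y, n in tips_per_year.items()}

# Bias(Y) = Frequency(Y) / (mean frequency of the other years)
bias = {
    y: f / (sum(v for z, v in freq.items() if z != y) / (len(freq) - 1))
    for y, f in freq.items()
}
print(freq, bias)
```

With these toy numbers, 1992 ends up with a bias well above 1, i.e. it is chosen as the closest past year far more often than its share of tips would predict, which mirrors the over-representation described above.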

Taking these biases into account has an extremely dramatic effect on the predictive model performance (min GAIC of -15 to min GAIC of -350), but I'm worried that I'm really just measuring the effect of a sampling problem here. That may still be fine, but I thought I'd run this by you because it's a fairly extreme result.

ArtPoon commented 4 years ago

Wow, that's huge! Well, I think some sort of normalization is necessary for TN93 and subtree clustering to be compared fairly, but we should really give ourselves time to figure this out properly. I think the "preliminary results" phrasing of the abstract gives us a way out.

ConnorChato commented 4 years ago

Currently Fixed