Closed jpalmer37 closed 5 years ago
Wow that's awful. Ok it looks like we need to limit the time range - can you make a boxplot showing branch length (time) grouped by number of insertions?
also can you post the output of summary(
Here's a boxplot of the branch lengths stratified by insertion counts. Numbers 1,2,3,4 actually correspond with counts 0,1,2,3.
Variable Loop 1
Call:
glm(formula = vr.df[[1]]$Count ~ vr.df[[1]]$Date, family = "poisson")
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6992 -0.1762 -0.1545 -0.1474 4.4978
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.544850 0.160940 -28.24 <2e-16 ***
vr.df[[1]]$Date 0.008707 0.000683 12.75 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 491.72 on 3116 degrees of freedom
Residual deviance: 416.06 on 3115 degrees of freedom
AIC: 531.51
Number of Fisher Scoring iterations: 6
Variable Loop 2
Call:
glm(formula = vr.df[[2]]$Count ~ vr.df[[2]]$Date, family = "poisson")
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0887 -0.1304 -0.1135 -0.1081 4.1389
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.1658586 0.2182198 -23.67 <2e-16 ***
vr.df[[2]]$Date 0.0088688 0.0008976 9.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 302.95 on 3116 degrees of freedom
Residual deviance: 257.91 on 3115 degrees of freedom
AIC: 326.52
Number of Fisher Scoring iterations: 7
Variable Loop 4
Call:
glm(formula = vr.df[[4]]$Count ~ vr.df[[4]]$Date, family = "poisson")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5191 -0.1299 -0.1147 -0.1099 2.8475
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.131316 0.222611 -23.051 < 2e-16 ***
vr.df[[4]]$Date 0.007867 0.001112 7.073 1.52e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 278.61 on 3116 degrees of freedom
Residual deviance: 253.83 on 3115 degrees of freedom
AIC: 317.83
Number of Fisher Scoring iterations: 7
Variable Loop 5
Call:
glm(formula = vr.df[[5]]$Count ~ vr.df[[5]]$Date, family = "poisson")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2457 -0.1280 -0.1139 -0.1094 5.5325
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.137943 0.228094 -22.526 < 2e-16 ***
vr.df[[5]]$Date 0.007285 0.001273 5.721 1.06e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 270.49 on 3116 degrees of freedom
Residual deviance: 253.39 on 3115 degrees of freedom
AIC: 310.39
Number of Fisher Scoring iterations: 7
I tested a wide range of cutoffs (between 10 and 500) for branch length values and found that a maximum of 325 days gave the best result (which entailed slight improvements in the model fits), based on a pseudo R-squared value. However, even after applying this cutoff, the data still appear to follow the same trend. Any suggestions? V1 V2 V4 V5
Ok so these QQ plots are telling us that there are a substantial number of cases where the observed numbers of insertions are far greater than the predicted numbers. We need to know which cases these correspond to and have a look at them. You can do this by extracting the first 10 indices order(residuals(fit), decreasing=TRUE)[1:10]
.
I looked at these indices within the data set. Like you said, all of these entries had more insertions than what would have typically been predicted (relatively low branch length, but 1 or more insertions). For reference:
Accno Vloop Count Date Seq Pos
1761 KY323927 1 3 65.75320 ATG,GGGAAT,AGTACCACTAGTGGT 432,467,503
2815 KF996655 1 2 64.40380 T,GG 448,468
381 MF500630 1 2 151.93698 TGGGAA,GGGAATAGT 505,591
29118 KC247501 1 1 10.12648 AGCAATACC 551
4967 KC247494 1 1 15.61010 TAC 692
2265 MF500654 1 1 21.26492 CTACCAACAATGGGAGTAGTAATG 567
4138 KC247499 1 1 29.29904 TACCAA 766
33612 KC247663 1 2 216.76156 AATACC,AATATC 641,820
5265 KT220758 1 1 39.46744 ACCCACAATATT 515
906 MF500634 1 1 41.75194 TTTAAACTG 578
5064 KT220758 1 1 44.13439 TTACCCACAATA 408
3369 JX972552 1 1 45.71253 GGAAATAAT 438
16134 KC247662 1 1 46.89895 GGGAATGAATAGCAGTATATTAGA 1034
2165 MF500560 1 1 47.33829 AGTGGGATC 864
896 MF500541 1 1 51.02274 ATAATATCA 432
831 MF500576 1 1 51.37626 AGACTA 447
113 MF566015 1 2 224.16132 A,AGT 409,416
2621 KF996669 1 1 63.24380 GAAAGT 461
8261 MF500576 1 1 63.82567 ACTAAG 494
5267 KC247659 1 1 69.14661 GTATAATAGAGGGA 743
22133 KC247546 1 1 72.62658 GCCAGCAATAGCAGTATAAACTGT 539
6165 KC247547 1 1 83.47979 AATAGCAG 686
11118 JX973331 1 1 84.24822 ATATTACTGACA 477
1065 MF500636 1 1 86.84139 AGTAAT 672
I tried removing these entries from the data set to see what would happen and found that they cause the GLM to fail.
> fit2 <- glm(finalized2$Count ~ finalized2$Date, family="poisson")
Warning message:
glm.fit: algorithm did not converge
> summary(fit2)
Call:
glm(formula = finalized2$Count ~ finalized2$Date, family = "poisson")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.667e-06 -1.667e-06 -1.667e-06 -1.667e-06 -1.667e-06
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.730e+01 1.132e+04 -0.002 0.998
finalized2$Date -3.134e-16 1.874e+02 0.000 1.000
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 0.0000e+00 on 3055 degrees of freedom
Residual deviance: 8.4883e-09 on 3054 degrees of freedom
AIC: 4
Number of Fisher Scoring iterations: 25
I'm still able to call the EDA on the residuals though:
There's something very strange going on where the null deviance is 0
.
Instead of box-and-whisker plots, can you show me a ridge plot? (histograms or kernel densities on the same plot but offset by some arbitrary amounts).
I think what's happening is that the cases with 2+ insertions do not trend with longer branch lengths and that's screwing up the Poisson model. We might have to look into a negative binomial model or Poisson mixture model.
Is this what you were looking for?
This describes the data set after applying the 325 day maximum, but still containing the entries with the highest residuals.
yeah that looks right. instead of densities can you please use histograms? Either that or reduce the bandwidth
Is this a little better?
What you mentioned previously regarding the 2+ insertion cases makes sense to me. It's unfortunate that a standard Poisson model might not be appropriate for this data. I'll starting reading up on the other two models you mentioned to get familiar with them.
Ok - there only four cases of 2 insertions, so let's ignore that distribution. Can you send me the data frame with just times and counts?
The original plot was describing V1 only by mistake. This one covers all variable loops. (The trend is almost identical strangely enough).
Here's the data in CSV format with only the times and counts like you asked.
Ok I've figured out that I was wrong about how we fit the Poisson model when observations have varying amounts of time.
Here is a toy example:
> x <- rexp(1000)
> y <- rpois(1000, 0.5*x)
> table(y)
y
0 1 2 3 4 5 6
668 219 77 22 6 6 2
> fit <- glm(y~1, offset=log(x), family='poisson')
> summary(fit)
Call:
glm(formula = y ~ 1, family = "poisson", offset = log(x))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1093 -0.8151 -0.4591 0.1825 3.4702
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.6906 0.0445 -15.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 812.92 on 999 degrees of freedom
Residual deviance: 812.92 on 999 degrees of freedom
AIC: 1567.8
Number of Fisher Scoring iterations: 5
> exp(-0.6906)
[1] 0.5012752
From my experiments it looks like the right limit of the Q-Q plot skews upwards even on simulated data when the rate is low.
This is an EDA performed on the V1 data set, as an example. V2, V4, and V5 all followed the same trend very closely.
Here's the call for reference:
Call:
glm(formula = finalized$Count ~ 1, family = "poisson", offset = log(finalized$Date))
Deviance Residuals:
Min 1Q Median 3Q Max
-0.37360 -0.13842 -0.07523 -0.03775 3.13062
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.4432 0.2041 -41.36 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 193.71 on 3106 degrees of freedom
Residual deviance: 193.71 on 3106 degrees of freedom
AIC: 243.71
Number of Fisher Scoring iterations: 8
And the rates that I retrieved from the GLMs.
Loop | Insertion Rate (insertions/nt/year) |
---|---|
V1 | 1.844322e-03 |
V2 | 6.092678e-04 |
V3 | 4.217497e-13 |
V4 | 1.091617e-03 |
V5 | 2.382050e-03 |
With counts of 2 and 3 left in:
This distplot
(checking for Poissonness) shows a somewhat poor fit to the model when 2 and 3 counts are included.
Calling Ord_plot
on the data appeared to select a log-series model as the best fit. This choice might also be due to the effects of 2/3 counts(?). Not sure. (I couldn't call this on only 0 and 1 counts because no meaningful result was produced)
With counts of 2 and 3 removed:
With 2 and 3 counts removed, I created residual plots using the simulateResiduals
function in the DHARMa package, which simulates new data from the model and compares it to the observed data.
This was a test for overdispersion. Appeared to register negative.
> dispersiontest(fit)
Overdispersion test
data: fit
z = -3.504, p-value = 0.9998
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
0.9848534
A zero-inflated Poisson model was not any better than our chosen model based on AIC.
> fit <- glm(finalized$Count ~ 1, offset=log(finalized$Date), family="poisson")
> fit2 <- zeroinfl(finalized$Count~1, offset=log(finalized$Date), dist="poisson")
> AIC(fit, fit2)
df AIC
fit 1 397.4853
fit2 2 399.4854
I also created this influencePlot
but I'm still figuring out how to interpret it.
Are there other specific tests you would like me to run?
Here's the result of the EDA performed using the BSDA package