PoonLab / vindels

Developing an empirical model of sequence insertion and deletion in virus genomes
1 stars 0 forks source link

EDA within host insertions #63

Closed jpalmer37 closed 5 years ago

jpalmer37 commented 5 years ago

Here's the result of the EDA performed using the BSDA package EDA-insertions-withinhost

ArtPoon commented 5 years ago

Wow that's awful. Ok it looks like we need to limit the time range - can you make a boxplot showing branch length (time) grouped by number of insertions?

ArtPoon commented 5 years ago

also can you post the output of summary()?

jpalmer37 commented 5 years ago

Here's a boxplot of the branch lengths stratified by insertion counts. Numbers 1,2,3,4 actually correspond with counts 0,1,2,3. branch-lengths

Variable Loop 1

Call:
glm(formula = vr.df[[1]]$Count ~ vr.df[[1]]$Date, family = "poisson")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6992  -0.1762  -0.1545  -0.1474   4.4978  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -4.544850   0.160940  -28.24   <2e-16 ***
vr.df[[1]]$Date  0.008707   0.000683   12.75   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 491.72  on 3116  degrees of freedom
Residual deviance: 416.06  on 3115  degrees of freedom
AIC: 531.51

Number of Fisher Scoring iterations: 6

Variable Loop 2

Call:
glm(formula = vr.df[[2]]$Count ~ vr.df[[2]]$Date, family = "poisson")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0887  -0.1304  -0.1135  -0.1081   4.1389  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -5.1658586  0.2182198  -23.67   <2e-16 ***
vr.df[[2]]$Date  0.0088688  0.0008976    9.88   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 302.95  on 3116  degrees of freedom
Residual deviance: 257.91  on 3115  degrees of freedom
AIC: 326.52

Number of Fisher Scoring iterations: 7

Variable Loop 4

Call:
glm(formula = vr.df[[4]]$Count ~ vr.df[[4]]$Date, family = "poisson")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5191  -0.1299  -0.1147  -0.1099   2.8475  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -5.131316   0.222611 -23.051  < 2e-16 ***
vr.df[[4]]$Date  0.007867   0.001112   7.073 1.52e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 278.61  on 3116  degrees of freedom
Residual deviance: 253.83  on 3115  degrees of freedom
AIC: 317.83

Number of Fisher Scoring iterations: 7

Variable Loop 5

Call:
glm(formula = vr.df[[5]]$Count ~ vr.df[[5]]$Date, family = "poisson")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2457  -0.1280  -0.1139  -0.1094   5.5325  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -5.137943   0.228094 -22.526  < 2e-16 ***
vr.df[[5]]$Date  0.007285   0.001273   5.721 1.06e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 270.49  on 3116  degrees of freedom
Residual deviance: 253.39  on 3115  degrees of freedom
AIC: 310.39

Number of Fisher Scoring iterations: 7
jpalmer37 commented 5 years ago

I tested a wide range of cutoffs (between 10 and 500) for branch length values and found that a maximum of 325 days gave the best result (which entailed slight improvements in the model fits), based on a pseudo R-squared value. However, even after applying this cutoff, the data still appear to follow the same trend. Any suggestions? V1 EDA_V1 V2 EDA_V2 V4 EDA_V4 V5 EDA_V5

ArtPoon commented 5 years ago

Ok so these QQ plots are telling us that there are a substantial number of cases where the observed numbers of insertions are far greater than the predicted numbers. We need to know which cases these correspond to and have a look at them. You can do this by extracting the first 10 indices order(residuals(fit), decreasing=TRUE)[1:10].

jpalmer37 commented 5 years ago

I looked at these indices within the data set. Like you said, all of these entries had more insertions than what would have typically been predicted (relatively low branch length, but 1 or more insertions). For reference:

         Accno Vloop Count      Date                         Seq         Pos
1761  KY323927     1     3  65.75320  ATG,GGGAAT,AGTACCACTAGTGGT 432,467,503
2815  KF996655     1     2  64.40380                        T,GG     448,468
381   MF500630     1     2 151.93698            TGGGAA,GGGAATAGT     505,591
29118 KC247501     1     1  10.12648                   AGCAATACC         551
4967  KC247494     1     1  15.61010                         TAC         692
2265  MF500654     1     1  21.26492    CTACCAACAATGGGAGTAGTAATG         567
4138  KC247499     1     1  29.29904                      TACCAA         766
33612 KC247663     1     2 216.76156               AATACC,AATATC     641,820
5265  KT220758     1     1  39.46744                ACCCACAATATT         515
906   MF500634     1     1  41.75194                   TTTAAACTG         578
5064  KT220758     1     1  44.13439                TTACCCACAATA         408
3369  JX972552     1     1  45.71253                   GGAAATAAT         438
16134 KC247662     1     1  46.89895    GGGAATGAATAGCAGTATATTAGA        1034
2165  MF500560     1     1  47.33829                   AGTGGGATC         864
896   MF500541     1     1  51.02274                   ATAATATCA         432
831   MF500576     1     1  51.37626                      AGACTA         447
113   MF566015     1     2 224.16132                       A,AGT     409,416
2621  KF996669     1     1  63.24380                      GAAAGT         461
8261  MF500576     1     1  63.82567                      ACTAAG         494
5267  KC247659     1     1  69.14661              GTATAATAGAGGGA         743
22133 KC247546     1     1  72.62658    GCCAGCAATAGCAGTATAAACTGT         539
6165  KC247547     1     1  83.47979                    AATAGCAG         686
11118 JX973331     1     1  84.24822                ATATTACTGACA         477
1065  MF500636     1     1  86.84139                      AGTAAT         672

I tried removing these entries from the data set to see what would happen and found that they cause the GLM to fail.

> fit2 <- glm(finalized2$Count ~ finalized2$Date, family="poisson")
Warning message:
glm.fit: algorithm did not converge 
> summary(fit2)

Call:
glm(formula = finalized2$Count ~ finalized2$Date, family = "poisson")

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-1.667e-06  -1.667e-06  -1.667e-06  -1.667e-06  -1.667e-06  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)     -2.730e+01  1.132e+04  -0.002    0.998
finalized2$Date -3.134e-16  1.874e+02   0.000    1.000

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 0.0000e+00  on 3055  degrees of freedom
Residual deviance: 8.4883e-09  on 3054  degrees of freedom
AIC: 4

Number of Fisher Scoring iterations: 25

I'm still able to call the EDA on the residuals though: EDA_V1_filtered

ArtPoon commented 5 years ago

There's something very strange going on where the null deviance is 0. Instead of box-and-whisker plots, can you show me a ridge plot? (histograms or kernel densities on the same plot but offset by some arbitrary amounts). I think what's happening is that the cases with 2+ insertions do not trend with longer branch lengths and that's screwing up the Poisson model. We might have to look into a negative binomial model or Poisson mixture model.

ArtPoon commented 5 years ago

https://stats.stackexchange.com/questions/129902/fitting-a-glm-to-a-zero-inflated-positive-continuous-response

jpalmer37 commented 5 years ago

Is this what you were looking for? ridges_plot

This describes the data set after applying the 325 day maximum, but still containing the entries with the highest residuals.

ArtPoon commented 5 years ago

yeah that looks right. instead of densities can you please use histograms? Either that or reduce the bandwidth

jpalmer37 commented 5 years ago

Is this a little better? ridges_plot2

What you mentioned previously regarding the 2+ insertion cases makes sense to me. It's unfortunate that a standard Poisson model might not be appropriate for this data. I'll starting reading up on the other two models you mentioned to get familiar with them.

ArtPoon commented 5 years ago

Ok - there only four cases of 2 insertions, so let's ignore that distribution. Can you send me the data frame with just times and counts?

jpalmer37 commented 5 years ago

The original plot was describing V1 only by mistake. This one covers all variable loops. (The trend is almost identical strangely enough). ridges_plot3

Here's the data in CSV format with only the times and counts like you asked.

ArtPoon commented 5 years ago

Ok I've figured out that I was wrong about how we fit the Poisson model when observations have varying amounts of time.

Here is a toy example:

> x <- rexp(1000)
> y <- rpois(1000, 0.5*x)
> table(y)
y
  0   1   2   3   4   5   6 
668 219  77  22   6   6   2 
> fit <- glm(y~1, offset=log(x), family='poisson')
> summary(fit)

Call:
glm(formula = y ~ 1, family = "poisson", offset = log(x))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1093  -0.8151  -0.4591   0.1825   3.4702  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.6906     0.0445  -15.52   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 812.92  on 999  degrees of freedom
Residual deviance: 812.92  on 999  degrees of freedom
AIC: 1567.8

Number of Fisher Scoring iterations: 5

> exp(-0.6906)
[1] 0.5012752
ArtPoon commented 5 years ago

From my experiments it looks like the right limit of the Q-Q plot skews upwards even on simulated data when the rate is low.

jpalmer37 commented 5 years ago

This is an EDA performed on the V1 data set, as an example. V2, V4, and V5 all followed the same trend very closely. EDA_V1_new

Here's the call for reference:

Call:
glm(formula = finalized$Count ~ 1, family = "poisson", offset = log(finalized$Date))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.37360  -0.13842  -0.07523  -0.03775   3.13062  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -8.4432     0.2041  -41.36   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 193.71  on 3106  degrees of freedom
Residual deviance: 193.71  on 3106  degrees of freedom
AIC: 243.71

Number of Fisher Scoring iterations: 8

And the rates that I retrieved from the GLMs.

Loop Insertion Rate (insertions/nt/year)
V1 1.844322e-03
V2 6.092678e-04
V3 4.217497e-13
V4 1.091617e-03
V5 2.382050e-03
ArtPoon commented 5 years ago

https://stats.stackexchange.com/questions/70558/diagnostic-plots-for-count-regression

ArtPoon commented 5 years ago

https://stats.stackexchange.com/questions/267461/how-do-i-get-the-residuals-for-a-glm-with-a-binary-response-variable-using-r/268475#268475

jpalmer37 commented 5 years ago

With counts of 2 and 3 left in:

This distplot (checking for Poissonness) shows a somewhat poor fit to the model when 2 and 3 counts are included. distplot Calling Ord_plot on the data appeared to select a log-series model as the best fit. This choice might also be due to the effects of 2/3 counts(?). Not sure. (I couldn't call this on only 0 and 1 counts because no meaningful result was produced)

ord_plot With counts of 2 and 3 removed:

With 2 and 3 counts removed, I created residual plots using the simulateResiduals function in the DHARMa package, which simulates new data from the model and compares it to the observed data. simulateResidualsDHARMa

This was a test for overdispersion. Appeared to register negative.

> dispersiontest(fit)

    Overdispersion test

data:  fit
z = -3.504, p-value = 0.9998
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
 0.9848534 

A zero-inflated Poisson model was not any better than our chosen model based on AIC.

> fit <- glm(finalized$Count ~ 1, offset=log(finalized$Date), family="poisson")
> fit2 <- zeroinfl(finalized$Count~1, offset=log(finalized$Date), dist="poisson")
> AIC(fit, fit2)

     df      AIC
fit   1 397.4853
fit2  2 399.4854

I also created this influencePlot but I'm still figuring out how to interpret it. influencePlot

Are there other specific tests you would like me to run?