PoonLab / vindels

Developing an empirical model of sequence insertion and deletion in virus genomes

Check whether variable loops lengthen/shrink over time #88

Closed jpalmer37 closed 4 years ago

jpalmer37 commented 4 years ago

I adjusted the insertion and deletion rate estimates to account for their nucleotide lengths (essentially counting the rate of nucleotide insertion / deletion over time). I fit a GLM in the following two ways; both gave essentially the same results.

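# Fit 1: response is the length of each indel sequence in nucleotides
ifit <- glm(nchar(indel$Seq) ~ 1, offset=log(indel$Date), family="poisson")
# Fit 2: response is the indel length multiplied by indel$Count
ifit <- glm(nchar(indel$Seq)*indel$Count ~ 1, offset=log(indel$Date), family="poisson")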
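
As a rough illustration of what an intercept-only Poisson fit with a log offset estimates, here is a minimal sketch on toy data. The data frame `toy` and its columns `len` and `elapsed` are hypothetical stand-ins, not the real `indel` table. For this kind of model, `exp()` of the intercept is the estimated rate per unit time, and it should agree with the closed-form maximum-likelihood value, total nucleotides divided by total time.

```r
# Toy data (hypothetical): indel lengths in nucleotides and elapsed times
toy <- data.frame(
  len     = c(3, 6, 3, 9, 12, 3),            # nucleotides per indel event
  elapsed = c(120, 340, 95, 410, 600, 150)   # elapsed time, arbitrary units
)

# Intercept-only Poisson GLM with a log(time) offset
fit <- glm(len ~ 1, offset = log(elapsed), family = poisson, data = toy)

# exp(intercept) is the estimated rate of nucleotides per unit time
rate_glm <- unname(exp(coef(fit)))

# Closed-form maximum-likelihood estimate for this model
rate_closed <- sum(toy$len) / sum(toy$elapsed)

print(c(glm = rate_glm, closed_form = rate_closed))
```

Because the estimate reduces to a ratio of totals, a few long indels or a few very short elapsed times can dominate it, which may be worth checking before comparing insertion and deletion rates.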

When I calculated the difference between the insertion and deletion rate estimates (insertion minus deletion), most values came out negative, so it seems that variable loops are in fact shrinking over time. All values below are in units of nucleotides per year per v-loop nucleotide.

> ins.df - del.df
Run         V1         V2          V3         V4         V5
1   -23.194469 -14.741912 -0.55020521 -6.9002498 -14.423858
2    -9.653592  -8.144446 -0.49512493 -5.2092047 -12.979906
3   -17.854267  -8.184184 -0.05175768  0.1405488 -13.586392
4   -18.293227  -9.292137 -0.04827084 -3.1750563 -14.206986
5    -6.089090  -8.867034 -0.03523325  3.6427936  -2.204747
6   -17.768955 -16.352905 -0.64202016 -3.5036782 -18.607165
7   -12.834190  -1.989669 -0.63669406  1.4407947 -13.801067
8   -15.764656  -3.151562 -0.54357223 -1.4376948 -14.057177
9   -10.807588  -8.535372 -0.55931450 -3.4501284  -5.227684
10   -2.746858  -9.362723 -0.06608980  0.9372217 -20.292574
11  -15.448736  -7.554436 -0.93957225 -5.9004960 -15.079149
12    6.360926  -2.985741 -0.62311121  1.2615348 -16.191451
13   -1.847524 -14.290859 -0.55208148 -8.3426968  -7.355858
14    3.763810  -2.644840 -1.05793591 -1.7258111  -8.544867
15  -10.455269  -2.652068 -1.07248475 -1.1396161 -18.805575
16  -10.113422  -8.549256 -0.04044326 -1.8873522 -15.913810
17  -17.281045  -2.603679 -0.02827525  2.0475177 -14.844503
18  -13.124685  -7.470638 -0.54862496 -3.5360752 -13.036475
19  -18.005502 -14.536282 -0.55761137 -5.7203236 -13.730905
20   -8.389365  -2.639511 -0.77045186  5.9510765 -17.269017
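
One way to read a table like this is to ask, per loop, in what share of runs the net rate is negative. The sketch below does that for the first three runs only, copied from the table above; the object name `net` is an assumption, and the same calls should work on the full 20-run data frame.

```r
# First three runs of the difference table (insertion minus deletion)
net <- data.frame(
  V1 = c(-23.194469, -9.653592, -17.854267),
  V2 = c(-14.741912, -8.144446, -8.184184),
  V3 = c(-0.55020521, -0.49512493, -0.05175768),
  V4 = c(-6.9002498, -5.2092047, 0.1405488),
  V5 = c(-14.423858, -12.979906, -13.586392)
)

# Proportion of runs in which deletion outpaces insertion, per loop
prop_negative <- colMeans(net < 0)

# Median net rate per loop, less sensitive to outlying runs than the mean
median_net <- sapply(net, median)

print(prop_negative)
print(median_net)
```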

Summaries of these differences for each of the five v-loops:

> summary(indel$V1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-23.194 -17.403 -11.821 -10.977  -7.814   6.361 
> summary(indel$V2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-16.353  -9.310  -8.164  -7.727  -2.902  -1.990 
> summary(indel$V3)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-1.07248 -0.63803 -0.55114 -0.49094 -0.06251 -0.02828 
> summary(indel$V4)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -8.343  -3.954  -1.807  -1.825   1.018   5.951 
> summary(indel$V5)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-20.293 -15.983 -14.132 -13.508 -13.022  -2.205 

I want to double-check my logic on this because the literature says that variable loops lengthen, not shrink, over time.

jpalmer37 commented 4 years ago

Here are two histograms showing the time since the start of infection (in days) of every insertion and deletion event across all phylogenetic trees from my within-host patient data. Each patient was run in 20 replicates, so the exact counts are inflated. Tick marks along the bottom show the maximum (most recent) date of each patient data set as a reference; these maximum dates are clearly influencing the trend. I'm working on a plot that shows indel timings normalized to the maximum date.

[Figures: timing-ins (insertion timings), timing-del (deletion timings)]
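
The normalization mentioned above could look something like the following sketch. The `events` table, its column names, and the `max_date` vector are all hypothetical; the idea is only to divide each event time by its own patient's maximum date so that patients with different follow-up lengths share a common 0 to 1 axis.

```r
# Hypothetical event table: one row per indel event
events <- data.frame(
  patient = c("A", "A", "A", "B", "B"),
  days    = c(100, 250, 500, 300, 1200)   # time since start of infection
)

# Hypothetical most recent sampling date per patient, in days
max_date <- c(A = 500, B = 1200)

# Scale each event time by its own patient's maximum date
events$rel_time <- events$days / max_date[as.character(events$patient)]

# Histogram of normalized timings pooled across patients
hist(events$rel_time, breaks = seq(0, 1, by = 0.1),
     main = "Indel timing (normalized)", xlab = "Fraction of max date")
```

Plotting insertions and deletions separately on this scale would make it easier to tell whether the trend in the raw histograms reflects biology or simply where each patient's sampling ends.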

ArtPoon commented 4 years ago

Outdated issue, but note that estimates do need to account for the number of nucleotides comprising a given insertion or deletion event.