PoonLab / gromstole

Quantifying SARS-CoV-2 VoCs from NGS data of wastewater samples
MIT License
3 stars 4 forks source link

Missing variant frequency estimates #48

Closed ArtPoon closed 2 years ago

ArtPoon commented 2 years ago

A user reported that the estimated frequency for a variant was NA for two samples in their run. We investigated and found that this was being caused by a numerical overflow issue:

> y <- c(0, 2929, 855, 1, 1, 238, NA, NA, NA, NA, NA, 0, NA, 1, 1, 2, NA, 1, NA)
> n <- c(2, 2991, 857, 1, 2, 245,  0,  0,  0,  0,  0, 184, 0, 1, 1, 2748, 0, 1, 0)
> fit <- glm(cbind(y, n-y) ~ 1, family='quasibinomial')
> summary(fit)

Call:
glm(formula = cbind(y, n - y) ~ 1, family = "quasibinomial")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-444.88   -41.59   -10.77     0.00     0.00  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.547e+13  9.195e+14   0.049    0.962

(Dispersion parameter for quasibinomial family taken to be 1.320451e+18)

    Null deviance:   8869  on 10  degrees of freedom
Residual deviance: 215819  on 10  degrees of freedom
  (8 observations deleted due to missingness)
AIC: NA

Number of Fisher Scoring iterations: 10

When we calculate variant frequency as exp(fit$coef) / (1+exp(fit$coef)), the exp overflows. A simple solution is to screen for fit$coef > 100 and assign 1 instead of calculating this inverse logit expression.

ArtPoon commented 2 years ago

A better approach would be to use non-parametric bootstrap resampling to estimate confidence intervals. I tried a very rudimentary approach here:

> boots <- sapply(1:1000, function(i) {
+   idx <- sample(1:length(y), length(y), replace=T)
+   fit1 <- glm(cbind(y[idx], n[idx]-y[idx]) ~ 1, family='quasibinomial')
+   fit1$coefficients[1]   
+ })
> inv.logit <- function(x) { 
+   if (x > 100) {
+     return (1)
+   }
+   exp(x) / (1+exp(x))
+ }
> res <- sapply(boots, inv.logit)

which yielded the following histogram:

ArtPoon commented 2 years ago

This yields the following mean and 95% confidence interval:

> summary(res)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.1084  0.9266  0.6319  1.0000  1.0000 
> quantile(res, c(0.025, 0.975))
 2.5% 97.5% 
    0     1 
ArtPoon commented 2 years ago

Since this seems to be associated with calling BA.1, we should check if our constellation file is out of date and if updating it ameliorates the problem

ArtPoon commented 2 years ago
Abayomi-Olabode commented 2 years ago

According to cov-lineages.org, our constellation list is missing 4 substitutions.

ArtPoon commented 2 years ago

We should be able to simply update our submodule clone

Abayomi-Olabode commented 2 years ago

Sure

ArtPoon commented 2 years ago

@GopiGugan we should always attempt to update the submodule every time the pipeline is being run

GopiGugan commented 2 years ago

@GopiGugan we should always attempt to update the submodule every time the pipeline is being run

Okay, I'll add that in

ArtPoon commented 2 years ago

At the meeting this week, the users indicated that they want these cases to be reported as "poor sequencing coverage" or "insufficient data to estimate" or something along those lines - rather than our current method of drawing the bar with mean and confidence interval for the bootstrap estimates.