Jackknife estimate of the variance

scj-robin commented 1 year ago

I do not get the same estimate of the sd using the jackknife option of PLN and doing it by hand. Did I do smthing wrong?

# Test Jackknife variance estimate

library(PLNmodels)

# Make parms
data("barents")
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(barents$Abundance ~ barents$Latitude + barents$Longitude + barents$Depth + barents$Temperature)
X <- lm(barents$Abundance[, 1] ~ barents$Latitude + barents$Longitude + barents$Depth + barents$Temperature, x=TRUE)$x

# Simulate data
Y <- matrix(rpois(n*p, exp(X%*%fit$model_par$B + mvtnorm::rmvnorm(n, sigma=fit$model_par$Sigma))), n, p)
d <- ncol(X)

# Jacknife sd estimate
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
betaSd1 <- as.vector(standard_error(pln, type='jackknife'))

# Jacknife sd estimate by hand
betaJK <- matrix(NA, n, p*d)
for(i in 1:n){
  plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ])
  betaJK[i, ]  <- as.vector(plnTmp$model_par$B)
}
betaSd2 <- sqrt((n-1)*(colMeans(betaJK^2) - (colMeans(betaJK)^2)))

# Comparison
plot(betaSd1, betaSd2)

mahendra-mariadassou commented 1 year ago

Hi @scj-robin, correct me if I'm wrong but the difference likely arises from the formula you used for the jackknife.

Forgetting matrix notations for a bit, and noting $\hat{\theta}_{(i)}$ the jackknife estimate of $\theta$ without the $i$-th sample and $\hat{\theta}_{(.)} = \frac{1}{n} \sum_{i} \hat{\theta}_{(i)}$ their mean, you use the expression

\text{var}_{jack}(\hat{\theta}) = \frac{1}{n} \sum_{i}^n ( \hat{\theta}_{(i)} - \hat{\theta}_{(.)})^2

whereas we use Tukey's expression, which is quite different (although slightly conservative, see for example DOI:10.1214/aos/1176345462)

\text{var}_{jack}(\hat{\theta}) = \frac{n-1}{n} \sum_{i}^n ( \hat{\theta}_{(i)} - \hat{\theta}_{(.)})^2

as shown here (where var_jack is $\sum_i ( \hat{\theta}_{(i)} - \hat{\theta}_{(.)})^2$:

https://github.com/PLN-team/PLNmodels/blob/2657e506675ad26e77df94f9c40f02ae11aa1042/R/PLNfit-class.R#L225

scj-robin commented 1 year ago

I do not think I use the first expression you suggest. I think that line

betaSd2 <- sqrt((n-1)*(colMeans(betaJK^2) - (colMeans(betaJK)^2)))

does compute the second formula you mention as the difference of colMeans is precisely

\frac1n \sum_i (\hat{\theta}_{(i)} - \hat{\theta}_{(.)})^2.

mahendra-mariadassou commented 1 year ago

You're completely right and I should stop trying to understand code late at night. I'll look into it.

mahendra-mariadassou commented 1 year ago

OK, I think I understand the source of the problem: our jackknife estimates are not variable enough. I made a branch to export the jackknife estimates (not just the jackknife variance estimator) and got this

# Test Jackknife variance estimate
library(PLNmodels)
#> This is packages 'PLNmodels' version 1.0.4-0200
#> Use future::plan(multicore/multisession) to speed up PLNPCA/PLNmixture/stability_selection.

# Make params
data("barents")
## To keep things fast, use only n = 20 samples
barents <- barents[1:20, ]
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)
#> 
#>  Initialization...
#>  Adjusting a full covariance PLN model with nlopt optimizer
#>  Post-treatments...
#>  DONE!
X <- model.matrix(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)

# Simulate data
B0 <- fit$model_par$B; Sigma0 <- fit$model_par$Sigma
Y <- matrix(rpois(n*p, exp(X %*% B0 + mvtnorm::rmvnorm(n, sigma= Sigma0))), n, p)
d <- ncol(X)

# All jackknife estimates
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
#> 
#>  Initialization...
#>  Adjusting a full covariance PLN model with nlopt optimizer
#>  Post-treatments...
#>   Computing jackknife variance estimator
#> 
#>  DONE!
jackknife_estimates <- attr(pln$model_par$B, "estimates_jackknife") |> purrr::map("B")
betaJK_pln <- matrix(NA, p*d, n)
## Compare jackknife estimates to full set estimate
par(mfrow = c(4, 5))
for(i in 1:n){
  betaJK_pln[, i] <- as.vector(jackknife_estimates[[i]])
  plot(pln$model_par$B, jackknife_estimates[[i]],
       xlab = "Coef. (full set)", ylab = "Coeff. (jackknife)",
       main = paste("Jackknife", i))
  abline(a = 0, b = 1, col = "red")
}

All jackknife estimators (computed within the PLN code) align perfectly with the full set estimators 😞


## Compute jackknife variance manually from jackknife estimates and compare with function output
## Use this form because the other leads to negative values due to numeric precision
betaSd1_manual <- sqrt((n-1)*(rowMeans( (betaJK_pln - rowMeans(betaJK_pln))^2)))
betaSd1_pln <- as.vector(standard_error(pln, type='jackknife'))
par(mfrow = c(1, 1))
plot(betaSd1_manual, betaSd1_pln,
     xlab = "Coef. (manual comp.)", ylab = "Coeff. (pln comp.)",
     main = "Jackknife variance estimators")
abline(a = 0, b = 1, col = "red")

No problem when computing the jackknife variance estimators from the jackknife estimators (manual computations are equal to computations done in the package ✅ )


# Jacknife sd estimate by hand
betaJK <- matrix(NA, p*d, n)
par(mfrow = c(4, 5))
for(i in 1:n){
  # cat(i, sep = "\n")
  plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ], control = PLN_param(trace = 0))
  plot(pln$model_par$B, plnTmp$model_par$B,
       xlab = "Coef. (full set)", ylab = "Coeff. (jackknife)",
       main = paste("Jackknife", i))
  abline(a = 0, b = 1, col = "red")
  betaJK[, i]  <- as.vector(plnTmp$model_par$B)
}

The jackknife estimators (computed manually) don't align perfectly with the full set estimators ✅


betaSd2_manual <- sqrt((n-1)*(rowMeans( (betaJK - rowMeans(betaJK))^2)))

# Comparison
par(mfrow = c(1, 1))
plot(betaSd1_manual, betaSd2_manual,
     xlab = "Coef. (pln comp.)", ylab = "Coeff. (manual comp.)",
     main = "Jackknife variance estimators")

Hence the difference between the values from the package and the values computed by hand.

^{Created on 2023-11-16 with reprex v2.0.2}

The problem likely arises from the initialization in the jackknife loop: we reuse the values of $B$ and $S$ (but not $M$) estimated from the full set model and the iterative algorithm doesn't appear to move away from them.

https://github.com/PLN-team/PLNmodels/blob/2657e506675ad26e77df94f9c40f02ae11aa1042/R/PLNfit-class.R#L208-L218

mahendra-mariadassou commented 1 year ago

If we perturb $B$ in the initialization of the jackknife estimators (here a multiplicative perturbation),

https://github.com/PLN-team/PLNmodels/blob/f89d7d083c3cca2d671118206e6862b4d4d150cc/R/PLNfit-class.R#L215-L219

we get more credible results when comparing jackknife estimators to full set estimators

and in turn the standard deviations obtained "directly from the package" and "manually" are closer to each other but not yet perfectly aligned (beware of the log-scale), probably because of the difference in starting points:

scj-robin commented 1 year ago

The thing is that the manual estimates seem to better fit the package ones, when compared to empirical variability of the estimates observed when simulating several datasets with same true parameters. I copy below more or less the same code as the initial one increasing the number of observations et adding the empirical sd.

# Test Jackknife variance estimate

library(PLNmodels); library(mvtnorm)

# Make parms
data("barents")
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(barents$Abundance ~ barents$Latitude + barents$Longitude + barents$Depth + barents$Temperature)
X <- lm(barents$Abundance[, 1] ~ barents$Latitude + barents$Longitude + barents$Depth + barents$Temperature, x=TRUE)$x

# Simulate data
n <- 5e2
X <- X[sample(1:nrow(X), n, replace=TRUE), ]
Y <- matrix(rpois(n*p, exp(X%*%fit$model_par$B + rmvnorm(n, sigma=fit$model_par$Sigma))), n, p)
d <- ncol(X)

# Jacknife sd estimate
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
betaSdPackage <- as.vector(standard_error(pln, type='jackknife'))

# Jacknife sd estimate by hand
betaJK <- matrix(NA, n, p*d)
for(i in 1:n){
  if(i %% round(sqrt(n))==0){cat(i, '')}
  plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ], control=PLN_param(trace=0))
  betaJK[i, ]  <- as.vector(plnTmp$model_par$B)
}
cat('\n')
betaSdByHand <- sqrt((n-1)*(colMeans(betaJK^2) - (colMeans(betaJK)^2)))

# Empirical sd estimate 
B <- n
betaEmp <- matrix(NA, B, p*d)
for(b in 1:B){
  if(b %% round(sqrt(B))==0){cat(b, '')}
  Ytmp <- matrix(rpois(n*p, exp(X%*%fit$model_par$B + rmvnorm(n, sigma=fit$model_par$Sigma))), n, p)
  plnTmp <- PLN(Ytmp ~ -1 + X, control=PLN_param(trace=0))
  betaEmp[b, ]  <- as.vector(plnTmp$model_par$B)
}
betaSdEmpirical <- apply(betaEmp, 2, sd)

# Comparison
plot(as.data.frame(cbind(betaSdPackage, betaSdByHand, betaSdEmpirical)))

mahendra-mariadassou commented 1 year ago

I think this is the same problem. Right now, the jackknife estimators we get "from the package" are underdispersed because the starting point we use makes each jackknife estimator $B_{(i)}$ almost identical to the full dataset one $\hat{B}$.

When you do by it by hand, you use a different starting point for each jackknife dataset and you get jackknife estimator $B_{(i)}$ which are much more dispersed and mimic better the empirical estimate. We need to change the initialization for the jackknife estimator (in the package) to avoid being stuck in $\hat{B}$. The challenge consists in striking the right balance between recycling previous computations (e.g. reusing $\hat{B}$ as such, what's done in the package, which leads to underdispersion) and starting from scratch each time (what you did by hand, which is computationally more intensive).

jchiquet commented 1 year ago

Thank @scj-robin for the report and to @mahendra-mariadassou for the diagnostic.

As is often the case, we are reduced to a problem of local V-EM minima and/or difficult optimization in a flat area of the ELBO, with potentially strong variations in parameter values. As far as prediction is concerned, we don't really care, but to our great misfortune we do statistical models where parameter interpretation is important.

This is particularly visible on Stéphane's favorite data, Barent Fish, and not for the first time.

In any case, it's better to start again from scratch than to save calculation time. Easily parallelizable with future_lapply, which hasn't been done yet.

mahendra-mariadassou commented 1 year ago

TODO: change initialization in variance_jackknife() and variance_bootstrap() to use same starting point as model inferred "from scratch"

mahendra-mariadassou commented 1 year ago

I'm happy to report that we have a working fix:

library(PLNmodels); library(mvtnorm)

# Make params
data("barents")
n <- nrow(barents$Abundance); p <- ncol(barents$Abundance)
fit <- PLN(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)
X <- model.matrix(Abundance ~ Latitude + Longitude + Depth + Temperature, data = barents)

# Simulate data (n = 100)
n <- 100
X <- X[sample(1:nrow(X), n, replace=TRUE), ]
B0 <- fit$model_par$B; Sigma0 <- fit$model_par$Sigma
Y <- matrix(rpois(n*p, exp(X %*% B0 + mvtnorm::rmvnorm(n, sigma= Sigma0))), n, p)
d <- ncol(X)

# Jackknife sd estimate (using the package)
tictoc::tic()
pln <- PLN(Y ~ -1 + X, control=PLN_param(config_post = list(jackknife=TRUE)))
#> 
#>  Initialization...
#>  Adjusting a full covariance PLN model with nlopt optimizer
#>  Post-treatments...
#>   Computing jackknife variance estimator
#> 
#>  DONE!
tictoc::toc()
#> 4.296 sec elapsed

betaSdPackage <- as.vector(standard_error(pln, type='jackknife'))
jackknife_estimates <- attr(pln$model_par$B, "estimates_jaccknife") |> purrr::map("B")
betaJKPackage <- matrix(NA, n, p*d)
for(i in 1:n){
  betaJKPackage[i, ]  <- as.vector(jackknife_estimates[[i]])
}

# Jackknife sd estimate (by hand)
tictoc::tic()
betaJK <- matrix(NA, n, p*d)
for(i in 1:n){
  plnTmp <- PLN(Y[-i, ] ~ -1 + X[-i, ], control=PLN_param(trace=0))
  betaJK[i, ]  <- as.vector(plnTmp$model_par$B)
}
tictoc::toc()
#> 20.509 sec elapsed
betaSdByHand <- sqrt((n-1)*apply(betaJK, 2, function(x) { mean(x^2) - mean(x)^2 }))

We can check that the two options (by hand and through the package) return identical results with a figure

## Check that "automatic" and "manual" jackknife estimates are identical
plot(as.vector(betaJK), as.vector(betaJKPackage),
     ylab = "Computed from package", xlab = "Computed by hand",
     main = "Jackknife estimates (all coefficients, all samples)")
abline(0, 1, col = "red")

Or by looking at the difference between coefficients

max(abs(betaJK - betaJKPackage))
#> [1] 0

But using the package is faster (4.296 sec versus 20.509 sec in this example) than doing it by hand (less overhead). And the jackknife estimator now mimics correctly the empirical sd.

# Empirical sd estimate
B <- n
betaEmp <- matrix(NA, B, p*d)
for(b in 1:B){
  if(b %% 10==0){cat(b, '')}
  Ytmp <- matrix(rpois(n*p, exp(X %*% B0 + rmvnorm(n, sigma= Sigma0))), n, p)
  plnTmp <- PLN(Ytmp ~ -1 + X, control=PLN_param(trace=0))
  betaEmp[b, ]  <- as.vector(plnTmp$model_par$B)
}
#> 10 20 30 40 50 60 70 80 90 100
cat("\n")
betaSdEmpirical <- apply(betaEmp, 2, sd)

# Comparison
plot(as.data.frame(cbind(betaSdPackage, betaSdByHand, betaSdEmpirical)))

^{Created on 2023-11-17 with reprex v2.0.2}

@jchiquet, I'll clean the code and do the PR.

jchiquet commented 1 year ago

Great news. Thanks a lot for the fix: how did you manage to do it?

mahendra-mariadassou commented 1 year ago

Not in a very smart way, I essentially extracted the computations for the starting point of $M$, $B$ and $S$ into their own function:

https://github.com/PLN-team/PLNmodels/blob/c740f2d291ff700a5966e4fe469d234609340b1c/R/utils.R#L256-L265

and used it in variance_*() functions. https://github.com/PLN-team/PLNmodels/blob/c740f2d291ff700a5966e4fe469d234609340b1c/R/PLNfit-class.R#L218-L223

I need to check that the tests pass and we should be good.

jchiquet commented 1 year ago

Efficiency and accuracy sound smart to me.

PLN-team / PLNmodels

Jackknife estimate of the variance #112