PLN-team / PLNmodels

A collection of Poisson lognormal models for multivariate count data analysis
https://pln-team.github.io/PLNmodels
GNU General Public License v3.0
54 stars 18 forks source link

Big variation of the 'loglik' after one additional VE step #91

Closed scj-robin closed 1 year ago

scj-robin commented 2 years ago

The ELBO (named 'loglik' in the output of the PLN function) seem to vary dramatically when performing one additional VE step, while convergence is supposed to be reached.

Considering the Barents data with no offset, Y = counts (n=59, p=30)and X = intercept + covariates (d=5), we get:

fit <- PLN(Y ~ -1+X, control=list(trace=0))
VE <- fit$VEstep(covariates=X, offsets=matrix(0, n, p), responses=Y, weights = rep(1,n))
print(c(fit$loglik, sum(VE$log.lik)))
> [1] -8989.666 -6655.017

Using a homemade function to calculate the ELBO:

ELBO <- function(Y, X, M, S2, Theta, Sigma){
   n <- nrow(Y); p <- ncol(Y); Omega <- solve(Sigma); Beta <- t(Theta)
   EqLogPZ <- -n/2 * log(det(Sigma)) -0.5 * sum(diag(M %*% Omega %*% t(M))) - 
     0.5 * sum(sapply(1:n, function(i){diag(Omega%*%diag(S2[i, ]))}))
   EqLogPY.Z <- -sum(exp(X%*%Beta + M + S2/2)) + sum(Y * (X %*% Beta + M)) - sum(lgamma(Y+1))
   EqLogQZ <- -0.5 * sum(log(S2)) -n*p/2
   return(list(EqLogPZ=EqLogPZ, EqLogPY.Z=EqLogPY.Z, EqLogQZ=EqLogQZ, 
               elbo=EqLogPZ + EqLogPY.Z - EqLogQZ))
 }

we get different but similar results:

print(c(ELBO(Y, X, fit$var_par$M, fit$var_par$S2, fit$model_par$Theta, fit$model_par$Sigma)$elbo, 
        ELBO(Y, X, VE$M, VE$S2, fit$model_par$Theta, fit$model_par$Sigma)$elbo))
> [1] -8989.132 -5630.619

It seems that the variational parameters M and S2 vary quite a lot during the additional VE step:

print(c(max(abs(fit$var_par$M - VE$M)), max(abs(fit$var_par$S2 - VE$S2))))
> [1] 1.964532 1.083126
mahendra-mariadassou commented 2 years ago

2 remarks:

library(PLNmodels)
load("../Data/BarentsFish/BarentsFish.Rdata")
n <- nrow(Data$count)
p <- ncol(Data$count)
X <- Data$covariates
Y <- Data$count
O <- matrix(0, n, p)
w <- rep(1, n)
fit <- PLN(count ~ -1 + covariates, control=list(trace=1), data = Data)
fit_new <- fit$clone(deep = TRUE)
fit_new$optimize(covariates=X, offsets=O, responses=Y, weights = w, control = PLNmodels:::PLN_param(NULL, n, p))
print(c(fit$loglik, fit_new$loglik))
[1] -9035.610 -8755.656

And produce an even greater jump by manually updating the variational parameters (with VEstep()) before calling optimize()

VE <- fit$VEstep(covariates= X, offsets= O, responses= Y, weights = w)
fit_new$update(M = VE$M, S2 = VE$S2)
fit_new$optimize(covariates=X, offsets=O, responses=Y, weights = w, control = PLNmodels:::PLN_param(NULL, n, p))
print(c(fit$loglik, fit_new$loglik))
[1] -9035.61 -5343.67

And repeat the process

VE <- fit_new$VEstep(covariates= X, offsets= O, responses= Y, weights = w)
fit_new$update(M = VE$M, S2 = VE$S2)
fit_new$optimize(covariates=X, offsets=O, responses=Y, weights = w, control = PLNmodels:::PLN_param(NULL, n, p))
print(c(fit$loglik, fit_new$loglik))
[1] -9035.610 -5077.922

This looks (at least for the first block) like a problem with the optimizer although for all model we have

fit$optim_par$message
[1] "xtol_rel or xtol_abs was reached"
fit_new$optim_par$message
[1] "xtol_rel or xtol_abs was reached"
mahendra-mariadassou commented 2 years ago

After a bit of investigation, it seems like convergence is incomplete with the default parameters. If I change xtol_rel to 1e-6 (instead of the default 1e-4)

library(PLNmodels)
#> This is packages 'PLNmodels' version 0.11.6-9200
#> Use future::plan(multicore/multisession) to speed up PLNPCA/PLNmixture/stability_selection.
load("~/Documents/PLN/Data/BarentsFish/BarentsFish.Rdata")
n <- nrow(Data$count)
p <- ncol(Data$count)
X <- Data$covariates
Y <- Data$count
O <- matrix(0, n, p)
w <- rep(1, n)
fit <- PLN(count ~ -1 + covariates, data = Data, control = list(xtol_rel = 1e-6, trace = 1))
#> 
#>  Initialization...
#>  Adjusting a PLN model with full covariance model
#>  Post-treatments...
#>  DONE!
fit$optim_par
#> $iterations
#> [1] 3533
#> 
#> $message
#> [1] "ftol_rel or ftol_abs was reached"
fit_new <- fit$clone(deep = TRUE)
fit_new$optimize(covariates=X, offsets=O, responses=Y, weights = w, control = PLNmodels:::PLN_param(list(xtol_rel = 1e-6), n, p))
fit_new$optim_par
#> $iterations
#> [1] 11
#> 
#> $message
#> [1] "xtol_rel or xtol_abs was reached"
print(c(fit$loglik, fit_new$loglik))
#> [1] -4401.694 -4401.691
VE <- fit$VEstep(covariates= X, offsets= O, responses= Y, weights = w, control = PLNmodels:::PLN_param(list(xtol_rel = 1e-6), n, p))
fit_new$update(Theta = fit$model_par$Theta, Sigma = fit$model_par$Sigma, M = VE$M, S2 = VE$S2)
fit_new$optimize(covariates=X, offsets=O, responses=Y, weights = w, control = PLNmodels:::PLN_param(list(xtol_rel = 1e-6), n, p))
fit_new$optim_par
#> $iterations
#> [1] 11
#> 
#> $message
#> [1] "xtol_rel or xtol_abs was reached"
print(c(fit$loglik, fit_new$loglik))
#> [1] -4401.694 -4401.673

We can check visually that the parameters are qualitatively comparable in fit and fit_new.

par(mfrow = c(2, 2))
plot(fit$var_par$M ~ fit_new$var_par$M);abline(a = 0,b = 1, col = "red")
plot(fit$var_par$S2 ~ fit_new$var_par$S2);abline(a = 0,b = 1, col = "red")
plot(fit$model_par$Theta ~ fit_new$model_par$Theta);abline(a = 0,b = 1, col = "red")
par(mfrow = c(1, 1))

In summary, the main problem with the code appears to be the default value of xtol_rel which leads to faster but incomplete optimization (+ the typo that will be fixed)

Created on 2022-08-09 by the reprex package (v2.0.1)

scj-robin commented 2 years ago

Thank you for your superfast answer. I actually do not get the same numerical value when running your code on my laptop (I wonder if some difference may exist between our respective datafiles). In conclusion, do you suggest that, when mostly interested in the value of the elbo and/or the variational parameters, we always run the code with the control option xtol_rel = 1e-6, or even less?

mahendra-mariadassou commented 2 years ago

The differences are probably due to my computer having a slightly different version of R/C++ than yours. The question about the elbo is a tricky one ! I don't have general suggestions apart from checking convergence like you did:

scj-robin commented 2 years ago

Using homemade functions for both the VE and the M step, it seems that the biggest changes occur in S2. Could the problem be related to some boundary imposed to S2 in the VEM algorithm?

jchiquet commented 2 years ago

Hi,

Thank Stéphane for raising this issue and to Mahendra for his responsiveness.

Some comments afterwards:

When we set the parameters of the optimization methods in PLNmodels, we did our best to reach a good compromise between good optimization and performance. We have checked on a number of problems that the optimum was reached. This is obviously not the case for barent fish data. I am not too much in favor of lowering the tolerance threshold of the convergence of parameters (xtol). We use the default of the NLOpt optimization library and I think it's fine like that.

On the other hand, I'm considering integrating Bastien's code as an alternative solution for the optimization, which uses the Rprop optimizer from torch, now well interfaced with R... It works very well, but it probably means more effort on the user's side to check that the model has been properly adjusted.

Finally, concerning the remark about the optimization in S²: in fact, we optimize in S such that S*S = S² by not putting any constraint on S, so the problem pointed out by Stéphane is not directly linked to the big variations of the ELBO, but maybe indirectly.

jchiquet commented 2 years ago

Ok, just to let you know that, after a couple of tests,