Labo-Lacourse / stepmix

A Python package following the scikit-learn API for model-based clustering and generalized mixture modeling (latent class/profile analysis) of continuous and categorical data. StepMix handles missing values through Full Information Maximum Likelihood (FIML) and provides multiple stepwise Expectation-Maximization (EM) estimation methods.
https://stepmix.readthedocs.io/en/latest/index.html
MIT License

Likelihood of covariate model #15

Closed sachaMorin closed 1 year ago

sachaMorin commented 1 year ago

We currently treat the covariate model like any other emission model: we fit class weights p(X_i), a measurement model p(Y_i | X_i), and a covariate (structural) model p(X_i | Z_p), then combine those factors in a mixture likelihood. Since we specify p(X_i | Z_p) and not p(Z|X_o), I think our current conditional likelihood formulation for the E-step is wrong for the covariate case.

In "Two-step estimation of models between latent classes and external variables" (Bakk and Kuha, 2018), the likelihood factorization (eq. 2) includes no marginal on X_i. The marginal p(Z_p) is used instead, and in practice it is ignored (see bottom of p. 8). This yields a simplified likelihood based only on the p(X_i | Z_p) and p(Y_i | X_i) factors.
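
To make the difference concrete, here is a sketch of the two factorizations as I read them from the description above. It is a paraphrase, not a transcription of eq. 2 in the paper. Observation subscripts are dropped, and \pi_c (the class weight p(X = c)) and C (the number of classes) are notation introduced here for illustration.

```latex
% Mixture likelihood with class weights (the current treatment described above):
\ell_{\text{mix}}(Y, Z_p) = \sum_{c=1}^{C} \pi_c \, p(Y \mid X = c) \, p(X = c \mid Z_p)

% Covariate likelihood without a marginal on X:
p(Y, Z_p) = p(Z_p) \sum_{c=1}^{C} p(X = c \mid Z_p) \, p(Y \mid X = c)

% p(Z_p) does not depend on the class, so it cancels in the posterior:
p(X = c \mid Y, Z_p) =
  \frac{p(X = c \mid Z_p) \, p(Y \mid X = c)}
       {\sum_{c'=1}^{C} p(X = c' \mid Z_p) \, p(Y \mid X = c')}
```

Under the second form, the class weights \pi_c never enter the E-step posterior, which is consistent with dropping them when a covariate model is present.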

I ended up reviewing this because our covariate simulation was underperforming (scripts/run_bakk_simulation.py). Ignoring the contribution of the class weights in the StepMix E-step (see below) improves performance and, I believe, matches the Bakk covariate likelihood.

https://github.com/Labo-Lacourse/stepmix/blob/ed254272affe36428d2720b4d4f49468e6168fcd/stepmix/stepmix.py#L860

Should we add a condition to turn off the class weights in the E-step if we have a covariate structural model?
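
A minimal sketch of what such a condition could look like, assuming the E-step works on per-class log-likelihoods. The function `e_step`, its arguments, and the `has_covariate` flag are hypothetical names for illustration and are not the actual StepMix code at the line linked above.

```python
import numpy as np


def e_step(log_weights, log_emissions, use_class_weights=True):
    """Hypothetical E-step returning (n_samples, n_classes) posteriors.

    log_weights: (n_classes,) log class weights, log p(X = c).
    log_emissions: (n_samples, n_classes) summed log-likelihoods of all
        fitted models, e.g. log p(Y_i | X = c) + log p(X = c | Z_i).
    """
    log_resp = np.asarray(log_emissions, dtype=float)
    if use_class_weights:
        # Standard mixture case: add the log class weights.
        log_resp = log_resp + np.asarray(log_weights, dtype=float)
    # Normalize each row in log-space for numerical stability.
    log_norm = np.logaddexp.reduce(log_resp, axis=1, keepdims=True)
    return np.exp(log_resp - log_norm)


# Toy inputs: two samples, two classes.
log_weights = np.log([0.9, 0.1])
log_emissions = np.log(np.array([[0.2, 0.8], [0.5, 0.5]]))

# Assumed flag: True when the structural model is a covariate model.
has_covariate = True
resp = e_step(log_weights, log_emissions, use_class_weights=not has_covariate)
resp_weighted = e_step(log_weights, log_emissions, use_class_weights=True)
```

With the weights turned off, the posteriors depend only on the measurement and covariate factors. With them on, the same inputs are pulled toward the heavier class.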

sachaMorin commented 1 year ago

This is a likely cause of issue #7.