Labo-Lacourse / stepmix

A Python package following the scikit-learn API for model-based clustering and generalized mixture modeling (latent class/profile analysis) of continuous and categorical data. StepMix handles missing values through Full Information Maximum Likelihood (FIML) and provides multiple stepwise Expectation-Maximization (EM) estimation methods.
https://stepmix.readthedocs.io/en/latest/index.html
MIT License

Additional inference measures (LMR) #46

Open pcomw opened 1 year ago

pcomw commented 1 year ago

Once again, thank you for the wonderful work.

I hope to switch over to StepMix from MPlus, but a few tests that my group uses to evaluate models with different numbers of classes aren't yet in the package, and I was curious about the roadmap.

In the future, are there plans to add other inference measures to the stepmix class? I am thinking in particular of other IC and LRT-type stats:

  1. Sample-size adjusted BIC, e.g., -2 * model.score(X) * X.shape[0] + model.n_parameters * np.log((X.shape[0] + 2) / 24)

  2. CAIC, e.g., -2 * model.score(X) * X.shape[0] + model.n_parameters * (np.log(X.shape[0]) + 1)

  3. Bootstrap likelihood ratio test (BLRT). E.g., page 543 of https://doi.org/10.1080/10705510701575396

  4. Possibly also the Lo–Mendell–Rubin (LMR) and/or Vuong–Lo–Mendell–Rubin (VLMR) tests.

The IC stats are simple enough to implement, but the BLRT would take a little time.
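For concreteness, the two IC formulas above can be written as plain functions of the quantities already used in this thread (the average per-sample log-likelihood from `model.score(X)`, the parameter count, and the sample size). A minimal sketch of the formulas as stated in points 1 and 2, not StepMix's implementation:

```python
import numpy as np

def sabic(avg_log_likelihood, n_parameters, n_samples):
    # Sample-size adjusted BIC as written in point 1 above,
    # i.e. with (n + 2) / 24 inside the log penalty.
    return (-2 * avg_log_likelihood * n_samples
            + n_parameters * np.log((n_samples + 2) / 24))

def caic(avg_log_likelihood, n_parameters, n_samples):
    # Consistent AIC as written in point 2 above:
    # the BIC penalty log(n) plus 1, per free parameter.
    return (-2 * avg_log_likelihood * n_samples
            + n_parameters * (np.log(n_samples) + 1))
```

As with AIC/BIC, lower values favor the model.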

Thanks again

sachaMorin commented 1 year ago

Thanks for the suggestions! Sample-size adjusted BIC and CAIC seem easy enough. I'm not 100% sure about BLRT, but I would like to hear StepMix power user @FelixLaliberte's thoughts on this. Is this similar to what you are working on?

FelixLaliberte commented 1 year ago

Hi @pcomw,

I think all the inference measures you suggest are commonly used and would be much appreciated by users. Thanks for the suggestions!

@sachaMorin: To answer your question, the BLRT would indeed be very useful.

sachaMorin commented 1 year ago

After reviewing this paper, it seems the suggested sample-size adjusted BIC was missing an X.shape[0] inside the log. The currently suggested implementation is:

        n = X.shape[0]

        return -2 * self.score(X, Y) * n + self.n_parameters * np.log(
            n * ((n + 2) / 24)
        )
pcomw commented 1 year ago

I agree; based on the paper you linked, it does seem it should be n * ((n + 2) / 24).

However, in the MPlus output, there is a line of text that reads, (n* = (n + 2) / 24), which implies that the asterisk isn't a sign of multiplication, but a signifier of a different value. Similarly, on page 545 of https://doi.org/10.1080/10705510701575396, the text defines SABIC as replacing n in the original BIC calculation with n*, as seen here:

[Screenshot of page 545: the SABIC definition, with n replaced by n* in the BIC formula]

I suppose that the 1987 Sclove paper is the proper resource to check, but I don't have access to it.
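To make the two readings concrete, here is a small illustrative sketch (my own, not StepMix code) of the two candidate penalty terms; they differ by exactly n_parameters * log(n), since log(n * x) - log(x) = log(n):

```python
import numpy as np

# Two readings of the SABIC penalty discussed above (illustrative only):
# (a) n multiplied inside the log:        p * log(n * (n + 2) / 24)
# (b) n replaced by n* = (n + 2) / 24:    p * log((n + 2) / 24)
def penalty_n_times_nstar(n_parameters, n):
    return n_parameters * np.log(n * (n + 2) / 24)

def penalty_nstar_only(n_parameters, n):
    return n_parameters * np.log((n + 2) / 24)
```

With large samples, reading (a) penalizes extra classes noticeably more than reading (b), so the choice can change which number of classes is selected.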

I also came across a post with references about the VLMR-LRT in the tidyLPA GitHub repo that might be relevant, if you choose to implement that in addition to the BLRT: https://github.com/data-edu/tidyLPA/issues/178#issuecomment-951617013

sachaMorin commented 1 year ago

Linking this paper here for future reference on BLRT. I feel StepMix has all the building blocks to do this. Will try to include BLRT in the next release.
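The parametric-bootstrap machinery behind a BLRT can be sketched independently of mixtures. The toy below (my own illustration, not StepMix code) uses a nested pair of Gaussian models with closed-form MLEs: fit both models to the data, refit both to datasets simulated under the fitted null, and take the p-value as the fraction of bootstrap LR statistics at least as large as the observed one. In a k-1 vs k class BLRT, the two fits would instead be mixture models with k-1 and k components:

```python
import numpy as np

def loglik_fixed_var(x):
    # Null model H0: Normal(mu, 1), with mu at its MLE (the sample mean).
    n = x.size
    mu = x.mean()
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2), mu

def loglik_free_var(x):
    # Alternative H1: Normal(mu, sigma^2), both parameters at their MLEs.
    n = x.size
    s2 = np.mean((x - x.mean()) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * s2) + 1.0)

def bootstrap_lrt(x, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    ll0, mu0 = loglik_fixed_var(x)
    observed = 2 * (loglik_free_var(x) - ll0)
    count = 0
    for _ in range(n_boot):
        # Simulate under the fitted null, refit both models, recompute LR.
        xb = rng.normal(mu0, 1.0, size=x.size)
        ll0_b, _ = loglik_fixed_var(xb)
        count += 2 * (loglik_free_var(xb) - ll0_b) >= observed
    # Add-one smoothing keeps the p-value strictly above zero.
    return observed, (count + 1) / (n_boot + 1)
```

Because each bootstrap replicate refits both models, the cost for mixtures is roughly `n_boot` times two full EM fits, which is why BLRT is noticeably slower than the IC measures.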

sachaMorin commented 1 year ago

CAIC and SABIC are available as of version 2.0.0.

sachaMorin commented 7 months ago

BLRT is available as of version 2.2.0. See the updated tutorial.