Labo-Lacourse / stepmix

A Python package following the scikit-learn API for model-based clustering and generalized mixture modeling (latent class/profile analysis) of continuous and categorical data. StepMix handles missing values through Full Information Maximum Likelihood (FIML) and provides multiple stepwise Expectation-Maximization (EM) estimation methods.
https://stepmix.readthedocs.io/en/latest/index.html
MIT License

Suggested modifications of outputs #26

Open FelixLaliberte opened 1 year ago

FelixLaliberte commented 1 year ago

Hello,

It would be appreciated if some elements of the output could be changed so that StepMix produces output similar to that of other software and packages. Here is a list of suggestions:

1) Would it be possible to add the maximum log-likelihood value to the "fit" section of the output? This value would make it easier to compare the performance of StepMix with that of other software/packages. Similarly, would it be possible to add the entropy index there?

2) Is it possible to add an argument to the StepMix function that prints the maximum log-likelihood value "live" for each repetition (initialization), that is, at the end of each repetition? A "dynamic" output would let users estimate how long the model will take to converge. In StepMix, it is currently quite difficult to tell after 15 minutes of waiting whether it will take another 3 hours or 5 minutes before the output appears.

Here is an example of output in the R package poLCA, where the maximum value of log-likelihood is printed at the end of each repetition:

image

3) Would it be possible to modify the structure of the conditional probabilities section? For example, we get this output for a model with only 3 items (1 item with 3 categories and 2 items with 5 categories):

image

Ideally, it would be nice to have APA-style output (i.e. what can be found in articles). For example:

Screenshot, 2023-05-11 at 13:12:03

If it is too complex to implement, here is an example in poLCA. The output is much easier to read, especially in an exploratory research context. It would be nice if StepMix had a similar output, preferably with latent classes in columns.

image

4) In models with covariates, it might be preferable to report the results relative to a reference class (i.e. as in multinomial regression). Here is an example of output in poLCA from a model with 4 classes and a covariate (cov5) with 5 categories (cov50, cov51, ... cov54):

image

sachaMorin commented 1 year ago

Regarding 1, the average log-likelihood is already part of the output:

    ============================================================================
    Fit for 3 latent classes
    ============================================================================
    Estimation method             : 1-step
    Number of observations        : 150
    Number of latent classes      : 3
    Number of estimated parameters: 20
    Average log-likelihood        : -4.0430
    AIC                           : 1252.90
    BIC                           : 1313.12

The "average" refers to the average over samples (observations) for the best estimator, that is, the one with the highest likelihood over all initializations. So this is in fact the "best" or "max" log-likelihood, reported per sample. I can maybe update the string to make that point clearer.
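For comparison with software that reports the total (not per-sample) log-likelihood, the value can be recovered from the printed fit section. The sketch below uses only the numbers shown in the output above and the standard AIC/BIC formulas; the variable names are illustrative and not part of StepMix.

```python
import math

# Values copied from the fit section above.
n_obs = 150        # Number of observations
n_params = 20      # Number of estimated parameters
avg_ll = -4.0430   # Average log-likelihood (per sample, best initialization)

# Total log-likelihood = per-sample average times the number of observations.
total_ll = n_obs * avg_ll

# Standard information criteria computed from the total log-likelihood.
aic = -2 * total_ll + 2 * n_params
bic = -2 * total_ll + n_params * math.log(n_obs)

print(f"Total log-likelihood: {total_ll:.2f}")
print(f"AIC: {aic:.2f}")
print(f"BIC: {bic:.2f}")
```

The AIC reproduces the printed 1252.90, and the BIC agrees with the printed 1313.12 up to the rounding of the average log-likelihood to four decimals, which supports reading the reported value as a per-sample average.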

sachaMorin commented 1 year ago

I agree with point 2. I will add a progress bar or dynamic print so the user knows something is going on. This is basically issue #22.
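Until something like that is built in, a user-side workaround could be to run the initializations one at a time and print each log-likelihood as soon as it is available. This is only a sketch: `fit_with_live_report` and `make_model` are hypothetical names, and it assumes a scikit-learn-style estimator whose `score(X)` returns the average log-likelihood per sample.

```python
def fit_with_live_report(make_model, X, n_init=5):
    """Fit one single-start model per seed and report each result as it finishes.

    make_model is a hypothetical factory: it takes a seed and returns an
    unfitted estimator exposing fit(X) and score(X).
    """
    best_model, best_ll = None, float("-inf")
    for seed in range(n_init):
        model = make_model(seed)
        model.fit(X)
        # Assumption: score() is the per-sample average log-likelihood,
        # so multiplying by the sample count gives the total.
        ll = model.score(X) * len(X)
        print(f"Initialization {seed + 1}/{n_init}: log-likelihood = {ll:.4f}", flush=True)
        if ll > best_ll:
            best_model, best_ll = model, ll
    return best_model, best_ll
```

With StepMix, the factory might look like `lambda seed: StepMix(n_components=3, measurement="categorical", n_init=1, random_state=seed)`, although the exact arguments depend on the model being fitted. Running single-start fits in a loop trades the library's internal handling of `n_init` for visibility into each run.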

sachaMorin commented 1 year ago

I may take care of the entropy index, depending on the complexity. Could you provide a clear reference on how to compute it in a separate feature request?
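As a starting point for that feature request, one common definition is the scaled (relative) entropy reported by several latent class packages: one minus the total posterior entropy divided by n·ln(K). The sketch below assumes that definition; whether StepMix adopts it is not settled in this thread, and `relative_entropy` is an illustrative name.

```python
import math

def relative_entropy(posteriors):
    """Scaled entropy in [0, 1]; values near 1 indicate well-separated classes.

    posteriors: one row per observation, one posterior class probability
    per latent class (each row sums to 1).
    """
    n = len(posteriors)
    k = len(posteriors[0])
    if k < 2:
        raise ValueError("relative entropy needs at least two classes")
    total = 0.0
    for row in posteriors:
        for p in row:
            if p > 0:  # treat 0 * log(0) as 0
                total -= p * math.log(p)
    return 1.0 - total / (n * math.log(k))
```

The posterior matrix would typically come from something like `model.predict_proba(X)` on a fitted estimator.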

sachaMorin commented 1 year ago

Points 3 and 4 make sense, but I won't personally be implementing them. They should be discussed with Éric and added to the StepMix roadmap.

sachaMorin commented 1 year ago

Regarding 1 (the wording of the average log-likelihood line in the output): moved this to issue #27.

sachaMorin commented 1 year ago

Moved entropy to #32

sachaMorin commented 7 months ago

I feel like suggestion 3 (better outputs) has been addressed with StepMix v2, which now prints much nicer parameter DataFrames. If @FelixLaliberte agrees, I would rename this issue to better reflect suggestion 4 (covariate results with a reference class).
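For suggestion 4, a reference-class presentation can be derived from any set of multinomial logit coefficients, because subtracting the reference class's coefficients from every class leaves the predicted class probabilities unchanged. The sketch below illustrates that identity with plain lists. The layout `coefs[k][j]` (class k, covariate j, intercept included) and both function names are assumptions for illustration, not StepMix's internal representation.

```python
import math

def to_reference_class(coefs, ref=0):
    """Re-express coefficients relative to class `ref` (its row becomes all zeros)."""
    ref_row = coefs[ref]
    return [[c - r for c, r in zip(row, ref_row)] for row in coefs]

def class_probs(coefs, x):
    """Softmax class probabilities for one covariate vector x."""
    scores = [sum(c * v for c, v in zip(row, x)) for row in coefs]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Illustrative numbers only: 3 classes, intercept plus one covariate.
coefs = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.8]]
shifted = to_reference_class(coefs, ref=0)
x = [1.0, 2.0]  # intercept term and covariate value

print(shifted[0])             # reference class row is all zeros
print(class_probs(coefs, x))  # same probabilities as the shifted coefficients
print(class_probs(shifted, x))
```

Only the presentation changes: the remaining rows read as log-odds relative to the reference class, as in standard multinomial regression output.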