Labo-Lacourse / stepmix

A Python package following the scikit-learn API for model-based clustering and generalized mixture modeling (latent class/profile analysis) of continuous and categorical data. StepMix handles missing values through Full Information Maximum Likelihood (FIML) and provides multiple stepwise Expectation-Maximization (EM) estimation methods.
https://stepmix.readthedocs.io/en/latest/index.html
MIT License

Tricky Results - Potential Bug #60

Closed by yuanjames 3 months ago

yuanjames commented 4 months ago

Hi,

I recently ran LCA with measurement = binary. The results showed 13 classes in total; however, I found that 6 of them (classes 1, 2, 4, 5, 6, 9) were exactly the same according to model.get_mm_df(). I then looked at model.predict(X) and found that labels 1, 2, 4, 5, and 9 never appeared: no data points were assigned to these classes. So I merged them manually.

I also checked the crosstab, and the aforementioned classes were missing there as well. The total number of classes was selected by grid search; I assume 13 produced the best metric value, but in fact there were only 8 distinct classes.

Does anyone know the reason?

sachaMorin commented 4 months ago

Thanks for reporting this.

  1. Can you check the observations from classes 1,2,4,5,6,9? Specifically, are they identical or extremely similar?
  2. Have you tried fitting an estimator with fewer classes? I would consider setting n_components=8.
  3. Some classes never getting predicted can happen. The class prediction is an argmax over the probability of belonging to each class. You can check those probabilities directly with predict_proba.
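To illustrate point 3, here is a minimal NumPy-only sketch (the probability values are made up, not from your model) showing that the predicted label is just an argmax over the membership probabilities, and that near-identical columns make the winning label numerically arbitrary:

```python
import numpy as np

# Hypothetical class membership probabilities for 3 observations
# over 4 classes (rows sum to 1; values are illustrative only).
proba = np.array([
    [0.05, 0.30, 0.30, 0.35],
    [0.10, 0.299999, 0.300000, 0.300001],  # classes 1-3 nearly tied
    [0.25, 0.25, 0.25, 0.25],              # exact four-way tie
])

# predict() amounts to an argmax over predict_proba();
# exact ties resolve to the lowest class index.
labels = proba.argmax(axis=1)
print(labels)  # -> [3 3 0]
```

In the second row, class 3 wins only because it is larger by 1e-6; with duplicated class parameters, which class "wins" is essentially a numerical accident.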
yuanjames commented 4 months ago

> Thanks for reporting this.
>
>   1. Can you check the observations from classes 1,2,4,5,6,9? Specifically, are they identical or extremely similar?
>   2. Have you tried fitting an estimator with fewer classes? I would consider setting n_components=8.
>   3. Some classes never getting predicted can happen. The class prediction is an argmax over the probability of belonging to each class. You can check those probabilities directly with predict_proba.

Hi,

  1. I could not check the observations from classes 1, 2, 4, 5, and 9, because no observation is classified with those labels. I did check the observations in class 6, and yes, they are identical.
  2. Yes, I ran a grid search over the number of classes, and it shows 13 is the best. I also tried 8, but then the crosstab only shows 5 classes.
  3. Thanks for your answer, I will check. Much appreciated for the great work; I like StepMix.
sachaMorin commented 4 months ago

Given that the 6 classes are identical in terms of parameters, you should see very similar probabilities in predict_proba for the observations that get assigned to class 6. I suspect 6 gets predicted essentially because it's numerically slightly more likely.

What seems to be happening here is that multiple classes latch on to the same data cluster.
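If you do want to merge duplicated classes rather than refit, one option is to sum the posterior columns of classes whose parameters coincide before taking the argmax. A sketch with made-up numbers (the parameter matrix and probabilities are illustrative, not actual StepMix output):

```python
import numpy as np

# Hypothetical class-conditional parameters (rows = classes, cols = items).
params = np.array([
    [0.9, 0.1, 0.8],
    [0.9, 0.1, 0.8],  # duplicate of class 0
    [0.2, 0.7, 0.3],
    [0.9, 0.1, 0.8],  # another duplicate of class 0
])

# Hypothetical membership probabilities for 2 observations over the 4 classes.
proba = np.array([
    [0.25, 0.25, 0.10, 0.40],
    [0.05, 0.05, 0.80, 0.10],
])

# Group classes whose (rounded) parameter rows are identical.
_, group = np.unique(params.round(6), axis=0, return_inverse=True)

# Sum posterior probability mass within each group of duplicates.
merged = np.zeros((proba.shape[0], group.max() + 1))
for k, g in enumerate(group):
    merged[:, g] += proba[:, k]

labels = merged.argmax(axis=1)
print(labels)
```

The merged columns still sum to 1 per observation, so the merged labels reflect the total mass each underlying cluster received.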

I would consider testing different validation metrics, including AIC or BIC, which penalize unnecessarily complex models. You can also plot the metrics for different numbers of components (we did something similar in this tutorial). 13 components might be selected as the best fit, but you may observe an elbow at n_components < 13 followed by a plateau with negligible improvements.
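The model-selection loop can be sketched with scikit-learn's GaussianMixture on synthetic continuous data (GaussianMixture exposes `bic`; StepMix's documentation describes analogous information criteria). The same sweep over n_components applies to a StepMix estimator on your binary data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D data with 3 well-separated clusters.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0.0, 4.0, 8.0)])

# Sweep the number of components and record BIC (lower is better).
ks = range(1, 8)
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X) for k in ks]

best_k = list(ks)[int(np.argmin(bics))]
print(best_k)  # BIC recovers the 3 true clusters here
```

Plotting `bics` against `ks` makes the elbow-then-plateau pattern visible, which is often more informative than blindly taking the argmin of a likelihood-based grid-search score.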

sachaMorin commented 3 months ago

@yuanjames are you still stuck with this? I will close, but feel free to reopen if needed.