Labo-Lacourse / stepmix

A Python package following the scikit-learn API for model-based clustering and generalized mixture modeling (latent class/profile analysis) of continuous and categorical data. StepMix handles missing values through Full Information Maximum Likelihood (FIML) and provides multiple stepwise Expectation-Maximization (EM) estimation methods.
https://stepmix.readthedocs.io/en/latest/index.html
MIT License

LCA - Data order #61

Closed yuanjames closed 2 weeks ago

yuanjames commented 4 months ago

Hi,

I have recently run a series of experiments and found, to my surprise, that the results change when I shuffle the data (all other settings the same).

I would have expected LCA to give the same results regardless of row order. Could shuffling the data affect convergence? Am I right? If we want identical results, do we need to change the parameters of the LCA?

sachaMorin commented 4 months ago

By "other settings same", do you mean you're shuffling the order of the variables?

Shuffling should not affect the fit quality of the overall model, but it could affect the order of the parameters. It would be really helpful if you could provide a minimal example reproducing what you observed, perhaps with one of the datasets in stepmix.datasets.
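To illustrate the label-order point: two clusterings that differ only by a relabeling of the components describe the same partition. A permutation-invariant score such as scikit-learn's adjusted_rand_score treats them as identical (a minimal sketch, not StepMix-specific):

```python
from sklearn.metrics import adjusted_rand_score

# Two labelings of the same six points: the cluster names differ,
# but the grouping itself is identical
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]

# The adjusted Rand index is invariant to label permutation:
# identical partitions score exactly 1.0
print(adjusted_rand_score(labels_a, labels_b))  # 1.0
```

So when comparing runs on shuffled data, it can help to compare partitions with a score like this rather than comparing raw label values.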

yuanjames commented 4 months ago

Sorry, I just realised that I made a mistake yesterday, so I have updated the example I used. Please take a look @sachaMorin.

from sklearn.datasets import load_iris
from stepmix.stepmix import StepMix

df, target = load_iris(return_X_y=True, as_frame=True)
df['iris_flower_type'] = target.map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
df = df.sample(frac=1)  # shuffle the rows
continuous_features = ['sepal length (cm)', 'sepal width (cm)',
                       'petal length (cm)', 'petal width (cm)']
continuous_data = df[continuous_features]
continuous_data = continuous_data.sample(frac=1)  # shuffle again
model = StepMix(n_components=5, measurement="continuous", verbose=1, random_state=123)
model.fit(continuous_data)
df['continuous_pred'] = model.predict(continuous_data)

Every time I run the code, it gives me different crosstab results. For example,

No. 1:

continuous_pred    0    1    2    3    4
                   0   50    0    0    0
                  23    0   23    4    0
                   1    0    0   20   29

No. 2:

continuous_pred    0    1    2    3    4
                   0    0   50    0    0
                  24   16    0   10    0
                  15    0    0    0   35

yuanjames commented 4 months ago

If my understanding is correct: if an LCA model with fixed hyperparameters always reaches convergence after the data are shuffled, then shuffling won't change the crosstab results. However, if the model doesn't reach convergence, shuffling can change the results.

I tried n_components = 2 or 3 and shuffling did not change the results, but once I changed it to 5, as the example above shows, it did. Am I correct?
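One way to probe this (a sketch using scikit-learn's GaussianMixture as a stand-in for StepMix, since both follow the scikit-learn API) is to fit the same model on the original and on shuffled rows, predict on the same rows, and compare the two partitions with a permutation-invariant score:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X, _ = load_iris(return_X_y=True)
perm = np.random.default_rng(0).permutation(len(X))

# Same hyperparameters and seed; only the row order of the training data differs
gm_orig = GaussianMixture(n_components=5, random_state=123).fit(X)
gm_shuf = GaussianMixture(n_components=5, random_state=123).fit(X[perm])

# Predict on the SAME rows so the two partitions are directly comparable
ari = adjusted_rand_score(gm_orig.predict(X), gm_shuf.predict(X))
print(ari)
```

An ARI of 1.0 means both runs converged to the same partition (up to relabeling); lower values mean the row order pushed EM toward a different local optimum, which tends to happen more often with more components.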

sachaMorin commented 4 months ago

Looking at your previous results, the clusterings still look good. Each cluster captures a class (or a part of it if you have more clusters than classes).

It's also possible that this is caused by numerical issues. For example, the sum of an ndarray may actually vary slightly if you shuffle the elements due to the summing order. See the following program:

import numpy as np

np.random.seed(123)
a = np.random.random(100)
b = np.copy(a)
np.random.shuffle(b)  # same values, different order
sum_a = np.sum(a)
sum_b = np.sum(b)
print(sum_a)
print(sum_b)
print(sum_a == sum_b)

Output:

50.14288800514812
50.142888005148116
False
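For what it's worth, this order-dependence is a floating-point effect rather than a bug: addition of floats is not associative. As a side note (not something StepMix uses internally, as far as I know), Python's math.fsum computes a correctly rounded exact sum and is therefore permutation-invariant:

```python
import math
import numpy as np

np.random.seed(123)
a = np.random.random(100)
b = np.copy(a)
np.random.shuffle(b)  # same values, different order

# fsum tracks exact partial sums, so the result is independent of summation order
print(math.fsum(a) == math.fsum(b))  # True
```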

Given the numerous sums and means taken during StepMix estimation, those small differences can compound and could potentially explain what we're seeing here. I'm not certain this is the cause, and I would be interested in seeing how other libraries behave.

sachaMorin commented 2 weeks ago

Closing. Feel free to reopen if you still want to discuss.