Closed yuanjames closed 2 months ago
By "other settings same", do you mean you're shuffling the order of the variables?
Shuffling should not affect the fit quality of the overall model, but could affect the order of the parameters. It would be really helpful if you could provide a minimal example to reproduce what you observed, perhaps with one of the datasets in stepmix.datasets.
Sorry, I just realised that I made a mistake yesterday, so I have updated the example I used. Please check, @sachaMorin.
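import pandas as pd
from sklearn.datasets import load_iris
from stepmix.stepmix import StepMix

# Load iris and attach the species name to each row
df, target = load_iris(return_X_y=True, as_frame=True)
df['iris_flower_type'] = target.map({0:'setosa', 1:'versicolor', 2:'virginica'})
df = df.sample(frac=1) # shuffle
continuous_features = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
                       'petal width (cm)']
continuous_data = df[continuous_features]
continuous_data = continuous_data.sample(frac=1) # shuffle again before fitting
model = StepMix(n_components=5, measurement="continuous", verbose=1, random_state=123)
model.fit(continuous_data)
# predict() returns a plain array in the row order of continuous_data, so align it on the index
df['continuous_pred'] = pd.Series(model.predict(continuous_data), index=continuous_data.index)
print(pd.crosstab(df['iris_flower_type'], df['continuous_pred']))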
Every time I run the code, it gives me different crosstab results. For example,
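No. 1

| continuous_pred | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| setosa | 0 | 50 | 0 | 0 | 0 |
| versicolor | 23 | 0 | 23 | 4 | 0 |
| virginica | 1 | 0 | 0 | 20 | 29 |

No. 2

| continuous_pred | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| setosa | 0 | 0 | 50 | 0 | 0 |
| versicolor | 24 | 16 | 0 | 10 | 0 |
| virginica | 15 | 0 | 0 | 0 | 35 |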
If my understanding is correct, an LCA model with fixed hyperparameters that always converges should give the same crosstab results after shuffling the data. If the model does not converge, however, shuffling can change the results.
I tried n_components = 2 and 3, and shuffling did not change the results. Once I changed it to 5, as the example above shows, the results changed. Am I correct?
Looking at your previous results, the clusterings still look good. Each cluster captures a class (or a part of it if you have more clusters than classes).
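One way to tell whether two runs differ only in the order of the cluster labels is to compare the columns of the two crosstabs while ignoring their order. The sketch below is plain Python, with counts copied from the two runs reported above. The helper names `column_profiles` and `same_up_to_relabeling` and the `run_1_relabeled` table are made up for illustration. Matching columns are a necessary condition for identical clusterings, not a proof, since a crosstab only shows counts per class.

```python
def column_profiles(crosstab):
    """Return the columns of a crosstab (a list of rows) as a sorted list of tuples."""
    return sorted(zip(*crosstab))


def same_up_to_relabeling(tab_a, tab_b):
    """True if the columns of tab_b are just a reordering of the columns of tab_a."""
    return column_profiles(tab_a) == column_profiles(tab_b)


# Counts copied from the two runs in this thread
# (rows: the three iris classes, columns: predicted cluster 0..4).
run_1 = [
    [0, 50, 0, 0, 0],
    [23, 0, 23, 4, 0],
    [1, 0, 0, 20, 29],
]
run_2 = [
    [0, 0, 50, 0, 0],
    [24, 16, 0, 10, 0],
    [15, 0, 0, 0, 35],
]

# Hypothetical run that only renames clusters 0 and 1 of run_1.
run_1_relabeled = [[row[1], row[0], *row[2:]] for row in run_1]

print(same_up_to_relabeling(run_1, run_1_relabeled))  # True
print(same_up_to_relabeling(run_1, run_2))            # False
```

For the two runs above the check returns False, so the difference is more than a reordering of the labels: the second and third classes are split differently across clusters.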
It's also possible that this is caused by numerical issues. For example, the sum of an ndarray may actually vary slightly if you shuffle the elements due to the summing order. See the following program:
import numpy as np
np.random.seed(123)
a = np.random.random(100)
b = np.copy(a)
np.random.shuffle(b)
sum_a = np.sum(a)
sum_b = np.sum(b)
print(sum_a)
print(sum_b)
print(sum_a == sum_b)
Output:
50.14288800514812
50.142888005148116
False
Given the numerous sums and means taken in the StepMix estimation, those small differences can compound over time and could potentially explain what we're seeing here. I'm not sure and would be interested in seeing how other libraries behave.
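The same order effect can be shown without NumPy on three values. This is a small sketch, not a description of what StepMix does internally: `naive_sum` is a hypothetical helper that accumulates left to right, and `math.fsum` from the standard library is used only as a reference that returns the correctly rounded sum regardless of order.

```python
import math


def naive_sum(values):
    """Accumulate left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]
reordered = [0.3, 0.2, 0.1]  # same elements, different order

print(naive_sum(values))     # 0.6000000000000001
print(naive_sum(reordered))  # 0.6
print(naive_sum(values) == naive_sum(reordered))  # False

# math.fsum tracks the exact sum internally, so the order does not matter.
print(math.fsum(values) == math.fsum(reordered))  # True
```

An explicit loop is used instead of the built-in `sum` because recent Python versions apply compensated summation to floats in `sum`, which can hide the effect.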
Closing. Feel free to reopen if you still want to discuss.
Hi,
I recently ran a series of experiments and found that the results changed when I shuffled the data (other settings same).
I expected LCA to give the same results, but could shuffling the data change the convergence? Am I right? If we want the same results, we may need to change the parameters of the LCA model.