Labo-Lacourse / stepmix

A Python package following the scikit-learn API for model-based clustering and generalized mixture modeling (latent class/profile analysis) of continuous and categorical data. StepMix handles missing values through Full Information Maximum Likelihood (FIML) and provides multiple stepwise Expectation-Maximization (EM) estimation methods.
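The FIML idea mentioned above can be illustrated with a rough sketch (my own toy illustration, not StepMix's internal implementation): for a diagonal Gaussian, the observed-data log-likelihood of a row simply sums the per-feature log-densities over the non-missing entries, so no rows are dropped and no values are imputed.

```python
import numpy as np

def fiml_row_loglik(x, mean, var):
    """Log-likelihood of one observation under a diagonal Gaussian,
    using only the observed (non-NaN) entries -- the FIML principle."""
    obs = ~np.isnan(x)  # mask of observed features
    d = x[obs] - mean[obs]
    # Sum of univariate Gaussian log-densities over the observed features
    return -0.5 * np.sum(np.log(2 * np.pi * var[obs]) + d ** 2 / var[obs])

x = np.array([1.0, np.nan, 2.0])   # second feature is missing
mean = np.zeros(3)
var = np.ones(3)
ll = fiml_row_loglik(x, mean, var)
```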
https://stepmix.readthedocs.io/en/latest/index.html
MIT License

Issue with n_init values while using bootstrap() #24

Closed FelixLaliberte closed 1 year ago

FelixLaliberte commented 1 year ago

Hi,

When using the bootstrap() function, can we use the n_init argument? It seems to work with n_init=1 and n_init=2. Strangely, with n_init=3 (or more), these warnings are displayed:

```
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/stepmix/emission/gaussian.py:285: RuntimeWarning: divide by zero encountered in divide
  precision_c = 1 / self.parameters["covariances"][c]
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/stepmix/emission/gaussian.py:286: RuntimeWarning: invalid value encountered in multiply
  ll_diff = ((diff ** 2) * precision_c.reshape(1, -1)).sum(axis=1)
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/stepmix/emission/gaussian.py:294: RuntimeWarning: divide by zero encountered in log
  log_dets = np.log(pi_cov_c).sum(axis=1)
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/stepmix/emission/gaussian.py:296: RuntimeWarning: invalid value encountered in add
  log_eps[:, c] = -0.5 * (log_dets + ll_diff)
```
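For reference (my reading of the warnings, not a confirmed diagnosis): all four messages follow from a fitted covariance collapsing to zero, since the Gaussian emission computes `1 / covariance`. A minimal numpy sketch of the same arithmetic (warnings suppressed here with `errstate`):

```python
import numpy as np

cov = np.array([1.0, 0.0])  # one fitted covariance has collapsed to zero
with np.errstate(divide='ignore', invalid='ignore'):
    precision = 1.0 / cov    # inf      -> "divide by zero encountered in divide"
    log_det = np.log(cov)    # -inf     -> "divide by zero encountered in log"
    ll = 0.0 * precision     # 0 * inf = nan -> "invalid value encountered in multiply"
```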

Here is an example with the Iris dataset:

```python
import pandas as pd
import numpy as np
from stepmix.stepmix import StepMix
from stepmix.utils import get_mixed_descriptor
from stepmix.bootstrap import bootstrap
from sklearn.datasets import load_iris
from sklearn.metrics import rand_score
import matplotlib.pyplot as plt

data, target = load_iris(return_X_y=True, as_frame=True)

for c in data:
    c_categorical = c.replace("cm", "cat")
    data[c_categorical] = pd.qcut(data[c], q=3).cat.codes
    data[c_binary] = pd.qcut(data[c], q=2).cat.codes

mm_data, mm_descriptor = get_mixed_descriptor(
    dataframe=data,
    continuous_nan=['sepal length (cm)', 'sepal width (cm)'],
    binary_nan=['sepal length (binary)', 'sepal width (binary)'],
    categorical_nan=['sepal length (cat)', 'sepal width (cat)'],
)

sm_data, sm_descriptor = get_mixed_descriptor(
    dataframe=data,
    categorical_nan=['petal length (cat)', 'petal width (cat)'],
)

model = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                n_init=2, random_state=123)
model.fit(mm_data, sm_data)
model, bootstrapped_params = bootstrap(model, mm_data, sm_data, n_repetitions=100)  # ok

model2 = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                 n_init=3, random_state=123)
model2.fit(mm_data, sm_data)
model2, bootstrapped_params2 = bootstrap(model2, mm_data, sm_data, n_repetitions=100)  # not ok

model3 = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                 random_state=123)
model3.fit(mm_data, sm_data)
model3, bootstrapped_params3 = bootstrap(model3, mm_data, sm_data, n_repetitions=100)  # ok

model4 = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                 n_init=1, random_state=123)
model4.fit(mm_data, sm_data)
model4, bootstrapped_params4 = bootstrap(model4, mm_data, sm_data, n_repetitions=100)  # ok

model5 = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                 n_init=100, random_state=123)
model5.fit(mm_data, sm_data)
model5, bootstrapped_params5 = bootstrap(model5, mm_data, sm_data, n_repetitions=100)  # not ok

model6 = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                 n_init=3, random_state=123)
model6.fit(mm_data, sm_data)
model6, bootstrapped_params6 = bootstrap(model6, mm_data, sm_data, n_repetitions=1000)  # also not ok at n_repetitions=10000
```

sachaMorin commented 1 year ago

I cannot reproduce. I had to change your example for it to run: a line was missing in the binary data definition. The following runs fine here.

```python
import pandas as pd
from stepmix.stepmix import StepMix
from stepmix.utils import get_mixed_descriptor
from stepmix.bootstrap import bootstrap
from sklearn.datasets import load_iris

data, target = load_iris(return_X_y=True, as_frame=True)

for c in data:
    c_categorical = c.replace("cm", "cat")
    c_binary = c.replace("cm", "binary")  # This was missing in the original code
    data[c_categorical] = pd.qcut(data[c], q=3).cat.codes
    data[c_binary] = pd.qcut(data[c], q=2).cat.codes

mm_data, mm_descriptor = get_mixed_descriptor(
    dataframe=data,
    continuous_nan=['sepal length (cm)', 'sepal width (cm)'],
    binary_nan=['sepal length (binary)', 'sepal width (binary)'],
    categorical_nan=['sepal length (cat)', 'sepal width (cat)'],
)

sm_data, sm_descriptor = get_mixed_descriptor(
    dataframe=data,
    categorical_nan=['petal length (cat)', 'petal width (cat)'],
)

# Model 2 is not okay according to report, but okay here
print("Model 2...")
for random_state in range(10):  # Try 10 random states
    print(f"random_state {random_state}")
    model2 = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                     n_init=3, random_state=random_state)
    model2.fit(mm_data, sm_data)
    model2, bootstrapped_params2 = bootstrap(model2, mm_data, sm_data, n_repetitions=100)
```

Edit: Fixed some typos and removed model 5 to focus the discussion.

FelixLaliberte commented 1 year ago

The missing line of code was due to a copy/paste error from a notebook to GitHub. Sorry about that.

The problem was that the Iris dataset had too much missing data. With less missing data, the previous example (and yours) works fine:

```python
import pandas as pd
import numpy as np
from stepmix.stepmix import StepMix
from stepmix.utils import get_mixed_descriptor
from stepmix.bootstrap import bootstrap
from sklearn.datasets import load_iris

data, target = load_iris(return_X_y=True, as_frame=True)

for c in data:
    c_categorical = c.replace("cm", "cat")
    data[c_categorical] = pd.qcut(data[c], q=3).cat.codes
    c_binary = c.replace("cm", "binary")
    data[c_binary] = pd.qcut(data[c], q=2).cat.codes

for i, c in enumerate(data.columns):
    data[c] = data[c].sample(frac=.9, random_state=42*i)  # frac=.5 is changed to frac=.9

mm_data, mm_descriptor = get_mixed_descriptor(
    dataframe=data,
    continuous_nan=['sepal length (cm)', 'sepal width (cm)'],
    binary_nan=['sepal length (binary)', 'sepal width (binary)'],
    categorical_nan=['sepal length (cat)', 'sepal width (cat)'],
)

sm_data, sm_descriptor = get_mixed_descriptor(
    dataframe=data,
    categorical_nan=['petal length (cat)', 'petal width (cat)'],
)
```

With n_init=10:

```python
model = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                n_init=10, random_state=123)
model.fit(mm_data, sm_data)

model, bootstrapped_params = bootstrap(model, mm_data, sm_data, n_repetitions=100)
```

Thank you for your help!

FelixLaliberte commented 1 year ago

I have a follow-up question.

When n_init is used, how can we select the "best" model from the multiple runs/initializations (n_init) so that only this one is used during bootstrapping? When running the following code, it seems that each of the 500 models (n_init=500) is bootstrapped. Is this the case?

```python
import pandas as pd
import numpy as np
from stepmix.stepmix import StepMix
from stepmix.utils import get_mixed_descriptor
from stepmix.bootstrap import bootstrap
from sklearn.datasets import load_iris

data, target = load_iris(return_X_y=True, as_frame=True)

for c in data:
    c_categorical = c.replace("cm", "cat")
    data[c_categorical] = pd.qcut(data[c], q=3).cat.codes
    c_binary = c.replace("cm", "binary")
    data[c_binary] = pd.qcut(data[c], q=2).cat.codes

for i, c in enumerate(data.columns):
    data[c] = data[c].sample(frac=.9, random_state=42*i)

mm_data, mm_descriptor = get_mixed_descriptor(
    dataframe=data,
    continuous_nan=['sepal length (cm)', 'sepal width (cm)'],
    binary_nan=['sepal length (binary)', 'sepal width (binary)'],
    categorical_nan=['sepal length (cat)', 'sepal width (cat)'],
)

sm_data, sm_descriptor = get_mixed_descriptor(
    dataframe=data,
    categorical_nan=['petal length (cat)', 'petal width (cat)'],
)

model = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                n_init=500, random_state=123, verbose=1)
model.fit(mm_data, sm_data)

model, bootstrapped_params = bootstrap(model, mm_data, sm_data, n_repetitions=10)
```

sachaMorin commented 1 year ago

As detailed in the bootstrap documentation, bootstrap first fits the estimator on the original data, then runs the repetitions. This is why you first see the progress bar for n_init and then for the bootstrap repetitions. This means you do not need to call the fit method when using the bootstrap function.
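For readers unfamiliar with the pattern, the nonparametric bootstrap loop described above (resample rows with replacement, refit, collect the estimates) can be sketched in plain numpy; this toy uses a sample mean rather than a mixture model:

```python
import numpy as np

rng = np.random.default_rng(123)
data = rng.normal(loc=5.0, scale=2.0, size=200)

n_repetitions = 100
boot_means = np.empty(n_repetitions)
for b in range(n_repetitions):
    # Resample the rows with replacement, then re-estimate the parameter
    sample = rng.choice(data, size=data.shape[0], replace=True)
    boot_means[b] = sample.mean()

# The spread of the bootstrapped estimates approximates the standard error
se = boot_means.std(ddof=1)
```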

When you set n_init > 1, the model always keeps the parameters from the best run. This is also true for the repetitions in the bootstrap loop.
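The n_init behaviour (run EM from several random initializations and keep only the run with the highest log-likelihood) can be sketched as follows; this is a toy 1D two-component Gaussian mixture EM of my own, not StepMix's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1D data from two well-separated Gaussians
x = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])

def em_gmm_1d(x, n_iter=50, seed=0):
    """Tiny EM for a 2-component 1D Gaussian mixture.
    Returns the log-likelihood of the last E-step and the fitted means."""
    r = np.random.default_rng(seed)
    mu = r.choice(x, size=2, replace=False)  # random initialization
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities under the current parameters
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    loglik = np.log(dens.sum(axis=1)).sum()
    return loglik, mu

# n_init-style loop: run EM several times, keep only the best run
n_init = 5
runs = [em_gmm_1d(x, seed=s) for s in range(n_init)]
best_loglik, best_mu = max(runs, key=lambda run: run[0])
```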

sachaMorin commented 1 year ago

I'm considering not fitting the main estimator in the bootstrap function and instead requiring the user to call fit before using bootstrap (like you did). A user may need to call bootstrap on an estimator they already fitted for other purposes, or may want to bootstrap using different data.

sachaMorin commented 1 year ago

Moved the bootstrap idea to #36

sachaMorin commented 1 year ago

@FelixLaliberte did this answer your question? Feel free to close this issue if it did. Thanks,

FelixLaliberte commented 1 year ago

Yes it did.

Thanks,