I cannot reproduce. I had to change your example for it to run: a line was missing in the binary data definition. The following runs fine here.
import pandas as pd
from stepmix.stepmix import StepMix
from stepmix.utils import get_mixed_descriptor
from stepmix.bootstrap import bootstrap
from sklearn.datasets import load_iris
data, target = load_iris(return_X_y=True, as_frame=True)
for c in data:
    c_categorical = c.replace("cm", "cat")
    c_binary = c.replace("cm", "binary")  # This was missing in the original code
    data[c_categorical] = pd.qcut(data[c], q=3).cat.codes
    data[c_binary] = pd.qcut(data[c], q=2).cat.codes
mm_data, mm_descriptor = get_mixed_descriptor(
    dataframe=data,
    continuous_nan=['sepal length (cm)', 'sepal width (cm)'],
    binary_nan=['sepal length (binary)', 'sepal width (binary)'],
    categorical_nan=['sepal length (cat)', 'sepal width (cat)'],
)
sm_data, sm_descriptor = get_mixed_descriptor(
    dataframe=data,
    categorical_nan=['petal length (cat)', 'petal width (cat)'],
)
# Model 2 is not okay according to the report, but okay here
print("Model 2...")
for random_state in range(10):  # Try 10 random states
    print(f"random_state {random_state}")
    model2 = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                     n_init=3, random_state=random_state)
    model2.fit(mm_data, sm_data)
    model2, bootstrapped_params2 = bootstrap(model2, mm_data, sm_data, n_repetitions=100)
Edit: Fixed some typos and removed model 5 to focus the discussion.
The missing line of code is due to a copy/paste error from a notebook to GitHub. Sorry about that.
The problem was that the Iris dataset had too much missing data. With less missing data, the previous example (and yours) works fine:
import pandas as pd
import numpy as np
from stepmix.stepmix import StepMix
from stepmix.utils import get_mixed_descriptor
from stepmix.bootstrap import bootstrap
from sklearn.datasets import load_iris

data, target = load_iris(return_X_y=True, as_frame=True)

for c in data:
    c_categorical = c.replace("cm", "cat")
    data[c_categorical] = pd.qcut(data[c], q=3).cat.codes
    c_binary = c.replace("cm", "binary")
    data[c_binary] = pd.qcut(data[c], q=2).cat.codes

for i, c in enumerate(data.columns):
    data[c] = data[c].sample(frac=.9, random_state=42*i)  # frac=.5 is changed to frac=.9

mm_data, mm_descriptor = get_mixed_descriptor(
    dataframe=data,
    continuous_nan=['sepal length (cm)', 'sepal width (cm)'],
    binary_nan=['sepal length (binary)', 'sepal width (binary)'],
    categorical_nan=['sepal length (cat)', 'sepal width (cat)'],
)

sm_data, sm_descriptor = get_mixed_descriptor(
    dataframe=data,
    categorical_nan=['petal length (cat)', 'petal width (cat)'],
)

model = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                n_init=10, random_state=123)
model.fit(mm_data, sm_data)
model, bootstrapped_params = bootstrap(model, mm_data, sm_data, n_repetitions=100)
Thank you for your help!
I have a follow-up question to the previous one.
When n_init is used, how can we select the "best" model from the multiple runs/initializations (n_init) so that only that one is used during bootstrapping? When running the following code, it seems that each of the 500 models (n_init=500) is bootstrapped. Is this the case?
import pandas as pd
import numpy as np
from stepmix.stepmix import StepMix
from stepmix.utils import get_mixed_descriptor
from stepmix.bootstrap import bootstrap
from sklearn.datasets import load_iris

data, target = load_iris(return_X_y=True, as_frame=True)

for c in data:
    c_categorical = c.replace("cm", "cat")
    data[c_categorical] = pd.qcut(data[c], q=3).cat.codes
    c_binary = c.replace("cm", "binary")
    data[c_binary] = pd.qcut(data[c], q=2).cat.codes

for i, c in enumerate(data.columns):
    data[c] = data[c].sample(frac=.9, random_state=42*i)

mm_data, mm_descriptor = get_mixed_descriptor(
    dataframe=data,
    continuous_nan=['sepal length (cm)', 'sepal width (cm)'],
    binary_nan=['sepal length (binary)', 'sepal width (binary)'],
    categorical_nan=['sepal length (cat)', 'sepal width (cat)'],
)

sm_data, sm_descriptor = get_mixed_descriptor(
    dataframe=data,
    categorical_nan=['petal length (cat)', 'petal width (cat)'],
)

model = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                n_init=500, random_state=123, verbose=1)
model.fit(mm_data, sm_data)
model, bootstrapped_params = bootstrap(model, mm_data, sm_data, n_repetitions=10)
As detailed in the bootstrap documentation, bootstrap first fits the estimator on the original data, then runs the repetitions. This is why you first see the progress bar for n_init and then the one for the bootstrap repetitions. This means you do not need to call the fit method when using the bootstrap function.
When you set n_init > 1, the model keeps the parameters of the best run (the initialization with the highest log-likelihood). This is also true for the repetitions in the bootstrap loop.
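For illustration, here is a minimal sketch of what this means in practice (reusing mm_data, mm_descriptor, sm_data and sm_descriptor from the examples above); the estimator is passed to bootstrap without a prior fit call, since bootstrap fits it on the original data before running the repetitions:

model = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                n_init=10, random_state=123)
# No explicit model.fit(...) needed: bootstrap first fits the estimator
# (keeping the best of the 10 initializations), then runs 100 repetitions.
model, bootstrapped_params = bootstrap(model, mm_data, sm_data, n_repetitions=100)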
I'm considering not fitting the main estimator in the bootstrap function and instead requiring the user to call fit before using bootstrap (like you did). A user may need to call bootstrap on an estimator they have already fitted for other purposes, or may want to bootstrap using different data.
Moved the bootstrap idea to #36
@FelixLaliberte did this answer your question? Feel free to close this issue if it did. Thanks,
Yes it did.
Thanks,
Hi,
When using the bootstrap() function, can we use the n_init argument? It seems to work with n_init=1 and n_init=2. Strangely, with n_init=3 (or more), the following error messages are displayed:
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/stepmix/emission/gaussian.py:285: RuntimeWarning: divide by zero encountered in divide
  precision_c = 1 / self.parameters["covariances"][c]
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/stepmix/emission/gaussian.py:286: RuntimeWarning: invalid value encountered in multiply
  ll_diff = ((diff**2) * precision_c.reshape(1, -1)).sum(axis=1)
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/stepmix/emission/gaussian.py:294: RuntimeWarning: divide by zero encountered in log
  log_dets = np.log(pi_cov_c).sum(axis=1)
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/stepmix/emission/gaussian.py:296: RuntimeWarning: invalid value encountered in add
  log_eps[:, c] = -0.5 * (log_dets + ll_diff)
Here is an example with the Iris dataset:
import pandas as pd
import numpy as np
from stepmix.stepmix import StepMix
from stepmix.utils import get_mixed_descriptor
from stepmix.bootstrap import bootstrap
from sklearn.datasets import load_iris
from sklearn.metrics import rand_score
import matplotlib.pyplot as plt

data, target = load_iris(return_X_y=True, as_frame=True)

for c in data:
    c_categorical = c.replace("cm", "cat")
    data[c_categorical] = pd.qcut(data[c], q=3).cat.codes
    data[c_binary] = pd.qcut(data[c], q=2).cat.codes

mm_data, mm_descriptor = get_mixed_descriptor(
    dataframe=data,
    continuous_nan=['sepal length (cm)', 'sepal width (cm)'],
    binary_nan=['sepal length (binary)', 'sepal width (binary)'],
    categorical_nan=['sepal length (cat)', 'sepal width (cat)'],
)

sm_data, sm_descriptor = get_mixed_descriptor(
    dataframe=data,
    categorical_nan=['petal length (cat)', 'petal width (cat)'],
)

model = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                n_init=2, random_state=123)
model.fit(mm_data, sm_data)
model, bootstrapped_params = bootstrap(model, mm_data, sm_data, n_repetitions=100)  # ok

model2 = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                 n_init=3, random_state=123)
model2.fit(mm_data, sm_data)
model2, bootstrapped_params2 = bootstrap(model2, mm_data, sm_data, n_repetitions=100)  # not ok

model3 = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                 random_state=123)
model3.fit(mm_data, sm_data)
model3, bootstrapped_params3 = bootstrap(model3, mm_data, sm_data, n_repetitions=100)  # ok

model4 = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                 n_init=1, random_state=123)
model4.fit(mm_data, sm_data)
model4, bootstrapped_params4 = bootstrap(model4, mm_data, sm_data, n_repetitions=100)  # ok

model5 = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                 n_init=100, random_state=123)
model5.fit(mm_data, sm_data)
model5, bootstrapped_params5 = bootstrap(model5, mm_data, sm_data, n_repetitions=100)  # not ok

model6 = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor,
                 n_init=3, random_state=123)
model6.fit(mm_data, sm_data)
model6, bootstrapped_params6 = bootstrap(model6, mm_data, sm_data, n_repetitions=1000)  # also not ok at n_repetitions=10000