Labo-Lacourse / stepmix

A Python package following the scikit-learn API for model-based clustering and generalized mixture modeling (latent class/profile analysis) of continuous and categorical data. StepMix handles missing values through Full Information Maximum Likelihood (FIML) and provides multiple stepwise Expectation-Maximization (EM) estimation methods.
https://stepmix.readthedocs.io/en/latest/index.html
MIT License

bootstrapping error #23

Closed FelixLaliberte closed 1 year ago

FelixLaliberte commented 1 year ago

Hi,

I am new to Python and I am trying to fit a model with a categorical distal outcome that has missing values. However, I am unable to use bootstrapping, because I get this error message when calling the bootstrap() function:

KeyError: "None of [Index([109, 126, 66, 98, 17, 83, 106, 123, 57, 96,\n ...\n 50, 101, 130, 146, 46, 45, 18, 73, 62, 139],\n dtype='int64', length=150)] are in the [columns]"

Here is a simple example with the Iris dataset:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import rand_score
from stepmix.stepmix import StepMix
from stepmix.utils import get_mixed_descriptor
from stepmix.bootstrap import bootstrap

data, target = load_iris(return_X_y=True, as_frame=True)

for c in data:
    c_categorical = c.replace("cm", "cat")
    data[c_categorical] = pd.qcut(data[c], q=3).cat.codes
    c_binary = c.replace("cm", "binary")
    data[c_binary] = pd.qcut(data[c], q=2).cat.codes

for i, c in enumerate(data.columns):
    data[c] = data[c].sample(frac=.5, random_state=42*i)

mm_data, mm_descriptor = get_mixed_descriptor(
    dataframe=data,
    continuous_nan=['sepal length (cm)', 'sepal width (cm)'],
    binary_nan=['sepal length (binary)', 'sepal width (binary)'],
    categorical_nan=['sepal length (cat)', 'sepal width (cat)'],
)

sm_data, sm_descriptor = get_mixed_descriptor(
    dataframe=data,
    categorical_nan=['petal length (cat)', 'petal width (cat)'],
)

model = StepMix(n_components=3, measurement=mm_descriptor, structural=sm_descriptor, verbose=1, random_state=123)
model.fit(mm_data, sm_data)

# here is the error
model, bootstrapped_params = bootstrap(model, mm_data, sm_data, n_repetitions=1000)

sachaMorin commented 1 year ago

I can reproduce this. It seems to be an issue with the bootstrap function taking DataFrames as inputs.

I'll fix this soon, but in the meantime you can replace the last line with:

model, bootstrapped_params = bootstrap(model, mm_data.to_numpy(), sm_data.to_numpy(), n_repetitions=20)
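For context, the KeyError quoted in the report is the message pandas produces when an array of integers is used with plain square-bracket indexing on a DataFrame: the integers are looked up as column labels, not row positions. The sketch below reproduces that behavior on a toy frame and shows why converting to NumPy avoids it. It assumes, without confirmation from the StepMix source, that the resampling step indexes its inputs this way; the names `df` and `rows` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for mm_data (hypothetical): 5 rows, 2 named columns.
df = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0, 4.0], "b": [10, 11, 12, 13, 14]})

# Bootstrap-style row positions, sampled with replacement.
rng = np.random.default_rng(0)
rows = rng.choice(len(df), size=len(df), replace=True)

# On a DataFrame, df[rows] treats the integers as column labels and raises KeyError.
try:
    df[rows]
    raised = False
except KeyError:
    raised = True

# Positional row selection works with .iloc, or with [] on a NumPy array.
resampled_df = df.iloc[rows]
resampled_np = df.to_numpy()[rows]
```

Both `resampled_df` and `resampled_np` hold the same resampled rows, which is consistent with the `.to_numpy()` workaround above sidestepping the error.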