Labo-Lacourse / stepmix

A Python package following the scikit-learn API for model-based clustering and generalized mixture modeling (latent class/profile analysis) of continuous and categorical data. StepMix handles missing values through Full Information Maximum Likelihood (FIML) and provides multiple stepwise Expectation-Maximization (EM) estimation methods.
https://stepmix.readthedocs.io/en/latest/index.html
MIT License

Example of complete model with covariate, outcomes and missing values #35

Closed sachaMorin closed 7 months ago

sachaMorin commented 1 year ago

We should document the most complex model StepMix can estimate. Users could start from this example and reduce it to their use case. Here's an example of a complete model on an expanded version of the Iris dataset. The data does not really make sense, but the model definition is useful. The example includes a covariate, outcomes, and missing values.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import rand_score

from stepmix.stepmix import StepMix
from stepmix.utils import get_mixed_descriptor

# Load dataset in a Dataframe
data, target = load_iris(return_X_y=True, as_frame=True)

# Create categorical and binary data based on the Iris data quantiles
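# Iterate over a snapshot of the column names, since the loop adds new columns
for c in list(data.columns):
    # Create new column names, e.g. "sepal length (cm)" -> "sepal length (cat)"
    c_categorical = c.replace("cm", "cat")
    data[c_categorical] = pd.qcut(data[c], q=3).cat.codes
    c_binary = c.replace("cm", "binary")
    data[c_binary] = pd.qcut(data[c], q=2).cat.codes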

# Create a fake covariate
data['Total length (cm)'] = data["sepal length (cm)"] + data["petal length (cm)"]

# Add missing values in all variables, except the covariate
# Replace 50% of values with missing values
for i, c in enumerate(data.columns):
    if c != 'Total length (cm)':
        data[c] = data[c].sample(frac=0.5, random_state=42 * i)

# StepMix Model
# Use Sepal features as measurements
# Use Petal features as Outcomes
# Use Total Length as a Covariate

# Measurement model definition
mm_data, mm_descriptor = get_mixed_descriptor(
    dataframe=data,
    continuous_nan=['sepal length (cm)', 'sepal width (cm)'],
    binary_nan=['sepal length (binary)', 'sepal width (binary)'],
    categorical_nan=['sepal length (cat)', 'sepal width (cat)'],
)

# Structural model definition
sm_data, sm_descriptor = get_mixed_descriptor(
    dataframe=data,
    # Covariate
    covariate=['Total length (cm)'],
    # Outcomes
    continuous_nan=['petal length (cm)', 'petal width (cm)'],
    binary_nan=['petal length (binary)', 'petal width (binary)'],
    categorical_nan=['petal length (cat)', 'petal width (cat)'],
)

# Pass descriptors to StepMix
model = StepMix(
    n_components=3,
    measurement=mm_descriptor,
    structural=sm_descriptor,
    verbose=1,
    random_state=123,
)

# Fit model
model.fit(mm_data, sm_data)

# Predict the class membership of each sample
preds = model.predict(mm_data, sm_data)

# Compare predicted classes with the true Iris species
print(f"Rand Score: {rand_score(target, preds)}")
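
As a side note on the data preparation above, here is a minimal pandas-only sketch of how the derived columns and the missing values come about. The toy values and the single-column frame are made up for illustration and are not part of the Iris data; only `pd.qcut`, `.cat.codes` and `Series.sample` are taken from the example.

```python
import pandas as pd

# Toy column standing in for one Iris feature (values are invented)
s = pd.Series([5.1, 4.9, 6.3, 5.8, 7.1, 6.0, 5.5, 6.7], name="sepal length (cm)")
df = s.to_frame()

# pd.qcut bins by quantiles; .cat.codes turns the bins into integer codes 0..q-1
df["sepal length (cat)"] = pd.qcut(df[s.name], q=3).cat.codes
df["sepal length (binary)"] = pd.qcut(df[s.name], q=2).cat.codes

# Series.sample keeps half of the rows; assigning the result back aligns on
# the index, so the rows that were not sampled become NaN
for i, c in enumerate(df.columns):
    df[c] = df[c].sample(frac=0.5, random_state=42 * i)

missing_per_column = df.isna().sum()
print(df)
print(missing_per_column)
```

Because a different `random_state` is used per column, the missing entries fall on different rows in each column, which is why the `*_nan` variants of the descriptor arguments are used in the model definition.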

sachaMorin commented 7 months ago

We now have a tutorial on this. See the end of this tutorial. I'll close this issue for now, but feel free to reopen it if something comes up.