Labo-Lacourse / stepmix

A Python package following the scikit-learn API for model-based clustering and generalized mixture modeling (latent class/profile analysis) of continuous and categorical data. StepMix handles missing values through Full Information Maximum Likelihood (FIML) and provides multiple stepwise Expectation-Maximization (EM) estimation methods.
https://stepmix.readthedocs.io/en/latest/index.html
MIT License

Covariate, Distal Outcomes, and Indicators #52

Closed by yuanjames 8 months ago

yuanjames commented 8 months ago

Hi,

Thank you for your work in building such a great Python repository!

I have recently used this repository; however, I am a bit confused regarding the three input data types.

I believe that a general Latent Class Analysis (LCA) model could potentially have three types of input data: Covariates, Distal Outcomes, and Indicators. However, I noticed that only two variables are allowed. Do you have any suggestions? If my understanding is correct, covariates and indicators are combined into an array used to estimate the parameters of the Measurement Model (MM). If we have distal outcomes (target labels), then the Structural Model (SM) is estimated again.

yuanjames commented 8 months ago

Any suggestions for distal outcomes and covariates? I see most of the examples only use X, Y (i.e., indicators with distal outcomes, or indicators with covariates). How do I distinguish them?

FelixLaliberte commented 8 months ago

Hi,

Thank you for your question.

Parameters of the Measurement Model (MM) and the Structural Model (SM) should always be differentiated. Thus, only indicators should be specified in the “measurement” argument of the StepMix function.

By default, the “structural” argument assumes that all SM variables are distal outcomes. If structural='covariate', the StepMix function assumes that all SM variables are covariates. To differentiate between covariates and distal outcomes in the structural argument, the "get_mixed_descriptor()" function must be used.
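For these two simple cases, a rough sketch with toy data could look like this (the 'binary' and 'continuous' model strings just match the data types used below; the arrays are random and only for illustration):

import numpy as np
from stepmix.stepmix import StepMix

# Toy data: binary indicators, continuous distal outcomes, one continuous covariate
rng = np.random.default_rng(42)
indicators = rng.integers(0, 2, size=(100, 4))
distal_outcomes = rng.normal(size=(100, 2))
covariates = rng.normal(size=(100, 1))

# Default behavior: all structural variables are treated as distal outcomes
model_distal = StepMix(n_components=2, measurement='binary', structural='continuous',
                       random_state=42)
model_distal.fit(indicators, distal_outcomes)

# structural='covariate': all structural variables are treated as covariates
model_covariate = StepMix(n_components=2, measurement='binary', structural='covariate',
                          random_state=42)
model_covariate.fit(indicators, covariates)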

The get_mixed_descriptor() function can also be used when the MM and/or SM variables (distal outcomes) have different distributions. Categorical covariates must be one-hot encoded.
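For instance, a categorical covariate could be one-hot encoded with pandas before listing the resulting columns as covariates (the 'region' column here is just a made-up example):

import pandas as pd

# Hypothetical DataFrame with a categorical covariate
df = pd.DataFrame({'region': ['north', 'south', 'east', 'north']})

# One-hot encode the categorical covariate; the resulting columns can then be
# listed under the covariate entry of get_mixed_descriptor()
df = pd.get_dummies(df, columns=['region'], drop_first=True, dtype=int)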

Here is a small example using the Iris dataset. The data does not really make any sense, but the model definition is useful. The example includes continuous, binary, and categorical indicators and distal outcomes with missing values, plus a continuous covariate.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import rand_score

from stepmix.stepmix import StepMix
from stepmix.utils import get_mixed_descriptor

# First, build an example dataset
# Load the Iris data in a DataFrame
data, target = load_iris(return_X_y=True, as_frame=True)

# Create categorical and binary data based on the Iris data quantiles
for c in data:
    # Create new column names
    c_categorical = c.replace("cm", "cat")
    data[c_categorical] = pd.qcut(data[c], q=3).cat.codes
    c_binary = c.replace("cm", "binary")
    data[c_binary] = pd.qcut(data[c], q=2).cat.codes

# Create a fake covariate
data['Total length (cm)'] = data["sepal length (cm)"] + data["petal length (cm)"]

# Add missing values in all variables, except the covariate
# Replace 50% of values with missing values
for i, c in enumerate(data.columns):
    if c != 'Total length (cm)':
        data[c] = data[c].sample(frac=.5, random_state=42*i)

# StepMix Model
# Use Sepal features as measurements
# Use Total Length as a Covariate
# Use Petal features as Distal outcomes

# Measurement model definition
mm_data, mm_descriptor = get_mixed_descriptor(
    dataframe=data,
    continuous_nan=['sepal length (cm)', 'sepal width (cm)'],
    binary_nan=['sepal length (binary)', 'sepal width (binary)'],
    categorical_nan=['sepal length (cat)', 'sepal width (cat)'],
)

# Structural model definition
sm_data, sm_descriptor = get_mixed_descriptor(
    dataframe=data,
    # Covariate
    covariate=['Total length (cm)'],
    # Distal outcomes
    continuous_nan=['petal length (cm)', 'petal width (cm)'],
    binary_nan=['petal length (binary)', 'petal width (binary)'],
    categorical_nan=['petal length (cat)', 'petal width (cat)'],
)

# Pass descriptors to StepMix
model = StepMix(
    n_components=3,
    measurement=mm_descriptor,
    structural=sm_descriptor,
    verbose=1,
    random_state=123,
)

# Fit model
model.fit(mm_data, sm_data)
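If needed, the class memberships can then be predicted and compared with the true Iris species using the rand_score import from above (a rough sketch, assuming the scikit-learn-style predict method):

# Predict class memberships and compare them to the true Iris species
preds = model.predict(mm_data, sm_data)
print(rand_score(target, preds))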

Does this answer your questions?

yuanjames commented 8 months ago

Hi,

@FelixLaliberte Thanks for your reply. In fact, your answer is perfect! I have read your paper and gone through most of the APIs. I just wanted to thank you again and will definitely cite your paper.

BTW, if my understanding is correct, I think distal outcome analysis and covariate analysis both seem very similar to regression analysis (i.e., correlation analysis; I am from the machine learning side and not very familiar with statistical terms). The main difference between them is that X and Y may be swapped. The types of distributions used for both analyses depend on the data type.

In addition, as you mentioned, the measurement model is indeed fitted by indicators. However, I believe that the measurement model indirectly affects the estimation of the structural model in One-Step LCA, is that correct? I noticed in the pseudocode that, in each iteration, the measurement model is estimated first, followed by the estimation of the structural model. The updated measurement model in one iteration will affect the structural model in the next iteration, especially as class membership is updated.

sachaMorin commented 8 months ago

A model with distal outcomes is similar to regression analysis insofar as both can give you a "supervised" model to predict Y from X (at least in theory; this is not currently implemented). The model assumptions are different, however. In regression, the model is typically linear, while here the relationship between X and Y is mediated by some latent class Z. In clustering analysis, we are typically more interested in the latent class than in predicting Y.
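Concretely, with a distal outcome the model assumes X and Y are conditionally independent given the latent class Z, so the implied predictive distribution is

P(Y | X) = Σ_c P(Z = c | X) P(Y | Z = c),

meaning any relationship between X and Y passes through the class memberships.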

You are correct that in one-step estimation, there is some interaction between the measurement parameters and the structural parameters. This "leak" is what prompted the research around stepwise estimation methods to insulate the measurement estimation from the structural estimation.

Edit: To clarify, the structural model will always be impacted by the measurement model through the class memberships, even with stepwise methods. Stepwise methods aim to block the other direction: your measurement model parameters should not depend on the structural data.
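For reference, here is a rough sketch of requesting stepwise estimation via the n_steps and correction arguments (random data, not a meaningful model):

import numpy as np
from stepmix.stepmix import StepMix

# Toy data: 6 binary indicators and one continuous distal outcome
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 6))
Y = rng.normal(size=(300, 1))

# Three-step estimation: the measurement model is fitted on X alone, classes are
# assigned, and only then is the structural model estimated, so Y cannot
# influence the measurement parameters
model = StepMix(n_components=2, measurement='binary', structural='continuous',
                n_steps=3,          # 1-, 2- or 3-step estimation
                correction=None,    # optionally 'BCH' or 'ML' for bias-adjusted 3-step
                random_state=123)
model.fit(X, Y)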

yuanjames commented 8 months ago

Thanks for your explanation @sachaMorin ! Yes, I agree with you, but just to share, regression models can also be non-linear, for example neural networks. If we follow the stepwise approach, I thought that the latent classes are fixed once the MM is estimated. Then the SM always acts as an 'approximator' for either distal outcome analysis or covariate analysis, between the latent classes and the external variables.

Edit: The approximator only involves the latent classes (Z) and the external variables (Y). However, I believe it is related to the indicators X in statistical terms, since we consider conditional probabilities. PS: Sorry for my previous imprecise wording; I mentioned X and Y, where X represents the latent classes and Y represents the external variables. I think the difference between distal outcome analysis and covariate analysis is just about the positions of X and Y, i.e., different assumptions.

sachaMorin commented 8 months ago

Closing. Feel free to reopen if you have other questions!