Labo-Lacourse / stepmix

A Python package following the scikit-learn API for model-based clustering and generalized mixture modeling (latent class/profile analysis) of continuous and categorical data. StepMix handles missing values through Full Information Maximum Likelihood (FIML) and provides multiple stepwise Expectation-Maximization (EM) estimation methods.
https://stepmix.readthedocs.io/en/latest/index.html
MIT License

1-step and 3-step estimators with covariates have different measurement models #43

Closed FelixLaliberte closed 1 year ago

FelixLaliberte commented 1 year ago

Hi,

It seems there is an issue with models estimated with covariates: their measurement models differ from those of models estimated without covariates. Here is a simple example:

import pandas as pd
import numpy as np
from stepmix.stepmix import StepMix
from sklearn.datasets import load_iris

# Load the iris data and add a binarized (median-split) copy of each feature
data, target = load_iris(return_X_y=True, as_frame=True)

for c in data:
    c_binary = c.replace("cm", "binary")
    data[c_binary] = pd.qcut(data[c], q=2).cat.codes

# Measurement data: three binary indicators
X = pd.DataFrame(data, columns=['sepal width (binary)',
                                'petal length (binary)',
                                'petal width (binary)'])

# Covariate
sepal_length_Binary = data['sepal length (binary)']

# Model 1: measurement model only
model1 = StepMix(n_components=3,
                 measurement='binary',
                 random_state=123,
                 verbose=1)

model1.fit(X)

# Model 2: same measurement model, plus a covariate structural model
# estimated with the 3-step approach
model2 = StepMix(n_components=3,
                 measurement='binary',
                 structural='covariate',
                 n_steps=3,
                 random_state=123,
                 verbose=1)

model2.fit(X, sepal_length_Binary)

Thank you!

sachaMorin commented 1 year ago

I recommend using Python markdown for code! It helps with readability. Like this:

print("Python is great")

I edited your question to include this.

sachaMorin commented 1 year ago

Back to your issue, can you provide the code you used to conclude that the models were different? You inspected the measurement parameters?
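
For example, something along these lines would let us compare them directly (a sketch; I'm assuming get_parameters() returns a dict with a 'measurement' entry, which may differ depending on the StepMix version):

import pprint

# Fitted parameters of both models (assumed dict structure)
params1 = model1.get_parameters()
params2 = model2.get_parameters()

# Compare the measurement parts side by side
pprint.pprint(params1['measurement'])
pprint.pprint(params2['measurement'])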

sachaMorin commented 1 year ago

Sorry for the confusion. I guess you looked at the print statements of the verbose output and saw differences in the measurement models. I can reproduce and will look into this.

sachaMorin commented 1 year ago

I think I understand what's up. After testing, this only happens if the structural model is 'covariate'; 'gaussian_unit', for example, behaves as expected.

The issue is a consequence of the likelihood we optimize for. In the second model, StepMix detects the presence of a covariate and switches to the conditional likelihood perspective for the entire estimation. See page 4 in the preprint. This means that the marginal over latent classes P(X) is omitted from likelihood computations, including during the first step where we fit the measurement model during 3-step estimation. Here we only maximize P(Y|X).

This explains the difference with your first model without covariates, which uses the generative perspective and therefore includes P(X) to maximize the joint P(X, Y), leading to different final parameters.
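
To make the gaussian_unit comparison concrete, here is the kind of check I mean (a sketch; the binary covariate column is just reused, cast to float, as a stand-in for a continuous structural variable):

# Same setup as model2, but with a gaussian_unit structural model
model3 = StepMix(n_components=3,
                 measurement='binary',
                 structural='gaussian_unit',
                 n_steps=3,
                 random_state=123,
                 verbose=1)

model3.fit(X, sepal_length_Binary.astype(float))

With this setup, the verbose measurement output of model3 matches model1's, while model2's does not, which is what I meant by "behaves as expected" above.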

sachaMorin commented 1 year ago

I agree this is a bit strange, but I'm not sure it's a bug. Let's see what @robinlegault thinks

sachaMorin commented 1 year ago

The question boils down to this. During 3-step estimation with covariates, should the first step use the conditional likelihood P(Y|X) or the joint likelihood P(X, Y)=P(Y|X)P(X)?
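
Writing it in log form (summed over samples) makes the gap explicit:

log P(X, Y) = log P(Y|X) + log P(X)

so maximizing the conditional likelihood alone drops the log P(X) term, which is exactly where the measurement estimates end up differing.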

sachaMorin commented 1 year ago

Reviewing the paper, it seems we should always be using P(X,Y) during the first step, regardless of covariates. I would appreciate someone else's input here.

robinlegault commented 1 year ago

I agree we should always use P(X,Y) during the first step of stepwise approaches, as we want to keep the estimators of the measurement model (MM) parameters independent of the structural model (SM), including the covariates.

FelixLaliberte commented 1 year ago

I had indeed looked at the class prevalences and conditional probabilities in the verbose output.

Thank you for the information you have provided. The issue has indeed been resolved.