Labo-Lacourse / stepmix

A Python package following the scikit-learn API for model-based clustering and generalized mixture modeling (latent class/profile analysis) of continuous and categorical data. StepMix handles missing values through Full Information Maximum Likelihood (FIML) and provides multiple stepwise Expectation-Maximization (EM) estimation methods.
https://stepmix.readthedocs.io/en/latest/index.html
MIT License

Shape mismatch when predicting new data #19

Closed: sachaMorin closed this issue 1 year ago

sachaMorin commented 1 year ago

The current categorical model dynamically determines the number of outcomes to use for one-hot encoding based on the input integer-encoded data. Following the fix to issue #17, I found that this can lead to errors when the test data does not include samples from all categories, yielding a one-hot encoding of the wrong shape.

How to reproduce

import numpy as np
from stepmix import StepMix

train = np.random.choice([0, 1, 2, 3], 100).reshape((-1, 1))
test = np.random.choice([0, 1, 3], 100).reshape((-1, 1))  # No class 2

model = StepMix(n_components=3, measurement="categorical", verbose=1, random_state=123)

model.fit(train)

preds = model.predict(test)

Expected behavior

Prediction should work even if not all categories are present in the test sample.

What happens

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 4 is different from 3)
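To illustrate the mechanism, here is a simplified numpy sketch (not the actual StepMix code) of a one-hot encoding whose width is derived from the data after 0-based re-indexing of the observed categories. Because the test sample lacks category 2, its categories get remapped to fewer codes and the encoding comes out one column too narrow for the fitted parameters:

```python
import numpy as np

def encode(array):
    # Re-index the observed categories to contiguous 0-based codes,
    # then size the one-hot encoding from the observed maximum.
    # This mimics the problematic dynamic behavior, not the StepMix API.
    _, codes = np.unique(array.ravel(), return_inverse=True)
    n_outcomes = int(codes.max() + 1)
    return np.eye(n_outcomes)[codes]

train = np.array([[0], [1], [2], [3]])
test = np.array([[0], [1], [3]])  # no class 2

print(encode(train).shape)  # (4, 4)
print(encode(test).shape)   # (3, 3): category 3 was silently remapped to 2
```

A model fitted on the 4-column training encoding then receives a 3-column test encoding, which produces exactly the matmul core-dimension mismatch above.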
MostafaAbdelrashied commented 1 year ago

Why do we even use one-hot encoding when the input data is already integer-encoded? I think the multinomial distribution can natively handle multiple classes, can't it?

Potential solution:

  def encode_features(self, X):
      if self.integer_codes:
          self._cache, nb_classes = max_one_hot(X, max_n_outcomes=self.n_outcomes)
          if nb_classes > self.n_outcomes:
              self.n_outcomes = nb_classes
          return self._cache
      else:
          return X

and change the one-hot encoding function:

def max_one_hot(array, max_n_outcomes=2):
    ...
    # Get maximal number of outcomes
    if max_n_outcomes <= 2:
        max_n_outcomes = int(array.max() + 1)
    ...
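Filling in the elided parts, a runnable sketch of this idea could look as follows. This is an illustration only, not the actual StepMix implementation; the return signature matches the `self._cache, nb_classes = max_one_hot(...)` call above:

```python
import numpy as np

def max_one_hot(array, max_n_outcomes=2):
    # Never use fewer columns than max_n_outcomes, so the encoding keeps
    # a stable width even when some categories are absent from the data
    n_outcomes = max(int(array.max() + 1), max_n_outcomes)
    # One-hot encode each column and stack the blocks horizontally
    blocks = [np.eye(n_outcomes)[array[:, c].astype(int)]
              for c in range(array.shape[1])]
    return np.hstack(blocks), n_outcomes

train = np.array([[0], [1], [2], [3]])
enc_train, n = max_one_hot(train)                    # width 4
test = np.array([[0], [1], [3]])                     # no class 2
enc_test, _ = max_one_hot(test, max_n_outcomes=n)    # still width 4
print(enc_train.shape, enc_test.shape)  # (4, 4) (3, 4)
```

Passing the number of outcomes seen at fit time as the floor guarantees that train and test encodings share the same shape.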
sachaMorin commented 1 year ago

The original categorical code used one-hot encodings, and support for integer categories was added afterward, hence the need for the encode_features step to avoid rewriting the entire class. I will adapt your fix, and I will also remove this block:

# First iterate over columns and make sure the categories are 0-indexed
for c in range(array.shape[1]):
    _, array[:, c] = np.unique(array[:, c], return_inverse=True)

and simply use the max over provided integers, which should result in more predictable behavior.
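A small numpy comparison of the two behaviors (an illustrative sketch, not StepMix code) shows why taking the max over the provided integers is more predictable than re-indexing with np.unique:

```python
import numpy as np

codes = np.array([0, 1, 3])  # category 2 not observed

# Old behavior: remap observed categories to contiguous 0-based codes,
# so the meaning of a code depends on which categories happen to appear
_, remapped = np.unique(codes, return_inverse=True)
print(remapped)  # [0 1 2]: category 3 silently becomes code 2

# New behavior: trust the integer codes as-is and size the encoding
# from their maximum, so category 3 always maps to column 3
n_outcomes = int(codes.max() + 1)
print(np.eye(n_outcomes)[codes].shape)  # (3, 4)
```

With the remapping removed, the same integer always lands in the same one-hot column regardless of which other categories are present in the sample.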