Labo-Lacourse / stepmix

A Python package following the scikit-learn API for model-based clustering and generalized mixture modeling (latent class/profile analysis) of continuous and categorical data. StepMix handles missing values through Full Information Maximum Likelihood (FIML) and provides multiple stepwise Expectation-Maximization (EM) estimation methods.
https://stepmix.readthedocs.io/en/latest/index.html
MIT License
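A minimal sketch of the scikit-learn-style workflow described above (illustrative only: the data is simulated and the parameter choices are arbitrary):

import numpy as np
from stepmix.stepmix import StepMix

# Simulated binary indicators for 500 observations
X = np.random.default_rng(0).integers(0, 2, size=(500, 6))

model = StepMix(n_components=3, measurement="bernoulli", random_state=42)
model.fit(X)
labels = model.predict(X)  # most likely latent class per observation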

ValueError: pvals < 0, pvals > 1 or pvals contains NaNs #65

Open jaanisfehling opened 4 weeks ago

jaanisfehling commented 4 weeks ago

I am getting ValueError: pvals < 0, pvals > 1 or pvals contains NaNs while running this code:

descriptor = {}
if len(num_cols) > 0:
    descriptor["cont_model"] = {
        "model": "gaussian_spherical_nan",
        "n_columns": len(num_cols),
    }
if len(cat_cols) > 0:
    descriptor["cat_model"] = {
        "model": "multinoulli_nan",
        "n_columns": len(cat_cols),
    }
if len(bool_cols) > 0:
    descriptor["bool_model"] = {
        "model": "bernoulli_nan",
        "n_columns": len(bool_cols),
    }
df = df[num_cols + cat_cols + bool_cols]  # reorder columns

model = StepMix(
    n_components=3,
    measurement=descriptor,
    progress_bar=0,
)
p_values = blrt_sweep(model, df, None, low=2, high=7, n_repetitions=1, verbose=False)
k = p_values.reset_index()["p"].idxmax() + 2

Full Traceback:

analyze.py:64: in main
    p_values = blrt_sweep(model, df, None, low=2, high=7, n_repetitions=1, verbose=False)
../../.cache/pypoetry/virtualenvs/clust-R4YYD0nC-py3.12/lib/python3.12/site-packages/stepmix/bootstrap.py:307: in blrt_sweep
    blrt(
../../.cache/pypoetry/virtualenvs/clust-R4YYD0nC-py3.12/lib/python3.12/site-packages/stepmix/bootstrap.py:243: in blrt
    _, stats_null = bootstrap(
../../.cache/pypoetry/virtualenvs/clust-R4YYD0nC-py3.12/lib/python3.12/site-packages/stepmix/bootstrap.py:140: in bootstrap
    X_rep, Y_rep, _ = sampler.sample(n_samples)
../../.cache/pypoetry/virtualenvs/clust-R4YYD0nC-py3.12/lib/python3.12/site-packages/stepmix/stepmix.py:1592: in sample
    [self._mm.sample(c, int(sample)) for c, sample in enumerate(n_samples_comp)]
../../.cache/pypoetry/virtualenvs/clust-R4YYD0nC-py3.12/lib/python3.12/site-packages/stepmix/emission/nested.py:100: in sample
    acc.append(m.sample(class_no, n_samples))
../../.cache/pypoetry/virtualenvs/clust-R4YYD0nC-py3.12/lib/python3.12/site-packages/stepmix/emission/categorical.py:232: in sample
    self.random_state.multinomial(1, feature_weights[k], size=n_samples)
numpy/random/mtrand.pyx:4354: in numpy.random.mtrand.RandomState.multinomial
    ???
_common.pyx:391: in numpy.random._common.check_array_constraint
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   ValueError: pvals < 0, pvals > 1 or pvals contains NaNs

_common.pyx:377: ValueError

So apparently it's related to the numpy random multinomial function.

I also get these runtime warnings:

lib/python3.12/site-packages/stepmix/emission/gaussian.py:360: RuntimeWarning: divide by zero encountered in divide
    precision_c = 1 / self.parameters["covariances"][c]

lib/python3.12/site-packages/stepmix/emission/gaussian.py:361: RuntimeWarning: invalid value encountered in multiply
    ll_diff = ((diff**2) * precision_c.reshape(1, -1)).sum(axis=1)

lib/python3.12/site-packages/stepmix/emission/gaussian.py:369: RuntimeWarning: divide by zero encountered in log
    log_dets = np.log(pi_cov_c).sum(axis=1)

lib/python3.12/site-packages/stepmix/emission/gaussian.py:371: RuntimeWarning: invalid value encountered in add
    log_eps[:, c] = -0.5 * (log_dets + ll_diff)

lib/python3.12/site-packages/stepmix/stepmix.py:968: ConvergenceWarning: Initializations did not converge. Try different init parameters, or increase max_iter, abs_tol, rel_tol or check for degenerate data.
    warnings.warn(
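A minimal illustration (not StepMix code) of how a collapsed, zero-variance component would produce exactly this chain of warnings and ultimately NaN parameters:

import numpy as np

cov = np.array([0.0, 1.0])             # one component's variance collapsed to 0
precision = 1 / cov                    # -> [inf, 1.], "divide by zero encountered in divide"
diff = np.zeros(2)
ll_diff = (diff**2 * precision).sum()  # 0 * inf -> nan, "invalid value encountered in multiply"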
jaanisfehling commented 4 weeks ago

It happens on this dataset: https://archive.ics.uci.edu/dataset/713/auction+verification. To debug, I changed the code in emission/categorical.py at line 224:

def sample(self, class_no, n_samples):
    pis = self.parameters["pis"].T
    n_features = self.get_n_features()
    feature_weights = pis[:, class_no].reshape(
        n_features, self.parameters["max_n_outcomes"]
    )
    if (
        np.isnan(feature_weights).any()
        or (feature_weights < 0).any()
        or (feature_weights > 1).any()
    ):
        print(feature_weights)
        raise ValueError("Probabilities must be between 0 and 1 and not NaN.")
    X = np.array(
        [
            self.random_state.multinomial(1, feature_weights[k], size=n_samples)
            for k in range(n_features)
        ]
    )
    X = np.argmax(X, axis=2)  # Convert to integers
    return X.T

I printed the feature weights, and it looks like some feature (a different one on each run) has a probability of 1 for the first outcome while the other outcomes have probabilities that are not exactly 0, so the row sums to slightly more than 1.

Example feature_weights:

[[3.54571654e-01 3.22714173e-01 3.22714173e-01 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [5.72262762e-03 2.19909888e-01 3.87183742e-01 3.87183742e-01
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [8.02896055e-02 6.15372367e-02 6.57794359e-02 6.57794359e-02
  6.57794359e-02 6.57794359e-02 6.57794359e-02 6.57794359e-02
  6.57794359e-02 6.57794359e-02 5.01022066e-02 1.31967595e-02
  2.42142443e-02 2.42142443e-02 2.42142443e-02 2.42142443e-02
  2.42142443e-02 2.42142443e-02 2.42142443e-02 2.42142443e-02
  2.42142443e-02 5.76776316e-16 5.63450065e-03 5.63450065e-03
  5.63450065e-03 5.63450065e-03 5.63450065e-03 5.63450065e-03
  5.63450065e-03 5.63450065e-03 5.63450065e-03 5.16712205e-16]
 [3.00957067e-01 2.94573371e-01 3.88412944e-10 2.01455888e-01
  5.38628585e-16 2.03013674e-01 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.00000000e+00 2.77882835e-17 2.44487462e-16 2.51879007e-16
  1.62340135e-16 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
sachaMorin commented 3 weeks ago

Thanks for reporting this.

The warnings suggest that the estimation is diverging and likely introducing NaNs in the parameters, which triggers errors in the sampling function. One way to address this is to rely on the simpler gaussian_unit model, which tends to be more stable.
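For example, swapping the continuous block of the descriptor from the first post to the unit-variance model (assuming the gaussian_unit_nan variant listed among the StepMix emission models, which keeps support for missing values):

descriptor["cont_model"] = {
    "model": "gaussian_unit_nan",  # unit variance; simpler and more stable
    "n_columns": len(num_cols),
}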

I also see that you are using the blrt_sweep function. Can you reproduce the error outside of this with a single estimator?
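Something along these lines, reusing df and the descriptor from your post (a sketch only):

single = StepMix(n_components=3, measurement=descriptor, random_state=42)
single.fit(df)
X_rep, Y_rep, labels = single.sample(500)  # sampling is the step that fails inside the bootstrap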

Regarding your second post on the sum of probabilities: are you sure this is what triggers the error? The pattern you printed is most likely floating-point round-off: the probabilities won't always sum to exactly 1. For example, I tested numpy's multinomial sampling with the distribution [1. , 1e-16, 1e-16] and it works fine on my end.
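For reference, a quick sketch of that check:

import numpy as np

rng = np.random.RandomState(0)
rng.multinomial(1, [1.0, 1e-16, 1e-16], size=5)  # samples without raising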

jaanisfehling commented 2 weeks ago

Thanks for the help. I will try different models, but in my testing gaussian_spherical performed best (maybe because it is the most robust to outliers). Unfortunately, I could not get blrt_sweep to work. Outside of that, my code worked most of the time; when it did not converge and gave really bad results, I fixed this by falling back to k-means clustering.
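A sketch of that fallback (hypothetical: it assumes a fully numeric, imputed matrix X, and that StepMix emits scikit-learn's ConvergenceWarning, which can be escalated to an exception):

import warnings

from sklearn.cluster import KMeans
from sklearn.exceptions import ConvergenceWarning

try:
    with warnings.catch_warnings():
        warnings.simplefilter("error", ConvergenceWarning)  # treat non-convergence as an error
        model.fit(X)
        labels = model.predict(X)
except ConvergenceWarning:
    # StepMix did not converge: fall back to plain k-means
    labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)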