AlexanderFabisch / gmr

Gaussian Mixture Regression
https://alexanderfabisch.github.io/gmr/
BSD 3-Clause "New" or "Revised" License

[MRG] Fixes numerical issue in GMM.condition #28

Closed · AlexanderFabisch closed this 3 years ago

AlexanderFabisch commented 3 years ago

@mralbu would be great if you could take a look at this PR.

mralbu commented 3 years ago

Will take a look as soon as I can!

mralbu commented 3 years ago

I ran the code snippet from #27 against the proposed fix, after first defining the following RegressorMixin-based wrapper class:

import numpy as np

from sklearn.base import BaseEstimator, RegressorMixin, MultiOutputMixin
from sklearn.utils import check_X_y
from sklearn.utils.validation import check_is_fitted, check_array, FLOAT_DTYPES

from gmr import GMM

class GaussianMixtureRegressor(MultiOutputMixin, RegressorMixin, BaseEstimator):

    def __init__(self, n_components, priors=None, means=None, covariances=None,
                 verbose=0, random_state=None, R_diff=1e-4, n_iter=500,
                 init_params="random", oracle_approximating_shrinkage=False):
        self.n_components = n_components
        self.priors = priors
        self.means = means
        self.covariances = covariances
        self.verbose = verbose
        self.random_state = random_state
        self.R_diff = R_diff
        self.n_iter = n_iter
        self.init_params = init_params
        self.oracle_approximating_shrinkage = oracle_approximating_shrinkage

    def fit(self, X, y):
        self.gmm_ = GMM(self.n_components, priors=self.priors, means=self.means,
                        covariances=self.covariances, verbose=self.verbose, random_state=self.random_state)

        X, y = check_X_y(X, y, estimator=self.gmm_, dtype=FLOAT_DTYPES, multi_output=True)
        if y.ndim == 1:
            y = np.expand_dims(y, 1)

        self.indices_ = np.arange(X.shape[1])

        # Fit the joint GMM on the concatenated input and output features.
        self.gmm_.from_samples(
            np.hstack((X, y)), R_diff=self.R_diff, n_iter=self.n_iter,
            init_params=self.init_params,
            oracle_approximating_shrinkage=self.oracle_approximating_shrinkage)
        return self

    def predict(self, X):
        check_is_fitted(self, ["gmm_", "indices_"])
        X = check_array(X, estimator=self.gmm_, dtype=FLOAT_DTYPES)

        # Condition the joint GMM on the input columns and return the conditional mean.
        return self.gmm_.predict(self.indices_, X)

Code snippet as in #27:

import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_validate

X, y = load_boston(return_X_y=True)

np.random.seed(42)

cross_validate(GaussianMixtureRegressor(n_components=2, oracle_approximating_shrinkage=False), X, y)
>> {'fit_time': array([0.02808738, 0.00212717, 0.00335884, 0.00367188, 0.00159121]),
>>  'score_time': array([0.06866217, 0.06366682, 0.06335163, 0.06228876, 0.06565619]),
>>  'test_score': array([-1.77181999e+03,  7.71024953e-01,  5.81598835e-01,  7.68927960e-02, -1.41784818e+02])}

np.random.seed(42)

cross_validate(GaussianMixtureRegressor(n_components=2, oracle_approximating_shrinkage=True), X, y)
>> {'fit_time': array([0.01825643, 0.0023396 , 0.03778887, 0.012429  , 0.00279951]),
>> 'score_time': array([0.06537795, 0.05936241, 0.05965519, 0.06038404, 0.0568521 ]),
>> 'test_score': array([0.32767343, 0.5339876 , 0.27381149, 0.40139363, 0.38142808])}

def is_pos_def(x):
    # all eigenvalues >= 0, i.e. positive semi-definite
    return np.all(np.linalg.eigvals(x) >= 0)

np.random.seed(42)

gmr = GaussianMixtureRegressor(n_components=2, oracle_approximating_shrinkage=False)
gmr.fit(X, y)

print([is_pos_def(sigma) for sigma in gmr.gmm_.covariances])
>> [False, True]

np.random.seed(42)

gmr = GaussianMixtureRegressor(n_components=2, oracle_approximating_shrinkage=True)
gmr.fit(X, y)

print([is_pos_def(sigma) for sigma in gmr.gmm_.covariances])
>> [True, True]

With the oracle approximating shrinkage option, all the fitted covariances were positive semi-definite, and cross-validation on this dataset did not produce np.nan or negative scores either.
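
As a rough sketch of why the shrinkage helps (using sklearn.covariance.oas directly rather than gmr's internals), blending a rank-deficient sample covariance with a scaled identity pushes its smallest eigenvalues away from zero:

import numpy as np
from sklearn.covariance import oas

# Illustration only: with fewer samples than dimensions the empirical
# covariance is rank-deficient, and rounding can make its smallest
# eigenvalues slightly negative. OAS shrinkage blends it with a scaled
# identity, which keeps the spectrum strictly positive.
rng = np.random.RandomState(42)
samples = rng.randn(10, 15)  # 10 samples in 15 dimensions

emp_cov = np.cov(samples, rowvar=False)
shrunk_cov, shrinkage = oas(samples)

print(np.linalg.eigvalsh(emp_cov).min())     # ~0, possibly slightly negative
print(np.linalg.eigvalsh(shrunk_cov).min())  # strictly positive
print(shrinkage)                             # estimated shrinkage weight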

I also noticed that, on the folds that converged in the example with oracle_approximating_shrinkage=False, the R2 metrics were different ([0.771024953, 0.581598835, 0.0768927960] for oracle_approximating_shrinkage=False and [0.5339876, 0.27381149, 0.40139363] for oracle_approximating_shrinkage=True), but maybe this is expected given that the algorithm is stochastic, right?

The same happens in the following snippet.

np.set_printoptions(precision=2)

np.random.seed(2)

scores = []
for _ in range(10):
    gmr = GaussianMixtureRegressor(n_components=2, oracle_approximating_shrinkage=True)
    gmr.fit(X, y)
    scores.append(gmr.score(X, y))
print(np.array(scores))
>> [0.62 0.62 0.64 0.64 0.64 0.64 0.59 0.64 0.62 0.64]

np.random.seed(2)

scores = []
for _ in range(10):
    gmr = GaussianMixtureRegressor(n_components=2, oracle_approximating_shrinkage=False)
    gmr.fit(X, y)
    scores.append(gmr.score(X, y))
print(np.array(scores))
>> [  0.74   0.74   0.75  -3.98   0.85 -14.96   0.69   0.74   0.83   0.74]

It seems that, when it converges, oracle_approximating_shrinkage=False might lead to better R2 metrics.

AlexanderFabisch commented 3 years ago

Thank you for your analysis.

I also noticed that, on the folds that converged in the example with oracle_approximating_shrinkage=False, the R2 metrics were different ([0.771024953, 0.581598835, 0.0768927960] for oracle_approximating_shrinkage=False and [0.5339876, 0.27381149, 0.40139363] for oracle_approximating_shrinkage=True), but maybe this is expected given that the algorithm is stochastic, right?

Yes, of course the initialization of MVNs is stochastic, but even if it wasn't I would expect similar results. I think shrinkage of the covariance has an effect similar to regularization in linear regression (not the same, although that would be interesting to investigate...).
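
To sketch the analogy: shrinkage replaces the empirical covariance by a convex combination with a scaled identity,

\hat{\Sigma}_{\mathrm{shrunk}} = (1 - \rho)\,\hat{\Sigma} + \rho\,\frac{\operatorname{tr}(\hat{\Sigma})}{d}\,I,

while ridge regression adds a scaled identity to the Gram matrix inside its solution,

\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y.

Both tame an ill-conditioned matrix by mixing in a multiple of the identity, just in different places.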

mralbu commented 3 years ago

I did some additional comparisons, using the sklearn.mixture.GaussianMixture machinery (via sklego's GMMRegressor) to fit the regressor.

from sklego.mixture import GMMRegressor

np.set_printoptions(precision=4)

np.random.seed(2)

scores = []
for _ in range(10):
    gmr = GMMRegressor(n_components=2)
    gmr.fit(X, y)
    scores.append(gmr.score(X, y))
print(np.array(scores))
>> [0.8478 0.8478 0.8478 0.8478 0.8478 0.8478 0.8478 0.8478 0.8478 0.8478]

np.random.seed(2)

scores = []
for _ in range(10):
    gmr = GMMRegressor(n_components=2, init_params='random', max_iter=100)
    gmr.fit(X, y)
    scores.append(gmr.score(X, y))
print(np.array(scores))
>> [0.8157 0.8061 0.8221 0.8152 0.8221 0.8192 0.8479 0.8282 0.8251 0.7792]

Using the internal sklearn.mixture machinery might help ease the numerical issues, though it would introduce sklearn as a hard dependency and might be out of scope. On the other hand, it would make it easy to add other regressors such as a BayesianGMMRegressor, and they would expose familiar parameters (the same as sklearn.mixture.GaussianMixture). Do you think exploring the use of the sklearn.mixture inner workings would be interesting for gmr?

AlexanderFabisch commented 3 years ago

Do you think exploring the use of sklearn.mixture inner workings would be interesting for gmr?

Yes, definitely.

I wouldn't like to introduce sklearn as a required dependency, though. You can easily initialize a GMM object from the priors, means, and covariances of an sklearn class, which should make the implementation of a BayesianGMMRegressor easy. There are actually some examples that use sklearn's Bayesian GMM.
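
For example, a minimal sketch of that bridge (assuming sklearn's BayesianGaussianMixture with the default full covariances, and X, y as in the snippets above):

import numpy as np
from sklearn.mixture import BayesianGaussianMixture
from gmr import GMM

# Fit sklearn's Bayesian GMM on the joint (X, y) samples, then hand its
# parameters to a gmr GMM so gmr's conditioning/prediction can be used.
X_y = np.hstack((X, y[:, np.newaxis]))
bgmm = BayesianGaussianMixture(n_components=2, random_state=42).fit(X_y)

gmm = GMM(n_components=2,
          priors=bgmm.weights_,
          means=bgmm.means_,
          covariances=bgmm.covariances_,
          random_state=42)

y_pred = gmm.predict(np.arange(X.shape[1]), X)  # condition on all input columns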

AlexanderFabisch commented 3 years ago

I opened a new issue here: https://github.com/AlexanderFabisch/gmr/issues/30