borchero / pycave

Traditional Machine Learning Models for Large-Scale Datasets in PyTorch.
https://pycave.borchero.com
MIT License

GMM fitting with full covariance crashes unexpectedly unlike sklearn GMM fitting #20

Open MacDaddio opened 2 years ago

MacDaddio commented 2 years ago

I was trying to fit a GMM on data and kept getting the same error (with varying numbers for the batch element and the order of the leading minor):

```
_LinAlgError: torch.linalg_cholesky: (Batch element 3): The factorization could not be completed because the input is not positive-definite (the leading minor of order 941 is not positive-definite).
```

I made a minimal working example to show when this comes up in practice. I compared to sklearn and somehow sklearn is able to avoid this problem. This issue happens both on CPU and GPU. I have PyCave 3.1.3 and sklearn 0.24.2. Do you have any idea what could be the issue?

Minimum working example:

```python
from pycave.bayes import GaussianMixture
import torch
import numpy as np
from sklearn import mixture

# Set seed
seed = 0
np.random.seed(seed)
torch.manual_seed(seed)

# Inputs
n = 5000
p = 2000
k = 10

# Make some non-Gaussian data
X = np.random.randn(n, p)
X = torch.Tensor(X)
X = torch.nn.ReLU()(X - 1)

# Fit sklearn GMM
gmm_sk = mixture.GaussianMixture(n_components=k, covariance_type='full', init_params='kmeans')
gmm_sk.fit(X.numpy())

# Fit PyCave GMM
gmm = GaussianMixture(num_components=k, covariance_type='full', init_strategy='kmeans')
gmm.fit(X)
```

borchero commented 2 years ago

Hmm, I've seen this issue occur non-deterministically at times; thanks for the MWE! I'll try to investigate the issue in the coming days, but I'd also be happy about any more input ;)

In the meantime, you might get around your issue by increasing covariance_regularization.
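For example, something along these lines (an untested sketch on the MWE above; the value 1e-2 is arbitrary, and the assumption is that the parameter behaves like sklearn's reg_covar, i.e. it is added to the diagonal of each covariance estimate):

```python
from pycave.bayes import GaussianMixture

# Same data X as in the MWE above; only covariance_regularization is raised.
# 1e-2 is an illustrative value, not a recommendation; increasing it keeps
# the Cholesky factorization well-defined.
gmm = GaussianMixture(
    num_components=10,
    covariance_type='full',
    init_strategy='kmeans',
    covariance_regularization=1e-2,
)
gmm.fit(X)
```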

MacDaddio commented 2 years ago

Increasing covariance_regularization did indeed fix the problem in my case! It is strange that such a problem would occur, though. Here is my initial thought, although I am new to this code: since the data is unimodal, it is possible that the k-means initialization results in one or more clusters with very few data points. That would make the covariance of such a cluster rank deficient and hence not positive definite. However, both the scipy and pytorch Cholesky routines require positive-definiteness (rather than semi-definiteness) to work, so I am not sure what the problem is. I will look into it a bit more tomorrow, but thank you for the quick response!
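To illustrate that line of thinking with a standalone sketch (plain torch, not PyCave internals): a cluster with fewer points than dimensions yields a rank-deficient sample covariance, Cholesky then fails, and adding a small multiple of the identity restores positive definiteness.

```python
import torch

torch.manual_seed(0)

# A "cluster" with fewer points (3) than dimensions (20): its sample
# covariance has rank at most 2, so it cannot be positive definite.
points = torch.randn(3, 20).double()
cov = torch.cov(points.T)               # (20, 20), rank-deficient

print(torch.linalg.eigvalsh(cov)[:5])   # smallest eigenvalues are ~0

try:
    torch.linalg.cholesky(cov)          # typically fails: not positive-definite
except RuntimeError as err:
    print('Cholesky failed:', err)

# Adding a small multiple of the identity (what covariance_regularization
# does in spirit) makes the matrix positive definite again.
reg = 1e-6
torch.linalg.cholesky(cov + reg * torch.eye(20, dtype=cov.dtype))  # succeeds
```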

MacDaddio commented 2 years ago

I have found another test case which fails unexpectedly; below is my sample code. I use tied covariances here, and the ONLY way I can get the training to converge is to set the covariance regularization to 10.0. You can try values of 1.0, 1e-1, 1e-2, 1e-3 and 1e-6, but they all fail. I find this odd because the dimensionality of the data is relatively low here. If you look at the eigenvalues of the covariance matrix (using a covariance regularization of 1e-6), you get -0.7701, -0.5969, 0.9763, 0.9994, 1.0165, so it is clearly not positive-definite, although it is still symmetric. Therefore, the way the covariances are being computed must be at fault, because a covariance matrix cannot have eigenvalues like these. At first I thought there might be some small negative eigenvalues due to numerical precision errors, but these are of the same magnitude as the positive eigenvalues. I will try to find out more by poking around the lightning module stuff!

```python
from pycave.bayes import GaussianMixture
import torch

# Set seed
seed = 0
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

# Inputs
n = 10000
p = 5
k = 2
batch_size = 1000

# Make some Gaussian data
X = torch.randn(n, p)

# Fit PyCave GMM
gmm = GaussianMixture(
    num_components=k,
    covariance_type='tied',
    init_strategy='kmeans++',
    batch_size=batch_size,
    trainer_params={'gpus': 1, 'enable_progress_bar': False},
    covariance_regularization=1.0,
)
gmm = gmm.fit(X)
```
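The eigenvalue check mentioned above can be reproduced with something like the following (the gmm.model_.covariances access path is an assumption about the PyCave API; double-check the attribute names against the docs):

```python
import torch

# Assumed access path: the fitted parameters live on gmm.model_, and
# `covariances` holds the tied (p, p) covariance matrix.
cov = gmm.model_.covariances.squeeze()

# A symmetric eigendecomposition shows the negative eigenvalues reported above.
print(torch.linalg.eigvalsh(cov))
```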

borchero commented 2 years ago

Does this issue occur when you do not perform mini-batch training? Also, I would advise trying double precision (I think you can pass precision = 64 to the trainer params).
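Roughly like this (a sketch; leaving batch_size unset should give full-batch training, and 'precision' is forwarded to the underlying PyTorch Lightning Trainer):

```python
from pycave.bayes import GaussianMixture
import torch

torch.manual_seed(0)
X = torch.randn(10000, 5)

# Full-batch training (no batch_size) with double precision; 'precision'
# is a PyTorch Lightning Trainer argument passed through trainer_params.
gmm = GaussianMixture(
    num_components=2,
    covariance_type='tied',
    init_strategy='kmeans++',
    covariance_regularization=1e-6,
    trainer_params={'gpus': 1, 'enable_progress_bar': False, 'precision': 64},
)
gmm = gmm.fit(X)
```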

MacDaddio commented 2 years ago

If I use the whole dataset of 10,000 points instead of mini-batches of 1,000 points, I still get the same issue for covariance regularization values under 1.0; however, a regularization of 1.0 now works. I also passed precision=64 to the trainer, with no change in behavior.

MacDaddio commented 2 years ago

I just realized that if I initialize with 'kmeans' instead of 'kmeans++', it works fine. So maybe there is something weird going on with 'kmeans++'?
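A quick way to compare the two strategies side by side (a sketch; it assumes the failure surfaces as a RuntimeError raised from fit, which the torch LinAlgError subclasses):

```python
from pycave.bayes import GaussianMixture
import torch

torch.manual_seed(0)
X = torch.randn(10000, 5)

# Same setup as above; only the initialization strategy changes.
for strategy in ('kmeans', 'kmeans++'):
    gmm = GaussianMixture(num_components=2, covariance_type='tied',
                          init_strategy=strategy, batch_size=1000,
                          covariance_regularization=1e-6)
    try:
        gmm.fit(X)
        print(strategy, 'converged')
    except RuntimeError as err:
        print(strategy, 'failed:', err)
```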

MacDaddio commented 2 years ago

When I fit GMMs using the kmeans or kmeans++ initializations, I get a non-positive-definiteness error if the covariance regularization is too low. This error typically comes from the initialization rather than from the fitting of the GMM itself. Could there be two different covariance regularizations, one for the initialization and one for fitting? One may want a more heavily regularized initialization to get a good start, without necessarily having a heavily regularized GMM fit. A hypothetical sketch of what that could look like is below.
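```python
# Purely hypothetical sketch of the requested API; init_covariance_regularization
# does NOT exist in PyCave today and is only shown to illustrate the idea.
gmm = GaussianMixture(
    num_components=2,
    covariance_type='tied',
    init_strategy='kmeans++',
    covariance_regularization=1e-6,       # used during the EM fit
    init_covariance_regularization=1e-2,  # hypothetical: used only during initialization
)
```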