borchero / pycave

Traditional Machine Learning Models for Large-Scale Datasets in PyTorch.
https://pycave.borchero.com
MIT License
126 stars 13 forks

Batch size must be a factor of total dataset size #21

Closed MacDaddio closed 2 years ago

MacDaddio commented 2 years ago

The mini-batch part of this repository works great! However, when the batch size is not a factor of the total dataset size, the code throws an error. Is there any way to make it so that any batch size can be used? Below is a minimal working example of what I am talking about. Essentially, if batch_size = 1000 then everything works fine and the mini-batch procedure seems to work with all 10 batches. However, when batch_size = 999, the last batch (of size 10) causes an error. Thanks!

```python
from pycave.bayes import GaussianMixture
import torch

# Set seed
seed = 0
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

# Inputs
n = 10000
p = 200
k = 5
batch_size = 999  # 1000

# Make some non-Gaussian data
X = torch.randn(n, p)

# Fit PyCave GMM
gmm = GaussianMixture(
    num_components=k,
    covariance_type='full',
    init_strategy='kmeans++',
    batch_size=batch_size,
    trainer_params={'gpus': 1, 'enable_progress_bar': False},
    covariance_regularization=1e-3,
)
gmm = gmm.fit(X)
```
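For reference, the uneven final batch follows directly from the arithmetic: 10000 is not a multiple of 999, so the data splits into ten batches of 999 samples plus a final batch of 10. A quick illustrative check (not part of the original report; it uses `torch.split` rather than PyCave's internal data loading):

```python
import torch

X = torch.randn(10000, 200)
sizes = [batch.shape[0] for batch in torch.split(X, 999)]
print(sizes)       # [999, 999, ..., 999, 10] -- ten full batches plus a final batch of 10
print(sum(sizes))  # 10000
```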

borchero commented 2 years ago

Thanks for the example code, I’ll have a look later!

borchero commented 2 years ago

Thanks a lot for this issue! It's an incredibly easy fix but has pretty big implications for mini-batch K-Means++ 😄
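For anyone hitting this before a release lands: a common way to stay robust to a shorter final batch is to weight per-batch statistics by the actual number of rows in each batch rather than the nominal batch_size. The sketch below is illustrative only and is not the PyCave fix; it simply shows a running mean that stays exact when the last batch is smaller:

```python
import torch

def batched_mean(batches):
    """Running mean over mini-batches of varying size.

    Weighting each batch by its actual row count (batch.shape[0])
    instead of a fixed batch_size keeps the estimate exact even
    when the final batch is smaller than the rest.
    """
    total, count = None, 0
    for batch in batches:
        s = batch.sum(dim=0)
        total = s if total is None else total + s
        count += batch.shape[0]
    return total / count

# Ten batches of 999 plus a final batch of 10, as in the example above
X = torch.randn(10000, 200)
mean = batched_mean(torch.split(X, 999))
assert torch.allclose(mean, X.mean(dim=0), atol=1e-5)
```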