borchero / pycave

Traditional Machine Learning Models for Large-Scale Datasets in PyTorch.
https://pycave.borchero.com
MIT License
126 stars, 13 forks

Mini-batch training on GMM #19

Open Daisy-GENG opened 2 years ago

Daisy-GENG commented 2 years ago

Hi,

I want to implement mini-batch training for a GMM as discussed in #7. However, I am a little confused by the code gmm.reset_parameters(torch.Tensor(fvectors[:500].astype(np.float32))). I am not sure whether the problem is related to my version of pycave or whether I have misunderstood the code in #7, but my code doesn't work.

My code is as follows:

import torch
from pycave.bayes.gmm import GaussianMixture as GM
from dataloader.gmm_dataset import gmm_dataset

train_gmm_dataset = gmm_dataset(data_path)
train_dataset_loader = torch.utils.data.DataLoader(
    dataset=train_gmm_dataset,
    batch_size=train_dataloader_config["batch_size"],
    shuffle=train_dataloader_config["shuffle"],
    num_workers=train_dataloader_config["num_workers"],
)

for i, data in enumerate(train_dataset_loader):  # data:[1, pt, 3]
    data = torch.squeeze(data, 0)
    gmm = GM(num_components=2, covariance_type="diag", init_strategy="kmeans")
    gmm.model_.reset_parameters(data)  
    history = gmm.fit(train_dataset_loader)

And the error is:

`GaussianMixture` has not been fitted yet

Thank you so much!

Best regards, Daisy

borchero commented 2 years ago

Issue #7 referred to PyCave version 2. In PyCave v3, you don't need to call gmm.model_.reset_parameters, and you can't call it before fitting: the model_ attribute only becomes available once fit has returned without error.

I believe this is the line that causes your error.
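For anyone puzzled by the error message, here is a minimal sketch of the general "fitted attribute" convention that the explanation above describes. It is an assumption-laden illustration, not PyCave's actual code: `TinyEstimator` and `NotFittedError` are hypothetical names, and the "training" is just a mean.

```python
class NotFittedError(Exception):
    """Raised when a fitted attribute is read before fit() has run."""


class TinyEstimator:
    """Hypothetical stand-in for an estimator; not PyCave's class."""

    def __init__(self):
        self._model = None  # nothing has been learned yet

    @property
    def model_(self):
        # Trailing-underscore attributes only exist after fitting.
        if self._model is None:
            raise NotFittedError("`TinyEstimator` has not been fitted yet")
        return self._model

    def fit(self, data):
        # "Training" here is only the mean; a real estimator does far more.
        self._model = {"mean": sum(data) / len(data)}
        return self


est = TinyEstimator()
try:
    est.model_  # read too early, like the failing line in the question
    error_message = None
except NotFittedError as exc:
    error_message = str(exc)

est.fit([1.0, 2.0, 3.0])
fitted_mean = est.model_["mean"]  # works now that fit() has returned
```

The takeaway is simply the ordering: construct the estimator, call fit, and only then read model_.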

Daisy-GENG commented 2 years ago

So is there a similar way to implement batch training in PyCave version 3 using a DataLoader? My whole dataset is large, so I cannot load all the data into memory at once.

Thank you so much!

Best regards, Daisy

borchero commented 2 years ago

Ah, sorry! Yes, you can simply set the batch size when initializing the GMM. In your case, you might, for example, use:

gmm = GM(..., batch_size=8192)

This will automatically take care of loading the data in batches, both for initialization and for GMM training. Note that you might be better off with init_strategy='kmeans++', since kmeans is quite costly to run. You'll need PyCave 3.1.3 for that, though (kmeans++ initialization had a bug in earlier versions).
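In case it helps to see why a batch size is all that is needed, here is a rough sketch of one EM iteration for a GMM that only ever touches one batch at a time. This is not PyCave's implementation, just the general idea under simplifying assumptions: 1-D data, pure Python, and hypothetical helper names (`responsibilities`, `em_step`, `chunked`, where `chunked` stands in for a DataLoader). The M-step only needs three running sums per component, so the batched result matches a pass over the full dataset.

```python
import math
import random


def responsibilities(x, weights, means, variances):
    """Posterior probability of each component for a single 1-D point."""
    dens = [
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    ]
    total = sum(dens)
    return [d / total for d in dens]


def em_step(batches, weights, means, variances):
    """One EM iteration that holds only a single batch in memory at a time."""
    k = len(weights)
    n = 0
    s0 = [0.0] * k  # sum of responsibilities
    s1 = [0.0] * k  # responsibility-weighted sum of x
    s2 = [0.0] * k  # responsibility-weighted sum of x**2
    for batch in batches:
        for x in batch:
            r = responsibilities(x, weights, means, variances)
            for j in range(k):
                s0[j] += r[j]
                s1[j] += r[j] * x
                s2[j] += r[j] * x * x
        n += len(batch)
    # M-step from the accumulated sufficient statistics.
    new_weights = [s / n for s in s0]
    new_means = [s1[j] / s0[j] for j in range(k)]
    new_vars = [max(s2[j] / s0[j] - new_means[j] ** 2, 1e-6) for j in range(k)]
    return new_weights, new_means, new_vars


def chunked(data, batch_size):
    """Yield consecutive slices; a stand-in for a DataLoader."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


random.seed(0)
data = [random.gauss(-2.0, 0.5) for _ in range(500)]
data += [random.gauss(3.0, 1.0) for _ in range(500)]
random.shuffle(data)

init = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
batched = em_step(chunked(data, 64), *init)  # 64 points at a time
full = em_step([data], *init)                # everything in one "batch"
```

Here `batched` and `full` agree, which is why the batch size affects memory use rather than the result of an iteration in this scheme. With PyCave itself you would not write any of this; passing batch_size to the constructor is the supported route.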