TorchEnsemble-Community / Ensemble-Pytorch

A unified ensemble framework for PyTorch to improve the performance and robustness of your deep learning model.
https://ensemble-pytorch.readthedocs.io
BSD 3-Clause "New" or "Revised" License

new implementation of Bagging #120

Closed SunHaozhe closed 1 year ago

SunHaozhe commented 1 year ago

As discussed in this Issue, here I provide the new way of implementing bagging.

The old implementation sampled with replacement after each batch was drawn, and duplicates were then removed. This new implementation samples with replacement once, at the very beginning of Bagging's fit method, and maintains several independent dataloaders/datasets, one per base estimator.

The advantages of this new implementation:

Maintaining N independent dataloaders (for N estimators) during training does not waste memory, because only one copy of the original dataset is actually stored in memory/on disk. This can be seen in the source code of torch.utils.data.Subset: we only keep N lists of integer indices, so the memory overhead is negligible.

The modification is minimal and does not break the existing API.

xuyxu commented 1 year ago

Thanks for your nice PR! I will fix the coding style and CI issues. Meanwhile, could you evaluate the model performance before and after your merge (e.g., run a bagging classifier on CIFAR-10), and use memory_profiler to monitor memory usage? Thanks!

EDIT: Please ignore the CI problem. All unit tests finished successfully on my local machine.

xuyxu commented 1 year ago

@all-contributors please add @SunHaozhe as a contributor for code

allcontributors[bot] commented 1 year ago

@xuyxu

I've put up a pull request to add @SunHaozhe! :tada:

SunHaozhe commented 1 year ago

> could you evaluate the model performance before and after your merge (e.g., run a bagging classifier on CIFAR-10), and use memory_profiler to monitor memory usage? Thanks!

Thanks a lot for your reply :)

OK, I will evaluate the model performance before and after my merge by running a bagging classifier on CIFAR-10. I will also use memory_profiler to check whether the memory overhead is actually negligible, and will update here as soon as I have results.

SunHaozhe commented 1 year ago

I evaluated the model performance before and after my merge. Concretely, I ran the Bagging classifier on CIFAR-10 with LeNet-5 for 100 epochs, testing two settings: n_estimators=5 and n_estimators=10.

Here are the results:

Before this merge (n_estimators=5): Validation Acc: 76.180 % | Historical Best: 76.660 %
After this merge (n_estimators=5): Validation Acc: 77.190 % | Historical Best: 77.190 %

Before this merge (n_estimators=10): Validation Acc: 77.240 % | Historical Best: 77.710 %
After this merge (n_estimators=10): Validation Acc: 77.520 % | Historical Best: 77.910 %

The values reported by memory_profiler contain some randomness. The line-wise memory increment for the line train_loader = _get_bagging_dataloaders(train_loader, self.n_estimators) is 2.2 MiB and 0.9 MiB, respectively, for the two runs after the merge. The fit() function consumes about 3850.75 MiB in total before this merge and about 3873.55 MiB after it, so the memory overhead is light.
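As a rough cross-check of the "N lists of indices" overhead, the allocation can also be measured with the standard library's tracemalloc (used here instead of memory_profiler for self-containedness; the sizes are CIFAR-10-like assumptions, and the helper below is illustrative, not the PR's code):

```python
import random
import tracemalloc


def build_index_lists(n_estimators, n_samples):
    # One bootstrap index list per estimator, sampled with replacement.
    return [
        [random.randrange(n_samples) for _ in range(n_samples)]
        for _ in range(n_estimators)
    ]


tracemalloc.start()
# 10 estimators over a 50,000-sample dataset, as in the CIFAR-10 runs.
index_lists = build_index_lists(n_estimators=10, n_samples=50_000)
current, _peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Only the integer indices are allocated here; the dataset itself
# would be untouched, so the overhead stays small relative to fit().
print(f"index lists: ~{current / 2**20:.1f} MiB")
```

The exact figure varies by Python version and allocator state, which is consistent with the run-to-run variation seen in the memory_profiler output above.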

The code for this test can be found in the before-merge test branch and the after-merge test branch (the script test_bagging_cifar10.py).

xuyxu commented 1 year ago

Merged. Thanks for your great work! @SunHaozhe

SunHaozhe commented 1 year ago

Thanks a lot :)