ContinualAI / avalanche

Avalanche: an End-to-End Library for Continual Learning based on PyTorch.
http://avalanche.continualai.org
MIT License

feature request: semi-supervised, unsupervised, and novelty detection common benchmarks #1329

Open AlbinSou opened 1 year ago

AlbinSou commented 1 year ago

More and more CL papers are tackling unsupervised learning, or a mix of supervised and unsupervised learning across tasks. I don't think there is a clean way to address this in Avalanche right now, so I'm proposing an idea.

I think it could be a nice idea to propose a general benchmark maker, or an option to existing generators, where you decide what percentage of labels is available for each task, with the option to choose which of these labels are available, or to just select them at random.

As an example, if we have three tasks, here is how it would go for different existing settings:

| Setting | Label % task 1 | Label % task 2 | Label % task 3 |
| --- | --- | --- | --- |
| Unsupervised | 0% | 0% | 0% |
| Semi-supervised | 5% | 5% | 5% |
| New Class Discovery (NCD) | 100% (pretraining task) | 0% | 0% |
| Mix of NCD and semi-supervised | 100% | 5% | 5% |
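
As a hedged illustration of what such an option could look like, the snippet below just encodes the table as per-experience label fractions; `label_fraction_per_experience` is a hypothetical parameter, not something the current Avalanche generators accept.

```python
# Per-experience fraction of labeled samples for each setting in the table.
SETTINGS = {
    "unsupervised":             [0.00, 0.00, 0.00],
    "semi_supervised":          [0.05, 0.05, 0.05],
    "ncd":                      [1.00, 0.00, 0.00],  # fully labeled pretraining task
    "ncd_plus_semi_supervised": [1.00, 0.05, 0.05],
}

# Hypothetical usage on top of an existing generator (the keyword argument
# does not exist yet; it is the proposed option):
# benchmark = SplitCIFAR10(
#     n_experiences=3,
#     label_fraction_per_experience=SETTINGS["semi_supervised"],
# )
```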

That way it should be pretty easy to implement all of these settings that play with the availability of the labels. It could be done by simply masking the chosen samples, i.e. replacing their labels with a given value (-1?). Strategies could then filter out which part of the input they can train on, based on the labels in the current experience; for instance, a `SupervisedTemplate` would only treat the samples that have a label, by means of something like `self.mbatch = self.mbatch[self.mbatch[1] != -1]`.
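
A minimal sketch of this masking/filtering idea, assuming unlabeled samples get the placeholder label -1; the helpers below are illustrative, not existing Avalanche code.

```python
import torch


def mask_labels(targets, labeled_fraction, unlabeled_value=-1):
    """Keep labels for a random `labeled_fraction` of the samples and
    replace the rest with `unlabeled_value` (e.g. -1)."""
    targets = torch.as_tensor(targets)
    n_labeled = int(round(labeled_fraction * targets.shape[0]))
    perm = torch.randperm(targets.shape[0])
    masked = targets.clone()
    masked[perm[n_labeled:]] = unlabeled_value
    return masked


def supervised_part(x, y, unlabeled_value=-1):
    """Select only the minibatch samples that still carry a label,
    mirroring the `self.mbatch[1] != -1` filtering above."""
    keep = y != unlabeled_value
    return x[keep], y[keep]


# Example: a 5%-labeled experience of 1000 samples.
y = mask_labels(torch.randint(0, 10, (1000,)), labeled_fraction=0.05)
x = torch.randn(1000, 3, 32, 32)
x_lab, y_lab = supervised_part(x, y)  # ~50 labeled samples remain
```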

This definitely needs more thinking, but I think a systematic treatment of these kinds of scenarios could be interesting to integrate.

AntonioCarta commented 1 year ago

The unsupervised case seems easy (remove the labels from the training data). For the semi-supervised case, should we have separate datasets in the training experience? Example:

```python
# at training time
train_exp.unlabeled_data  # unlabeled subset, <x, t> tuples (task id optional)
train_exp.labeled_data    # labeled subset, <x, y, t> tuples (task id optional)

# at evaluation time
train_exp.dataset         # full dataset with labels, only available at eval time; <x, y, t> tuples (task id optional)
train_exp.unlabeled_data  # the unlabeled subset, but at eval time we need to provide labels too; <x, y, t> tuples (task id optional)
train_exp.labeled_data    # same as the train data; <x, y, t> tuples (task id optional)
```
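
For completeness, a hedged sketch of how a strategy could consume this split at training time; `labeled_data` and `unlabeled_data` are the proposed attributes above, not something Avalanche exposes today.

```python
from torch.utils.data import DataLoader


def train_on_experience(train_exp, batch_size=64):
    # Supervised pass over the proposed labeled subset: <x, y, t> tuples.
    labeled_loader = DataLoader(train_exp.labeled_data,
                                batch_size=batch_size, shuffle=True)
    for x, y, t in labeled_loader:
        ...  # standard supervised update (e.g. cross-entropy on y)

    # Unsupervised pass over the proposed unlabeled subset: <x, t> tuples,
    # no targets available at training time.
    unlabeled_loader = DataLoader(train_exp.unlabeled_data,
                                  batch_size=batch_size, shuffle=True)
    for x, t in unlabeled_loader:
        ...  # e.g. self-supervised, consistency, or pseudo-labeling loss
```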

Not sure about NCD, I don't have a lot of experience with it.