ContinualAI / avalanche

Avalanche: an End-to-End Library for Continual Learning based on PyTorch.
http://avalanche.continualai.org

Add the possibility to pretrain on multiple tasks #743

Open AlbinSou opened 2 years ago

AlbinSou commented 2 years ago

For the moment, the nc_benchmark generator function offers an nc_first_task option, which is useful for pre-training in the class-incremental scenario. However, no equivalent option is available if one wants to pretrain in the task-incremental scenario. It would be nice to have an option that can be used together with task_labels=True and allows pretraining on multiple tasks at the same time, in a multitask fashion.

This kind of pre-training is used, for instance, in Lifelong Learning of Compositional Structures.
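
To illustrate, the usage I have in mind would look roughly like this (the pretrain_tasks option below is purely hypothetical, just to show the intended behaviour):

from avalanche.benchmarks.generators import nc_benchmark

scenario = nc_benchmark(
    train_set, test_set,  # e.g. the CIFAR-100 train/test sets
    n_experiences=10,
    task_labels=True,
    # hypothetical option: merge the first 4 tasks into a single
    # multitask pretraining experience
    pretrain_tasks=4,
)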

A quick fix that I'm using for now, but that breaks some things (maybe this should be filed as a bug?), is the following:

from avalanche.benchmarks.utils import AvalancheConcatDataset

# Number of tasks to pretrain on
pretrain = 4
pretrain_datasets = [exp.dataset for exp in scenario.train_stream[:pretrain]]

# Modify the first experience in place so that it contains the data
# of the first `pretrain` experiences
first_experience = scenario.train_stream[0]
first_experience.dataset = AvalancheConcatDataset(pretrain_datasets)

# Pretrain on the modified first experience
cl_strategy.train(first_experience)

# Train on the remaining experiences one by one
for experience in scenario.train_stream[pretrain:]:
    cl_strategy.train(experience)

Doing this works as intended, except that it multiplies the batch_size by the number of pretraining tasks for some reason.

AntonioCarta commented 2 years ago

I agree about the nc_first_task option, we should also have it for multi-task scenarios.

Your snippet seems wrong. Instead of modifying the experiences in place, it's easier to create a new benchmark by first concatenating/splitting the datasets however you like, and then using one of the generic builders, like dataset_benchmark.
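
Something along these lines should work (untested sketch, reusing scenario, pretrain and cl_strategy from your snippet):

from avalanche.benchmarks.generators import dataset_benchmark
from avalanche.benchmarks.utils import AvalancheConcatDataset

# One multitask dataset for the first `pretrain` experiences,
# then the remaining experiences unchanged
pretrain_train = AvalancheConcatDataset(
    [exp.dataset for exp in scenario.train_stream[:pretrain]])
rest_train = [exp.dataset for exp in scenario.train_stream[pretrain:]]

# Keep the test stream aligned with the new train stream
pretrain_test = AvalancheConcatDataset(
    [exp.dataset for exp in scenario.test_stream[:pretrain]])
rest_test = [exp.dataset for exp in scenario.test_stream[pretrain:]]

# Build a fresh benchmark: one big pretraining experience + the rest
new_benchmark = dataset_benchmark(
    [pretrain_train] + rest_train,
    [pretrain_test] + rest_test)

for experience in new_benchmark.train_stream:
    cl_strategy.train(experience)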

If you still get an error using dataset_benchmark, feel free to open a question on the Discussions.

AlbinSou commented 2 years ago

> I agree about the nc_first_task option, we should also have it for multi-task scenarios.
>
> Your snippet seems wrong. Instead of modifying the experiences in place, it's easier to create a new benchmark by first concatenating/splitting the datasets however you like, and then using one of the generic builders, like dataset_benchmark.
>
> If you still get an error using dataset_benchmark, feel free to open a question on the Discussions.

Yes, I agree that this is an ugly fix. I also tried it the way you suggested, but the batch size is still multiplied by the number of tasks in the first experience. I think this comes from TaskBalancedDataLoader, but I don't know whether it's intended that the batch size is increased that way.

AntonioCarta commented 2 years ago

Ok, now I get it. Yes, that's normal: some of the dataloaders, like TaskBalancedDataLoader, add batch_size samples for each group (task/experience/...). Maybe we should rename the parameter to avoid confusion.
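
To make it concrete (untested sketch, using the merged first_experience from your snippet, which contains 4 task labels):

from avalanche.benchmarks.utils.data_loader import TaskBalancedDataLoader

loader = TaskBalancedDataLoader(first_experience.dataset, batch_size=32)
x, y, t = next(iter(loader))
# each of the 4 tasks contributes batch_size samples, so the actual
# mini-batch contains roughly 4 * 32 = 128 samples
print(x.shape[0], t.unique())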