ContinualAI / avalanche

Avalanche: an End-to-End Library for Continual Learning based on PyTorch.
http://avalanche.continualai.org
MIT License

`ReplayPlugin` is slow when `num_workers` > 0 #1348

Open HamedHemati opened 1 year ago

HamedHemati commented 1 year ago

This is an expected issue: the `ReplayPlugin` becomes slow when the number of workers is larger than zero, especially when the buffer's dataloader is much smaller than the current task's loader. This happens regardless of the storage policy type. Using `num_workers > 0` can be crucial for dataloaders with many augmentations or with large sample sizes.

One potential solution is to initialize both loaders only once before each experience. The max length can be set to the length of the task's loader, but instead of re-initializing the buffer's loader multiple times, we can give the buffer a cyclic sampler with the same length as the task's loader when initializing it.
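A minimal sketch of that idea, assuming a hypothetical `CyclicSampler` (not an existing Avalanche or PyTorch class): the buffer loader is built once per experience and its sampler keeps cycling over the (small) buffer until it has produced as many indices as the task loader needs.

```python
import torch
from torch.utils.data import Sampler, DataLoader


class CyclicSampler(Sampler):
    """Hypothetical sampler that cycles over a small buffer dataset until
    `num_samples` indices have been produced, so a single buffer DataLoader
    can be iterated alongside the (longer) task DataLoader."""

    def __init__(self, data_source, num_samples, shuffle=True):
        self.data_source = data_source
        self.num_samples = num_samples  # e.g. len(task_loader) * buffer_batch_size
        self.shuffle = shuffle

    def __iter__(self):
        produced = 0
        while produced < self.num_samples:
            order = (torch.randperm(len(self.data_source))
                     if self.shuffle else torch.arange(len(self.data_source)))
            for idx in order.tolist():
                if produced >= self.num_samples:
                    return
                yield idx
                produced += 1

    def __len__(self):
        return self.num_samples


# Usage sketch: both loaders are created once per experience.
# task_loader = DataLoader(task_dataset, batch_size=64, num_workers=4)
# buffer_loader = DataLoader(
#     buffer_dataset,
#     batch_size=64,
#     num_workers=4,
#     sampler=CyclicSampler(buffer_dataset, num_samples=len(task_loader) * 64),
# )
```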

P.S.: I couldn't reproduce the issue we discussed earlier regarding the interruptions in the class-balanced buffer for the Split-CIFAR100 benchmark (bs=64, mem_size=200). I only encounter that problem in a CIR scenario. I can't say for sure what causes the issue there, but in that particular case, a multi-group buffer has long interruptions at the beginning of the stream.

AntonioCarta commented 1 year ago

What should happen is:

@HamedHemati are you going to fix this?

P.S.: I couldn't reproduce the issue we discussed earlier regarding the interruptions in the class-balanced buffer for the Split-CIFAR100 benchmark (bs=64, mem_size=200). I only encounter that problem in a CIR scenario. I can't say for sure what causes the issue there, but in that particular case, a multi-group buffer has long interruptions at the beginning of the stream.

If you can reproduce it with CIR on the master branch, please open a separate issue.

AntonioCarta commented 1 year ago

@HamedHemati any update on this? Lorenzo and I are working on some orthogonal changes to the dataloaders, but I think at least this:

if we create multiple dataloaders, num_workers>0 should indicate the maximum number of workers. Instead, we create num_workers * num_groups workers. This slows down the initialization. Maybe we can rename the parameter to num_workers_per_group and forbid the use of num_workers? The alternative is to create a single dataloader, if possible.

should be fixed.
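For illustration, a hedged sketch of the quoted point (names here are hypothetical, not the current `ReplayDataLoader` code): with per-group loaders, each group spawns its own worker pool, so the total worker count multiplies; treating `num_workers` as a global budget keeps the total bounded.

```python
from torch.utils.data import DataLoader


def make_group_loaders(group_datasets, num_workers=8, batch_size=32):
    # Current behaviour: each group loader gets `num_workers`, so the
    # total number of worker processes is num_workers * len(group_datasets).
    # Sketch of the alternative: treat `num_workers` as a global budget
    # and split it across groups (at least one worker each).
    workers_per_group = max(1, num_workers // max(1, len(group_datasets)))
    return [
        DataLoader(ds, batch_size=batch_size, num_workers=workers_per_group)
        for ds in group_datasets
    ]
```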

HamedHemati commented 1 year ago

@HamedHemati any update on this?

I checked the latest version of ReplayDataLoader and realized there have been some new changes since the distributed training PR by @lrzpellegrini.

Changing the plugin to use a single loader would require some behavior changes. We would need to first merge all datasets and use a customized sampler that builds each batch with an equal number of samples from all datasets, as long as any of the datasets still has unseen samples. A simpler option would be to just shuffle all samples from all datasets and iterate through them. Although I'm not sure how the customized sampler would be merged with the distributed sampler in a distributed training setup.
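A rough sketch of the first option, with hypothetical names (`BalancedBatchSampler` is not part of Avalanche): concatenate the datasets and use a batch sampler that draws an equal share of each batch from every dataset that still has unseen samples.

```python
import random
from torch.utils.data import ConcatDataset, DataLoader, Sampler


class BalancedBatchSampler(Sampler):
    """Hypothetical batch sampler over a ConcatDataset: each batch takes an
    equal share from every sub-dataset that still has unseen indices."""

    def __init__(self, dataset_sizes, batch_size):
        self.batch_size = batch_size
        # Global index ranges of each sub-dataset inside the ConcatDataset.
        self.ranges, start = [], 0
        for size in dataset_sizes:
            self.ranges.append(list(range(start, start + size)))
            start += size

    def __iter__(self):
        pools = [random.sample(r, len(r)) for r in self.ranges]
        while any(pools):
            active = [p for p in pools if p]
            per_ds = max(1, self.batch_size // len(active))
            batch = []
            for p in active:
                batch.extend(p.pop() for _ in range(min(per_ds, len(p))))
            yield batch

    def __len__(self):
        # Number of batches, computed by simulating the same consumption
        # pattern as __iter__ on the remaining counts.
        remaining = [len(r) for r in self.ranges]
        n_batches = 0
        while any(remaining):
            active = [i for i, n in enumerate(remaining) if n > 0]
            per_ds = max(1, self.batch_size // len(active))
            for i in active:
                remaining[i] -= min(per_ds, remaining[i])
            n_batches += 1
        return n_batches


# Usage sketch:
# merged = ConcatDataset([task_dataset, buffer_dataset])
# loader = DataLoader(
#     merged,
#     batch_sampler=BalancedBatchSampler(
#         [len(task_dataset), len(buffer_dataset)], batch_size=64),
#     num_workers=4,
# )
```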

If we decide to keep parallel loaders, then to avoid dataloader re-initialization we need to add a cyclic sampler to the dataloader of each group. We also need to set the maximum length of the ReplayLoader equal to the "main" dataloader's length. Even in this case we would still need to deal with "merging" the cyclic sampler with the distributed sampler. Maybe @lrzpellegrini has some experience with this?

AntonioCarta commented 1 year ago

I think the best option is to merge the samplers and avoid creating multiple dataloaders. Dataloaders have parallel processes internally, they are expensive to start, and they may starve each other of resources.

HamedHemati commented 1 year ago

So this means that we should avoid using ReplayDataLoader?

There are still a few points left that need to be discussed:

AntonioCarta commented 1 year ago

So this means that we should avoid using ReplayDataLoader?

No, I'm saying that all the dataloaders that create NEW dataloaders INTERNALLY should instead use a single dataloader and rely on samplers to define custom behavior over multiple datasets whenever possible.

Do we want a random sampler (PyTorch's default when shuffle=True), or some specific way of creating batches after merging the datasets?

Whatever we used before, respecting user choices if a sampler is passed explicitly.

For replay in OCL strategies, merging a big buffer with the dataset from a small OCL experience is probably not what users would expect. So I guess we can add an option (set to None by default) that allows the user to choose how many samples from the buffer dataset should be merged with the current experience before each experience's training.

You are assuming random sampling over the concatenation, which is not what I'm saying. The sampler will still pick a bunch of indices independently for the two datasets; it just returns them sequentially (example: [idx_D1, idx_D2, idx_D1, ...]).
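A minimal sketch of that index stream (hypothetical class, and the detail of cycling the smaller stream is my assumption): indices are drawn independently for each dataset and emitted alternately over a ConcatDataset of [D1, D2].

```python
import random
from torch.utils.data import Sampler


class InterleavedSampler(Sampler):
    """Hypothetical sampler: picks indices independently for D1 and D2
    (stored as a ConcatDataset [D1, D2]) and yields them alternately,
    e.g. [idx_D1, idx_D2, idx_D1, idx_D2, ...]."""

    def __init__(self, len_d1, len_d2):
        self.len_d1, self.len_d2 = len_d1, len_d2

    def __iter__(self):
        d1 = random.sample(range(self.len_d1), self.len_d1)
        # D2 indices are offset because D2 follows D1 in the ConcatDataset;
        # the shorter index stream cycles so the alternation never runs dry.
        d2 = [self.len_d1 + i
              for i in random.sample(range(self.len_d2), self.len_d2)]
        for k in range(max(len(d1), len(d2))):
            yield d1[k % len(d1)]
            yield d2[k % len(d2)]

    def __len__(self):
        return 2 * max(self.len_d1, self.len_d2)
```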

Do we want this for the upcoming release?

No, it's for a future release.