AntonioCarta opened this issue 1 year ago
The current meaning is actually different:
- tick = we are able to reproduce the target performance of the reference paper (we do not necessarily use the same setup as the reference paper);
- cross = we are not able to reproduce the target performance of the reference paper, and we do not know whether this is due to a bug in the strategy;
- bug = we are not able to reproduce the target performance of the reference paper, and we know for sure that this is due to a bug in the strategy.
Ok, I misunderstood the notation. Maybe we should add how far we are from the target result?
Yes, we can. I didn't want to clutter the table so I put the reference performance inside the comments in the experiments. I think we could create a separate table in the README to briefly show the gap. I also created issue #33 to keep track of what's missing. I could also add the gap there.
Maybe we need to strictly separate two types of experiments: faithful paper reproductions on one side, and CL baselines that simply provide a reasonable reference performance with the Avalanche implementations on the other.
IMO CLB is still valuable as long as the methods in avalanche are correct and the clean implementation provides a reasonable reference value. Reproducing papers requires digging into whatever tricks the authors decided to add. While useful, it's very time consuming and we cannot afford to do it ourselves, as we have already seen. Of course we can support external contributions on this.
With paper reproductions do you also mean the same hyperparameters as the original paper? In the end, I think that is less interesting (and we would have only a few strategies marked as such). One would probably use CL baselines to understand how to reach the same performance as the original paper, even though the hyperparameters may differ. I guess that better describes the concept of reproducibility when you use a different codebase than the one you are trying to reproduce.
> With paper reproductions do you also mean the same hyperparameters as the original paper?
Same performance, scenario, model architectures, and so on. Some hyperparameters (lr, regularization strength) may change due to minor differences in the framework/implementation.
I changed the table in the README. It now shows "Avalanche" when the experiment is not present in a specific paper. I also added the reference performance with the related paper (when available).
This is a nice improvement. Do we have any explanation for the gaps in some experiments? e.g. different hyperparameters, fewer epochs, ...
Not really, we can speculate but nothing more at the moment.
It's fine, but we should keep track of this somewhere: at least a log of the attempts and some notes about what failed. I'm not sure about the best form for it; a comment in the header of the script may be enough.
For example, maybe we find out that the difference is due to a mistake in the original paper (e.g. they look at the validation loss instead of the test loss). In this case, we should explain the reason behind the performance difference.
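As a minimal sketch (hypothetical strategy/benchmark and placeholder notes, not real results), the header of an experiment script could look like this:

```python
"""ExampleStrategy on Split MNIST -- reproduction notes (hypothetical example).

Target: <reference accuracy reported in the paper, see README table>

Attempts:
- hyperparameters as in the paper        -> below target
- tuned lr / regularization strength     -> smaller gap, still below target

Possible cause of the remaining gap: the paper may report validation
rather than test accuracy (to be verified).
"""
```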
I propose to switch the notation. Right now we have:
- tick for reproduced results;
- cross for results we are not able to reproduce (cause unknown);
- bug for bugs.

IMO, this is very confusing at first glance. If I see a big red cross I immediately think there is a problem with the strategy. In this case, everything is actually correct, we just changed some hyperparameters or tested a new benchmark.
Instead we could have two separate columns:
- one for the correctness of the strategy (bug / no bug);
- one for paper reproduction, with a custom tag if not using any paper.
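For example (just a sketch of the idea, with a hypothetical strategy name and illustrative markers):

| Strategy | Correct | Reproduces paper |
|----------|---------|------------------|
| ExampleStrategy (paper setup) | yes | yes |
| ExampleStrategy (new benchmark) | yes | custom |
| ExampleStrategy (known bug) | bug | no |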