AntonioCarta opened this issue 1 year ago
The current meaning is actually different:
- tick = we are able to reproduce the target performance of the reference paper (we do not necessarily use the same setup as the reference paper);
- cross = we are not able to reproduce the target performance of the reference paper, and we do not know whether this is due to a bug in the strategy;
- bug = we are not able to reproduce the target performance of the reference paper, and we know for sure that this is due to a bug in the strategy.
Ok, I misunderstood the notation. Maybe we should add how far we are from the target result?
Yes, we can. I didn't want to clutter the table so I put the reference performance inside the comments in the experiments. I think we could create a separate table in the README to briefly show the gap. I also created issue #33 to keep track of what's missing. I could also add the gap there.
Maybe we need to strictly separate two types of experiments: faithful paper reproductions on one side, and CL baselines that simply provide a reasonable reference performance with the Avalanche implementations on the other.
IMO CLB is still valuable as long as the methods in avalanche are correct and the clean implementation provides a reasonable reference value. Reproducing papers requires digging into whatever tricks the authors decided to add. While useful, it's very time consuming and we cannot afford to do it ourselves, as we have already seen. Of course we can support external contributions on this.
With paper reproductions do you also mean the same hyperparameters as the original paper? In the end, I think that is less interesting (and we would have only a few strategies marked as such). One would probably use CL baselines to understand how to reach the same performance as the original paper, even though the hyperparameters may differ. I guess that better describes the concept of reproducibility when you use a different codebase than the one you are trying to reproduce.
> With paper reproductions do you also mean the same hyperparameters as the original paper?
Same performance, scenario, model architectures, and so on. Some hyperparameters (lr, regularization strength) may change due to minor differences in the framework/implementation.
I changed the table in the README. It now shows "Avalanche" when the experiment is not present in a specific paper. I also added the reference performance with the related paper (when available).
This is a nice improvement. Do we have any explanation for the gaps in some experiments? e.g. different hyperparameters, fewer epochs, ...
Not really, we can speculate but nothing more at the moment.
It's fine, but we should keep track of this somewhere: at least a log of the attempts and some notes about what failed. I'm not sure about the best form for it; a comment in the header of the script may be enough.
For example, maybe we find out that the difference is due to a mistake in the original paper (e.g. they look at the validation loss instead of the test loss). In this case, we should explain the reason behind the performance difference.
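As a minimal sketch (hypothetical strategy/benchmark and placeholder notes, not real results), the header of an experiment script could look like this:

```python
"""ExampleStrategy on Split MNIST -- reproduction notes (hypothetical example).

Target: <reference accuracy reported in the paper, see README table>

Attempts:
- hyperparameters as in the paper        -> below target
- tuned lr / regularization strength     -> smaller gap, still below target

Possible cause of the remaining gap: the paper may report validation
rather than test accuracy (to be verified).
"""
```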
I propose to switch the notation. Right now we have:
- tick for reproduced results;
- cross for results we are not able to reproduce (cause unknown);
- bug for bugs.

IMO, this is very confusing at first glance. If I see a big red cross I immediately think there is a problem with the strategy. In this case, everything is actually correct, we just changed some hyperparameters or tested a new benchmark.
Instead we could have two separate columns:
- one for the correctness of the strategy (bug / no bug);
- one for paper reproduction, with a custom tag if not using any paper.
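For example (just a sketch of the idea, with a hypothetical strategy name and illustrative markers):

| Strategy | Correct | Reproduces paper |
|----------|---------|------------------|
| ExampleStrategy (paper setup) | yes | yes |
| ExampleStrategy (new benchmark) | yes | custom |
| ExampleStrategy (known bug) | bug | no |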