JuliaDynamics / ABMFrameworksComparison

Benchmarks and comparisons of leading ABM frameworks

Seed all the models #7

Closed (rht closed this 1 year ago)

rht commented 3 years ago

It would be nice to seed the models in all the frameworks so that anyone looking to reproduce the benchmark can obtain an identical result up to some floating point uncertainty.

So far, only some of the models are seeded:

$ rg seed
Mesa/Schelling/benchmark.py
14:random.seed(2)

Mason/Schelling/Schelling.java
131:    public Schelling(final long seed) {
132:        this(seed, 50, 50);
135:    public Schelling(final long seed, final int width, final int height) {
136:        super(seed);

Mesa/ForestFire/benchmark.py
14:random.seed(2)

Mason/Flocking/FlockersBenchmark.java
168:    public FlockersBenchmark(final long seed) {
169:        super(seed);

Libbum commented 3 years ago

I'm not really sold on this; in fact, I removed the seeds in the Agents solutions just recently. We're running 100 or so samples most of the time and take the minimum runtime. If you run a model with the same seed, you have no reason to do such an ensemble, and you therefore lose some insight into how the framework handles these slight variations.

I'd advocate for removing these seeds you've identified instead.

@Datseris, what's your take on this?

rht commented 3 years ago

If you run a model with the same seed, you have no reason to do such an ensemble, and you therefore lose some insight into how the framework handles these slight variations.

I'm not sure I understand this sentence correctly. Did you mean that even with 100 different seeds for the 100 samples, you would still not be able to infer how the framework handles slight variations?

Libbum commented 3 years ago

Well, yes I can do that, but that's what a profiler is supposed to do. I guess I don't follow why I would need to generate the seeds manually when the profiler can do that.

Datseris commented 3 years ago

I'm not really sold on this; in fact, I removed the seeds in the Agents solutions just recently. We're running 100 or so samples most of the time and take the minimum runtime. If you run a model with the same seed, you have no reason to do such an ensemble, and you therefore lose some insight into how the framework handles these slight variations.

I'd advocate for removing these seeds you've identified instead.

I agree. A benchmark comparison should run the model with many different seeds, preferably random, to provide a more holistic view of performance. This is in fact the general practice when benchmarking and profiling models; you don't want to look at only one specific initialization, as this might be misleading.

I do not think that this makes the benchmark "less reproducible"; I would counter-argue that it makes it "more transparent", as we have taken into account fluctuations in performance due to different seeds. Besides, if someone wanted to replicate the benchmark comparisons, they would also need to run hundreds of seeds, which makes it extremely unlikely that they wouldn't get the same result as us. If you have done this and indeed do not get the same results, then please post the example here and we can investigate further.

rht commented 3 years ago

If the 100 seeds are specified in the code, anyone reproducing the benchmark will get an identical result. If someone wants to check whether the seeds were deliberately chosen in favor of Agents.jl, they can easily choose their own 100 seeds. Bonus: if they happen to find a conflicting result, they can conveniently post their seeds here to be verified, and the result is not lost forever. If the 100 seeds are specified, reproducibility is achieved without sacrificing the transparency of accounting for the fluctuations.
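
A rough sketch of what I have in mind (Julia shown, with placeholder names; not the repo's actual benchmark code):

using Random

# The 100 seeds are fixed in the code, so anyone re-running the benchmark
# samples exactly the same ensemble; swapping in your own seeds is trivial.
seeds = 1000:1099                        # hypothetical choice of 100 seeds

# Placeholder for whatever a framework's benchmark actually runs.
run_model(rng) = sum(rand(rng, 10_000))

times = Float64[]
for s in seeds
    t = @elapsed run_model(MersenneTwister(s))
    push!(times, t)
end
println(minimum(times))   # same minimum-over-ensemble methodology as now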

rht commented 3 years ago

Well, yes I can do that, but that's what a profiler is supposed to do. I guess I don't follow why I would need to generate the seeds manually when the profiler can do that.

The assumption that the profiler takes care of the seeds for you seems to be Julia-specific; it is not necessarily true of the benchmarking tools on the other platforms.

Libbum commented 3 years ago

Perhaps I didn't explain myself clearly enough, but the Julia profiler definitely does not run with the same set of seeds.

Reproducibility for benchmarks is not a goal when comparing frameworks like this. The whole point of doing a number of runs is to get a statistical result, not an explicit one. If we were to keep the same methodology but make it reproducible as you're suggesting, then ultimately the profiler is irrelevant. The benchmark is just 'find the fastest seed for this model'.

rht commented 3 years ago

If we were to keep the same methodology but make it reproducible as you're suggesting, then ultimately the profiler is irrelevant. The benchmark is just 'find the fastest seed for this model'.

I don't see a problem in picking, e.g., 2 as the starting seed for each model and just sticking to it. No "finding the fastest seed" is required, and you still get a statistical result over the 100 repetitions, as long as the system is not reseeded with the same seed at the beginning of each repetition.
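
A minimal sketch of that idea, again with a placeholder workload rather than the actual models:

using Random

rng = MersenneTwister(2)                 # seed once per model, e.g. with 2
run_once(rng) = sum(rand(rng, 10_000))   # placeholder for one model run

# The RNG keeps evolving across the 100 repetitions, so each repetition sees
# a different random stream, yet the whole ensemble is reproducible.
times = Float64[]
for _ in 1:100
    t = @elapsed run_once(rng)
    push!(times, t)
end
println(minimum(times))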

... but I counter argue that it makes it "more transparent", as we have taken into account fluctuations in performance due to different seeds.

From reading the Agents.jl code, it is unclear whether @benchmark actually uses different seeds for each of its individual runs. The RNG could possibly be seeded only once for all of them.

Actually, the BenchmarkTools.jl manual itself states:

If you use rand or something similar to generate the values that are used in your benchmarks, you should seed the RNG (or provide a seeded RNG) so that the values are consistent between trials/samples/evaluations.

While this tip doesn't necessarily apply to all cases, I don't think this repo's use case is an exception to the rule of thumb.
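
For reference, a minimal BenchmarkTools.jl sketch along the lines the manual recommends (the model function here is a hypothetical placeholder, not the repo's actual code):

using BenchmarkTools, Random

# Hypothetical stand-in for constructing and stepping a model.
run_model!(rng) = sum(rand(rng, 10_000))

# `setup` runs before each sample; with `evals=1` every evaluation gets a
# freshly seeded RNG, so the values are consistent between trials.
@benchmark run_model!(rng) setup=(rng = MersenneTwister(42)) evals=1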

Datseris commented 3 years ago

Okay, I'm convinced. Feel free to open a PR that adds seeds everywhere if you feel like it; otherwise we will do it at some point in the future. At the moment we have other priorities, and since this will not change any of our comparison results with respect to which software is faster, it is not a priority.

Tortar commented 1 year ago

I think the methodology @rht is suggesting is sound, so I will do the following when I have time:

seed the RNG in each framework (if possible) with 100 different seeds drawn from a seeded MersenneTwister (maybe the usual 42 is fine)
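
Something along these lines, presumably (Julia shown; each framework would need its own equivalent, assuming it exposes a seedable RNG):

using Random

master = MersenneTwister(42)          # the usual 42 as the master seed
seeds = rand(master, UInt32, 100)     # 100 reproducible per-run seeds

# Each of the 100 benchmark runs then seeds its own RNG from this list,
# e.g. MersenneTwister(seeds[i]) for run i; the other frameworks would use
# their own RNG type with the same seed values.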

Tortar commented 1 year ago

Closed via #22

rht commented 1 year ago

Thank you for making it happen!