Benchmark infrastructure is broken

xukai92 commented 6 years ago

It's probably better for us to redesign it.

xukai92 commented 6 years ago

Requirements for the new benchmark infrastructure (benchmarks/benchmarkhelper.jl):

Automatically benchmark models using Turing.jl and Stan.jl (with the same algorithm)
- Models for Turing.jl and Stan.jl are already implemented, but might not be 1.0 compatible.
- Turing.jl and Stan.jl shall share the same data for benchmarking (related issue: https://github.com/TuringLang/TuringExamples/issues/2)
- The data can be generated by scripts that run a forward pass of the model (usually named with .sim in this repo).
- The inference results from Stan.jl and Turing.jl should be compared. This can be done by checking statistics such as the mean and variance of the samples.
- It would be better to have an interface that can run only a subset of the models.
Automatically generate benchmark results
- The results should be in a clean format that can easily be posted to the wiki.
- The results should also be submitted to mLab.
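
As a rough illustration of the comparison and subset requirements above, here is a minimal, library-agnostic sketch in Python. It does not call Turing.jl or Stan.jl; the function names (`summarize`, `stats_agree`, `select_models`) and the tolerance values are assumptions made for illustration only, and the samples are taken to be plain lists of numbers for a single parameter.

```python
import statistics


def summarize(samples):
    """Return the mean and sample variance of a list of posterior samples."""
    return {
        "mean": statistics.fmean(samples),
        "var": statistics.variance(samples),
    }


def stats_agree(a, b, rtol=0.1, atol=0.05):
    """Check whether two sample sets have similar mean and variance.

    The tolerance is measured relative to the statistics of `b`
    (treated as the reference). rtol and atol are illustrative defaults.
    """
    sa, sb = summarize(a), summarize(b)
    return all(
        abs(sa[key] - sb[key]) <= atol + rtol * abs(sb[key])
        for key in ("mean", "var")
    )


def select_models(models, only=None):
    """Return all models, or just the ones named in `only`."""
    if only is None:
        return dict(models)
    return {name: model for name, model in models.items() if name in only}


if __name__ == "__main__":
    # Hand-written placeholder samples, not real inference output.
    reference = [1.1, 2.0, 3.0, 4.0, 4.9]
    candidate = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(stats_agree(candidate, reference))
```

Which tolerance is sensible depends on the number of samples and the model, so those values would need tuning per benchmark.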

xukai92 commented 6 years ago

The mLab point can be changed if we have a better way to automatically generate benchmark results that can be tracked between different commits.

mohamed82008 commented 6 years ago

Maybe relevant: https://github.com/mohamed82008/ComputExp.jl

xukai92 commented 6 years ago

Can you give me a brief intro to what that repo does?

mohamed82008 commented 6 years ago

It is a database approach to managing computational experiments with a focus on reproducibility. An example is given here https://github.com/mohamed82008/ComputExp.jl/blob/master/test/TestComputationalExperiments.ipynb. This was a pet project a few months ago, but I never actually completed it.

So we can define programming languages, with versions and specific commits. We can define problems, algorithms and implementations of algorithms. An implementation is associated with a package dependency (can be extended to more), a specific commit of that package, a script file and a function name to be called when running the implementation. Then there are problems with problem features/parameters, and there are experiments that compare implementations on problems. Each experiment has many runs, where the run is for a specific implementation and a specific problem setting. Then finally each run has some results.
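
To make the entities described above more concrete, here is a minimal sketch of the data model in Python. This is not ComputExp.jl's actual schema or API; every class and field name below is a hypothetical stand-in chosen to mirror the description (implementations pinned to a package commit, problems with parameters, experiments made of runs, runs carrying results).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Implementation:
    """An algorithm implementation pinned to a package commit (hypothetical)."""
    algorithm: str
    package: str
    commit: str      # commit of the package the implementation depends on
    script: str      # script file that contains the entry point
    function: str    # function to call when running the implementation


@dataclass(frozen=True)
class Problem:
    """A problem together with its features/parameters."""
    name: str
    parameters: tuple = ()  # tuple of (key, value) pairs


@dataclass
class Run:
    """One implementation applied to one problem setting, with its results."""
    implementation: Implementation
    problem: Problem
    results: dict = field(default_factory=dict)


@dataclass
class Experiment:
    """Compares implementations on problems; consists of many runs."""
    name: str
    runs: list = field(default_factory=list)

    def add_run(self, implementation, problem, **results):
        run = Run(implementation, problem, dict(results))
        self.runs.append(run)
        return run

    def results_for(self, package):
        """Collect the results of all runs that used the given package."""
        return [r.results for r in self.runs
                if r.implementation.package == package]
```

A real version would also persist these records to a database so they can be saved, reloaded, and queried later.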

So once the database is generated, one can save it, load it again and do some stats or plotting on the results. At least that's the intention, but I never got around to making it fully functional.

xukai92 commented 6 years ago

This is interesting, but I guess it goes far beyond what we need here. The reason we keep the automatically generated benchmark results is to track performance changes between commits. We probably want something pretty lightweight for this purpose.
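
One lightweight option along these lines, sketched in Python under stated assumptions: each benchmark timing is appended as one JSON line tagged with its commit, and two commits are compared by the ratio of their timings. The file layout, the field names (`commit`, `model`, `seconds`), and the helper names are all hypothetical; the commit identifier is passed in by the caller rather than read from git.

```python
import json
import tempfile
from pathlib import Path


def record(path, commit, model, seconds):
    """Append one benchmark timing to a JSON-lines file."""
    entry = {"commit": commit, "model": model, "seconds": seconds}
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def load(path):
    """Read all recorded timings back as a list of dicts."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def slowdown(records, old, new):
    """Map model -> new_time / old_time for models timed at both commits.

    If a (commit, model) pair was recorded more than once, the last
    entry wins.
    """
    times = {(r["commit"], r["model"]): r["seconds"] for r in records}
    old_models = {m for (c, m) in times if c == old}
    new_models = {m for (c, m) in times if c == new}
    return {m: times[(new, m)] / times[(old, m)]
            for m in sorted(old_models & new_models)}


if __name__ == "__main__":
    # Placeholder commits and timings, for illustration only.
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "benchmarks.jsonl"
        record(log, "commit_old", "model_a", 2.0)
        record(log, "commit_new", "model_a", 3.0)
        print(slowdown(load(log), "commit_old", "commit_new"))
```

A ratio above 1 means the model got slower at the newer commit. The same file can be checked into the repository or uploaded elsewhere, since it is plain text.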
