coq / coq-bench

Scripts for differential performance testing of Coq packages / versions

download operation creates unnecessary noise in benchmarking results #1

Closed: ejgallego closed this issue 3 years ago

ejgallego commented 7 years ago

Unless I am missing something, using opam install is not suitable for benchmarking. Among other things, it will call constraint solvers and try to download packages, and could thus introduce significant differences in timing.
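(To make the concern concrete, a minimal sketch of what a naive run actually measures; the package name is just an example:)

```bash
# Everything lands in a single number here: solver time, download time,
# and build time are all folded into the measurement.
time opam install coq-mathcomp-ssreflect
```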

@maximedenes, what do you think?

ghost commented 7 years ago

That and other things:

The benefits (handling of dependencies) still seem, even in this state, to outweigh the drawbacks.

At the moment, we are evaluating the level of noise. By that I mean that, for the chosen branch, I check the results we get if we compile all packages once, twice, three times, or four times. That will give us an idea of how many iterations are enough (with respect to the desired precision).

Anyway, the time to resolve the dependencies and to download the stuff is, fortunately for the macro-benchmarks, negligible wrt. the whole compilation time.

Also, the plan is to look at the implementation of the opam command and add the subcommands we need. What I need (and do not yet have) is the ability to tell opam to just download the stuff and do nothing else.
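For what it's worth, `opam source` already comes close to a download-only mode. A rough sketch (treat the exact flags as an assumption; the package names are just examples):

```bash
# Prefetch the sources of the packages to be benchmarked, so that the
# later, timed builds pay no network cost.
for pkg in coq-mathcomp-ssreflect coq-flocq; do
  opam source "$pkg" --dir "prefetch/$pkg"
done
```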

ghost commented 7 years ago

opam already provides one thing that is essential: before we start benchmarking the chosen package, we can install all its dependencies and thus exclude their compilation from the benchmark.

ghost commented 7 years ago

The better counterargument against relying on opam is, I think, the following:

The implementation, in that respect, isn't particularly appealing. I am not very happy about it.

ejgallego commented 7 years ago

> Anyway, the time to resolve the dependencies and to download the stuff is, fortunately for the macro-benchmarks, negligible wrt. the whole compilation time.

That is not true in my experience. Also, in your iteration scheme, opam caching will kick in, distorting the results depending on the size of the download.

maximedenes commented 7 years ago

I agree it is not so clear whether the gains outweigh the drawbacks. Maybe we could try a hybrid approach, like using opam to install dependencies but doing the build to be benchmarked manually. This way, we would not need to create a fake package and we would have more control over what we measure, but we could still rely on opam to solve dependencies.
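Something like the following, perhaps. (A sketch only: `--deps-only` is a real opam flag, but the package name is hypothetical and the build step would follow each project's own recipe.)

```bash
# 1. Let opam solve and install the dependencies (untimed; may hit the network).
opam install --deps-only coq-package-under-test

# 2. Fetch and build the package itself by hand, so that only the
#    compilation ends up in the measurement.
opam source coq-package-under-test --dir build-dir
cd build-dir
time make -j1
```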

ejgallego commented 7 years ago

We could certainly share the build recipes in dev/ci, I agree that using opam for dependencies is OK.

ejgallego commented 7 years ago

Furthermore, we should ideally produce time reports per file. Most projects use coq_makefile, so that should be doable.

ejgallego commented 7 years ago

Changes in time by .v file.

ghost commented 7 years ago

That is a good idea.

ejgallego commented 7 years ago

In the future we will be able to track changes even per section and per sentence (sercomp will support that).

ghost commented 7 years ago

I've added your ideas to the TODO list.

ejgallego commented 7 years ago

I would rather have the issue closed when it is actually solved (hint: other interested people can then search, collaborate, and comment), but I guess it is your call.

ejgallego commented 7 years ago

It seems to me indeed that a feasible approach would be to teach coq_makefile to instrument developments for detailed benchmarking; cf. coq/coq#406.

In the end, we may want to use Python, R, or some other higher-level language, though, if we are going to produce so much data. bash is just not gonna cut it, IMHO.

ghost commented 7 years ago

@ejgallego To reduce the confusion a little bit, do you think you could restate the points that led you to conclude:

> opam install is not fully suitable for benchmarking.

Several things (related and unrelated) were mentioned, and even the original focus has shifted a little. So a restatement might help us, I think.

ghost commented 7 years ago

The way I originally planned to address this problem is to fix the issue at the root, so that we can keep the benefits opam provides.

ejgallego commented 7 years ago

That looks good; given the scope of the benchmarking, it may make sense for now.

However, if/when we want to do fancier tricks, such as calling coq_makefile --benchmark, we will either have to modify opam or think of some kind of trick (maybe an env variable? see the sketch below).

So let's keep an eye on it.
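To sketch the env-variable trick: everything here is hypothetical, since neither the variable nor the --benchmark flag exists today; this is just the shape of the idea.

```bash
# The bench driver exports a flag before handing the build to opam...
export COQ_BENCH_INSTRUMENT=1   # hypothetical variable, not a real knob
opam reinstall coq-package-under-test
# ...and the package's build recipe would check it, e.g. by regenerating
# its Makefile with `coq_makefile --benchmark` when the flag is set.
```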

> we can keep the benefits opam provides.

I am not so convinced yet that opam provides many benefits for testing bleeding-edge code. A good example is Iris: they had to roll out their own package system, as opam did not have the proper granularity; cf. https://gitlab.mpi-sws.org/FP/iris-coq/issues/83

ghost commented 7 years ago

We should make sure that the "download time" does not influence the benchmarking of the "build time".

ejgallego commented 7 years ago

Be aware that when using opam there are other aspects beyond download time that may influence the timing; hence the original title was more accurate.

ejgallego commented 5 years ago

This is still an issue, and a serious one, it seems.

JasonGross commented 4 years ago

Note that if you run make TIMED=1 --output-sync on a coq_makefile-generated Makefile, the Python scripts I have can process the logs of the resulting build and display per-file timing diffs as well as whole-project timing diffs. (We unfortunately don't get CPU cycles, mem faults, etc.)
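(To illustrate the per-file diff idea, a rough bash sketch, not the Python tooling mentioned above; it assumes TIMED log lines roughly of the shape `theories/Foo.vo (real: 12.34, ...)`, which may not match the real format exactly:)

```bash
# Extract "file seconds" pairs from an old and a new TIMED build log,
# then print the per-file delta.
extract() { sed -n 's/^\([^ ]*\.vo\) (real: \([0-9.]*\).*/\1 \2/p' "$1" | sort; }
join <(extract old.log) <(extract new.log) \
  | awk '{ printf "%-50s %8.2fs -> %8.2fs (%+.2fs)\n", $1, $2, $3, $3 - $2 }'
```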

ppedrot commented 4 years ago

FWIW, I'm also using a script to filter the HTML output of the bench; it sets the TIMED variable and allows retrieving a per-line diff. It's essentially the same as the one provided in the repo, except that it's quasi-linear in the number of lines. (The one from the bench is cubic, IIRC.)