LemonBoy / criterion.nim

Statistic-driven micro-benchmark framework

Feature request: zero overhead benchmarking #4

Open bluenote10 opened 5 years ago

bluenote10 commented 5 years ago

Currently it is difficult to isolate setup code from the part of the code that one really wants to measure. Silly example:

proc silly(n: int) {.measure: [100, 10000, 1000000].} =
  let s = newSeq[int](n)
  let l = s.len # <-- measure just this?
  doAssert l == n

The measurement is obviously completely dominated by the setup code let s = newSeq[int](n). A work-around is to initialize the data in the surrounding scope and close over the data variables, but this prevents parametrizing the benchmark, and there also seems to be overhead from copying the data into the benchmark. There is also the general overhead:

proc overhead(b: bool) {.measure: [true].} =
  doAssert b

which is ~4 cycles for me.

What about the following API:

proc silly(n: int): Measure {.measure: [100, 10000, 1000000].} =
  let s = newSeq[int](n)
  timed:
    let l = s.len # <-- measure just this?
  doAssert l == n

Where timed is a template that internally measures the time/cycles before and after executing its body and updates result with the corresponding time/cycle deltas. This would allow separating the setup and optimization-prevention code from the actual benchmark code.
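For illustration, a minimal sketch of how such a timed template could look, assuming a hypothetical Measure result type and using std/monotimes for the timing; none of these names exist in criterion.nim today:

import std/[monotimes, times]

# Hypothetical result type for the proposal; not part of criterion.nim.
type Measure = object
  elapsed: Duration

# Sketch of the proposed `timed` template: only its body is timed and the
# delta is accumulated into an explicitly passed Measure.
template timed(m: var Measure, body: untyped) =
  let start = getMonoTime()
  body
  m.elapsed = m.elapsed + (getMonoTime() - start)

# The example from above, with setup and assertion kept outside the timed
# region (the measure pragma is omitted since this is just a sketch):
proc silly(n: int): Measure =
  let s = newSeq[int](n)   # setup, not measured
  var l: int
  timed(result):
    l = s.len              # only this region is measured
  doAssert l == n          # post-measurement check, not measured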

LemonBoy commented 5 years ago

The idea is to allocate the seq externally and then pass it to the measuring function. You could write an iterator that does so and then pass it to measure, so that only the wanted code is measured.
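If I read the suggestion right, that could look roughly like the sketch below. The benchmark/newDefaultConfig setup and the exact way an iterator is hooked up to the measure pragma are my reading of criterion.nim's API and may not match it exactly; the point is only that each seq is built once, outside the measured proc:

import criterion

var cfg = newDefaultConfig()

benchmark cfg:
  # Each input seq is allocated once here, outside the measured body.
  iterator sizedSeqs(): seq[int] =
    for n in [100, 10_000, 1_000_000]:
      yield newSeq[int](n)

  # Only the body of this proc is sampled; allocation is not part of it.
  proc sillyLen(s: seq[int]) {.measure: sizedSeqs.} =
    doAssert s.len > 0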

> There is also the general overhead

That overhead is expected, and it is also expected to have a very low variance, so that comparing different runs gives meaningful values.

bluenote10 commented 5 years ago

It would just be very convenient to have a timed block, because:

I think even with an iterator the data is generated only once, and there is another loop which repeatedly uses that data, right?

Isn't there an unknown overhead due to Nim potentially copying the data that is passed to the function? I assume that with a large seq I would include the time to copy the seq in the measurement, and that's what I would like to avoid.

With the timed block one wouldn't have to worry about cases where the post-measurement assertion has non-negligible complexity.

LemonBoy commented 5 years ago

> I think even with an iterator the data is generated only once, and there is another loop which repeatedly uses that data, right?

Yes, for each element of the iterator N samples are collected. If you want to measure how the function behaves for M different inputs, just generate them beforehand with an iterator or whatever; changing the data in between runs is bound to produce garbage results and would defeat the whole point of this benchmarking scheme.

> Isn't there an unknown overhead due to Nim potentially copying the data that is passed to the function? I assume that with a large seq I would include the time to copy the seq in the measurement, and that's what I would like to avoid.

Once the data is ready a closure is allocated internally and then passed to the "real" benchmark function. The only overhead you'll see is a few cycles due to the extra indirection (if the C compiler didn't inline everything).

> With the timed block one wouldn't have to worry about cases where the post-measurement assertion has non-negligible complexity.

The blackbox function should be enough to stop the compiler from optimizing away everything while keeping a very low overhead.
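For example, the check from the first post could be routed through the blackbox function instead of a doAssert, so the compiler cannot drop the computation even though nothing heavy runs in the measured body. The exact spelling (blackBox and its signature) is my assumption about criterion.nim's API:

# Inside a benchmark block; `sizedSeqs` is the hypothetical iterator from the
# sketch above. The value goes through blackBox rather than an assertion, so
# the measured body stays minimal but cannot be optimized away.
proc sillyLen(s: seq[int]) {.measure: sizedSeqs.} =
  # Assuming blackBox returns its argument unchanged; the exact signature in
  # criterion.nim may differ.
  discard blackBox(s.len)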