Some of my tests will be slower than 1ms, so I guess I still want to set some minimum, or control the 1ms minimum (e.g. set it to 1 second, 10 seconds, etc.). Can we expose it as configuration/options?
A huge number of benchmarks will take more than 1ms per iteration, so is this problem only an issue for very small benchmarks? What does the PR look like when the block takes, say, 2ms to run?
I guess to put it another way, this PR doesn't change the behavior of a block that runs for 1ms or greater. Should we also be smoothing warmup for those larger blocks?
@ioquatix I'm curious, why would you like to control that? What difference would it make, except overhead which I think this PR addresses?
I think it's best and simpler to keep a single parameter for warmup, the existing `x.config(warmup: 2)`.
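For reference, a minimal sketch of how that existing option is used (timings are in seconds; the reported block is just a placeholder):

```ruby
require 'benchmark/ips'

Benchmark.ips do |x|
  # The single warmup knob: 2 seconds of warmup, 5 seconds of measurement.
  x.config(warmup: 2, time: 5)

  x.report("map") { [1, 2, 3].map { |n| n * 2 } }
end
```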
For blocks that run longer than 1ms, I would think the overhead of going through `call_times` on each iteration during warmup is negligible.
@evanphx But indeed, we should avoid always calling `call_times` with cycles=1 during warmup, also for blocks that take longer than 1ms (to avoid the recompilation mentioned above).
We could just keep doubling the number of cycles until the time spent is larger than the configured warmup. That would take approximately warmup + warmup / 2 + warmup / 4 + ... = 2 * warmup though (2x longer than the configured warmup). So we could instead do it with half the configured warmup time, and it should then end up running for about the configured warmup time. I'll experiment with that.
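A rough sketch of that doubling idea, assuming `warmup` holds the configured time in seconds and using `cycles.times(&block)` as a stand-in for `call_times(cycles)` (illustrative only, not the PR's code):

```ruby
# Double cycles until the time spent warming up exceeds the target.
# The batches form a geometric series (each ~2x the previous one), so the
# total time spent is up to about twice the duration of the last batch,
# hence the ~2 * warmup estimate above.
def doubling_warmup(warmup, &block)
  cycles = 1
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) - start < warmup
    cycles.times(&block) # stands in for call_times(cycles)
    cycles *= 2
  end
end
```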
If the block takes longer than the warmup time, I think we can't really do much. At that point, it's unlikely that `call_times` will compile, as it would only be called a few times with a very small number of cycles, so it probably doesn't matter much.
I pushed a new version which doubles cycles until up to half of the warmup time is spent, and then runs the remainder of the warmup with the number of cycles targeting 100ms batches, just like `run_benchmark`.
The main advantage is that this still runs warmup for the configured warmup time (at worst 100ms less; we could instead choose to run for at least the warmup time, which could be up to 100ms longer, if wanted).
This is a bit more complicated than the previous versions, but it seems worth it:
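For illustration, a hedged sketch of that two-phase shape (the helper name and structure here are mine, not the PR's actual diff; `cycles.times(&block)` again stands in for `call_times(cycles)`):

```ruby
def warmup_sketch(warmup, &block)
  clock = Process::CLOCK_MONOTONIC
  start = Process.clock_gettime(clock)

  # Phase 1: double cycles until half of the configured warmup is spent.
  cycles = 1
  before = after = start
  while after - start < warmup / 2.0
    before = Process.clock_gettime(clock)
    cycles.times(&block)
    after = Process.clock_gettime(clock)
    cycles *= 2
  end
  cycles /= 2 # the cycle count that was actually timed last

  # Phase 2: estimate how many cycles fit in ~100ms from the last timed
  # batch, then spend the remainder of the warmup running batches of that
  # fixed size, mirroring what run_benchmark does during measurements.
  duration = [after - before, 1.0e-9].max
  cycles_per_100ms = [(0.1 * cycles / duration).ceil, 1].max
  while Process.clock_gettime(clock) - start < warmup
    cycles_per_100ms.times(&block)
  end
end
```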
This looks good, I'm happy to merge this (and I'll get CI sorted out, the failures are not yours). Would you rather have this than #95 then?
@evanphx Great!
@evanphx I replied on https://github.com/evanphx/benchmark-ips/pull/95#issuecomment-570798729 with a summary, please review that PR too based on that summary and the PR description.
@evanphx Gentle ping, could you merge this PR and #95? Is there anything else I should do?
If you want the CI to be green first, I could give it a shot with GitHub Actions and https://github.com/ruby/setup-ruby, or just remove old Rubies no longer working on TravisCI.
@evanphx If it helps I would be happy to help maintain this gem by, e.g., fixing the CI and reviewing PRs.
At this point what I'd really like is some answer. Waiting 5+ months to merge a PR on such an important gem seems unfortunate.
`run_warmup` always calls `call_times` with cycles=1 and the JIT might speculate on that to remove the loop, but then `run_benchmark` suddenly calls `call_times` with cycles > 1, which needs the loop and therefore causes deoptimization and later recompilation, during the actual measurements.

This PR addresses the second issue of the benchmark shown in #95. I believe it also addresses #94, cc @ioquatix
When running this benchmark on MRI with current master, I see:
I noticed the warmup times, which are per 100ms, are not really close to a tenth of the measurement times, which are per 1000ms (1 second). This is surprising, as on MRI for this benchmark I would expect no difference between warmup and actual measurement. Yet we see `env == 'development'` at 426k i/100ms (= 4.26M i/s) during warmup and then 10.818M i/s during measurements. Did MRI magically get more than twice as fast when it realized we are actually benchmarking and not just warming up? I would not think so.

The reason is that the warmup phase uses `call_times(cycles=1)` while the measurement phase uses `call_times(cycles)` with `cycles` typically far greater than 1 (in this case, around 200,000). Using `call_times(1)` has a significant overhead as shown here, because every time we need to go into `call_times` (https://github.com/evanphx/benchmark-ips/blob/0bb23ea1f5d8f65416629505889f6dfc550fa4a6/lib/benchmark/ips/job/entry.rb#L46-L56): read a few instance variables, enter the loop for just one call, exit the loop, and return from `call_times`.
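For context, the loop behind that link is roughly of this shape (a simplified paraphrase, not the exact source):

```ruby
# Simplified shape of Entry#call_times: with cycles == 1, the fixed cost of
# the method call, the instance variable read, and the loop setup/exit is
# paid on every single block invocation instead of being amortized over many.
def call_times(cycles)
  act = @action
  i = 0
  while i < cycles
    act.call
    i += 1
  end
end
```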
So instead, if we adapt the warmup to call `call_times` with enough cycles to take at least 1ms, we reduce the overhead significantly, as this PR does:

Now warmup and measurement timings are consistent and make sense.
Of course, nobody should use only warmup times for interpreting results, but nevertheless I believe it's good for warmup and measurement times to match, as it can be an indication of whether enough warmup happened and how stable the code being benchmarked is.
The same issue also happens on TruffleRuby (because the warmup loop cannot be compiled efficiently with the previous code, only with OSR after many many iterations). Without this PR:
With this PR:
And warmup is then consistent with measurements (instead of being apparently much slower).
cc @chrisseaton