Some of my tests will be slower than 1ms, so I guess I still want to set some minimum, or control the 1ms minimum (e.g. set it to 1 second, 10 seconds, etc.). Can we expose it as configuration/options?
A huge number of benchmarks will take more than 1ms per iteration, so is this problem only an issue for very small benchmarks? What does the PR look like when the block takes, say, 2ms to run?
I guess to put it another way, this PR doesn't change the behavior of a block that runs for 1ms or greater. Should we also be smoothing warmup for those larger blocks?
@ioquatix I'm curious, why would you like to control that? What difference would it make, except overhead which I think this PR addresses?
I think it's best and simpler to keep a single parameter for warmup, the existing `x.config(warmup: 2)`.
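For reference, a minimal sketch of how that existing option is used (timings are in seconds; the reported block is just a placeholder):

```ruby
require 'benchmark/ips'

Benchmark.ips do |x|
  # The single warmup knob: 2 seconds of warmup, 5 seconds of measurement.
  x.config(warmup: 2, time: 5)

  x.report("map") { [1, 2, 3].map { |n| n * 2 } }
end
```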
For blocks that run longer than 1ms, I would think the overhead of going through `call_times` on each iteration during warmup is negligible.
@evanphx But indeed, we should avoid always calling `call_times` with cycles=1 during warmup, also for blocks that take longer than 1ms (to avoid the recompilation mentioned above).
We could just keep doubling the number of cycles until the time spent is larger than the configured warmup. That would take approximately warmup + warmup / 2 + warmup / 4 + ... = 2 * warmup though (2x longer than the configured warmup). So we could instead do it with half the configured warmup time, and it should then end up running for about the configured warmup time. I'll experiment with that.
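A rough sketch of that doubling idea, assuming `warmup` holds the configured time in seconds and using `cycles.times(&block)` as a stand-in for `call_times(cycles)` (illustrative only, not the PR's code):

```ruby
# Double cycles until the time spent warming up exceeds the target.
# The batches form a geometric series (each ~2x the previous one), so the
# total time spent is up to about twice the duration of the last batch,
# hence the ~2 * warmup estimate above.
def doubling_warmup(warmup, &block)
  cycles = 1
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) - start < warmup
    cycles.times(&block) # stands in for call_times(cycles)
    cycles *= 2
  end
end
```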
If the block takes longer than the warmup time, I think we can't really do much. At that point, it's unlikely that `call_times` will compile, as it would only be called a few times with a very small number of cycles, so it probably doesn't matter much.
I pushed a new version which doubles cycles until up to half of the warmup time is spent, and then runs the remainder of the warmup with the number of cycles targeting 100ms batches, just like `run_benchmark`.
The main advantage is that this still runs warmup for the configured warmup time (at worst 100ms less; we could instead choose to run for at least the warmup time, which could be up to 100ms longer, if wanted).
This is a bit more complicated than the previous versions, but it seems worth it:
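For illustration, a hedged sketch of that two-phase shape (the helper name and structure here are mine, not the PR's actual diff; `cycles.times(&block)` again stands in for `call_times(cycles)`):

```ruby
def warmup_sketch(warmup, &block)
  clock = Process::CLOCK_MONOTONIC
  start = Process.clock_gettime(clock)

  # Phase 1: double cycles until half of the configured warmup is spent.
  cycles = 1
  before = after = start
  while after - start < warmup / 2.0
    before = Process.clock_gettime(clock)
    cycles.times(&block)
    after = Process.clock_gettime(clock)
    cycles *= 2
  end
  cycles /= 2 # the cycle count that was actually timed last

  # Phase 2: estimate how many cycles fit in ~100ms from the last timed
  # batch, then spend the remainder of the warmup running batches of that
  # fixed size, mirroring what run_benchmark does during measurements.
  duration = [after - before, 1.0e-9].max
  cycles_per_100ms = [(0.1 * cycles / duration).ceil, 1].max
  while Process.clock_gettime(clock) - start < warmup
    cycles_per_100ms.times(&block)
  end
end
```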
This looks good, I'm happy to merge this (and I'll get CI sorted out, the failures are not yours). Would you rather have this than #95 then?
@evanphx Great!
@evanphx I replied on https://github.com/evanphx/benchmark-ips/pull/95#issuecomment-570798729 with a summary, please review that PR too based on that summary and the PR description.
@evanphx Gentle ping, could you merge this PR and #95? Is there anything else I should do?
If you want the CI to be green first, I could give it a shot with GitHub Actions and https://github.com/ruby/setup-ruby, or just remove old Rubies no longer working on TravisCI.
@evanphx If it helps I would be happy to help maintain this gem by, e.g., fixing the CI and reviewing PRs.
At this point what I'd really like is some answer. Waiting 5+ months to merge a PR on such an important gem seems unfortunate.
`run_warmup` always calls `call_times` with cycles=1 and the JIT might speculate on that to remove the loop, but then `run_benchmark` suddenly calls `call_times` with cycles > 1, which needs the loop and therefore causes deoptimization and later recompilation, during the actual measurements.

This PR addresses the second issue of the benchmark shown in #95. I believe it also addresses #94, cc @ioquatix
When running this benchmark on MRI with current master, I see:
I noticed the warmup times, which are per 100ms, are not really close to a tenth of the measurement times, which are per 1000ms (1 second). This is surprising, as on MRI for this benchmark I would expect no difference between warmup and actual measurement. Yet we see `env == 'development'` at 426k i/100ms (= 4.26M i/s) during warmup and then 10.818M i/s during measurements. Did MRI magically get more than twice as fast when it realized we are actually benchmarking and not just warming up? I would not think so.

The reason is that the warmup phase uses `call_times(cycles=1)` while the measurement phase uses `call_times(cycles)` with `cycles` typically far greater than 1 (in this case, around 200,000). Using `call_times(1)` has a significant overhead as shown here, because every time we need to go into `call_times` (https://github.com/evanphx/benchmark-ips/blob/0bb23ea1f5d8f65416629505889f6dfc550fa4a6/lib/benchmark/ips/job/entry.rb#L46-L56): read a few instance variables, enter the loop for just one call, exit the loop, and return from `call_times`.
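For context, the loop behind that link is roughly of this shape (a simplified paraphrase, not the exact source):

```ruby
# Simplified shape of Entry#call_times: with cycles == 1, the fixed cost of
# the method call, the instance variable read, and the loop setup/exit is
# paid on every single block invocation instead of being amortized over many.
def call_times(cycles)
  act = @action
  i = 0
  while i < cycles
    act.call
    i += 1
  end
end
```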
So instead, if we adapt the warmup to call `call_times` with enough cycles to take at least 1ms, we reduce the overhead significantly, as this PR does:

Now warmup and measurement timings are consistent and make sense.
Of course, nobody should use only warmup times for interpreting results, but nevertheless I believe it's good for warmup and measurement times to match, as it can be an indication of whether enough warmup happened and how stable the code being benchmarked is.
The same issue also happens on TruffleRuby (because the warmup loop cannot be compiled efficiently with the previous code, only with OSR after many many iterations). Without this PR:
With this PR:
And warmup is then consistent with measurements (instead of being apparently much slower).
cc @chrisseaton