evanphx / benchmark-ips

Provides iteration per second benchmarking for Ruby
MIT License
1.72k stars 97 forks

Compile a #call_times method for each Job::Entry instance #95

Closed eregon closed 4 years ago

eregon commented 4 years ago

While running this fairly simple benchmark on MRI and TruffleRuby I noticed 2 issues.

# frozen_string_literal: true
require 'benchmark/ips'

module ActiveSupport
  class StringInquirer < String
    def method_missing(method_name, *arguments)
      if method_name[-1] == "?"
        self == method_name[0..-2]
      else
        super
      end
    end
  end
end

def env
  @_env ||= ActiveSupport::StringInquirer.new("development")
end

Benchmark.ips do |x|
  x.iterations = 2

  x.report("env.development?") do
    env.development?
  end

  x.report("env == 'development'") do
    env == 'development'
  end
end

This PR addresses the first issue; the second is #96. The first issue, which matters more on Ruby implementations that can inline Ruby methods (such as TruffleRuby and JRuby), is that the call to @action.call inside call_times (the innermost loop of benchmark-ips) becomes polymorphic: https://github.com/evanphx/benchmark-ips/blob/0bb23ea1f5d8f65416629505889f6dfc550fa4a6/lib/benchmark/ips/job/entry.rb#L49-L55

It is important for performance to compile that loop and let the JIT know exactly which Proc will be called inside the loop. However, with the current code, all Procs given to Benchmark::IPS::Job#report will be observed for that act.call call site during warmup. And then for run_benchmark the inline cache will have as many entries as blocks given to report, which results in something like this:

    i = 0
    while i < times
      # act.call is compiled to something like:
      if act.equal? FIRST_PROC
        FIRST_PROC.call
      elsif act.equal? SECOND_PROC
        SECOND_PROC.call
      # ... one branch per Proc ever passed to #report
      end
      i += 1
    end

This is not good for performance: those checks for the Proc make the loop body much larger, and the JIT will be less willing to inline any of these Proc#call calls, since there is not just one but many of them and it's unclear which one to prioritize (which might also cause unfair inlining across the different blocks).

What we want instead is a copy of the loop for each block to measure, which we can easily achieve by defining a method on the singleton class of each Job::Entry, which was already done for report(String) but not other cases. That way we end up with:

    i = 0
    while i < times
      if act.equal? PROC_OF_THAT_ENTRY
        PROC_OF_THAT_ENTRY.call # very likely to be inlined, which is critical:
        # it avoids a non-inlined call in the innermost loop and lets the
        # loop be optimized together with the Proc
      else
        # deoptimize
      end
      i += 1
    end
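
A minimal sketch of this approach (the class and method names here are illustrative, loosely modeled on Benchmark::IPS::Job::Entry, not the gem's exact internals): each entry compiles its own copy of call_times onto its singleton class, so each loop's act.call site only ever observes a single Proc.

```ruby
# Illustrative sketch: per-entry compilation of the inner loop.
class Entry
  def initialize(label, action)
    @label = label
    @action = action
    compile!
  end

  private

  # Define call_times on this instance's singleton class, so every
  # Entry gets its own copy of the loop and its own monomorphic
  # `act.call` site.
  def compile!
    singleton_class.class_eval <<~RUBY, __FILE__, __LINE__ + 1
      def call_times(times)
        act = @action
        i = 0
        while i < times
          act.call  # only this entry's Proc is ever seen here
          i += 1
        end
      end
    RUBY
  end
end
```

Because call_times lives on each instance's singleton class, two entries never share the compiled loop body: `e1.method(:call_times).owner` and `e2.method(:call_times).owner` are distinct singleton classes.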

So in practice, it means running this benchmark on TruffleRuby 19.3.0 without this change gives:

$ ruby -v                                                
truffleruby 19.3.0, like ruby 2.6.2, GraalVM CE Native [x86_64-linux]
$ ruby -I/home/eregon/code/benchmark-ips/lib bench_dev.rb
Warming up --------------------------------------
    env.development?   739.502k i/100ms
env == 'development'     1.756M i/100ms
    env.development?   814.826k i/100ms
env == 'development'     2.161M i/100ms
Calculating -------------------------------------
    env.development?     13.419M (±25.2%) i/s -     61.927M in   5.036341s
env == 'development'     74.301M (± 5.5%) i/s -    371.632M in   5.016977s
    env.development?     14.518M (±24.0%) i/s -     67.631M in   5.025972s
env == 'development'     76.262M (± 4.0%) i/s -    382.435M in   5.022961s

And with this change:

Warming up --------------------------------------
    env.development?   737.218k i/100ms
env == 'development'     1.900M i/100ms
    env.development?   945.329k i/100ms
env == 'development'     2.203M i/100ms
Calculating -------------------------------------
    env.development?     14.695M (±23.7%) i/s -     68.064M in   5.049277s
env == 'development'      1.219B (± 5.9%) i/s -      6.007B in   4.995364s
    env.development?     14.532M (±25.5%) i/s -     66.173M in   5.010910s
env == 'development'      1.243B (± 2.9%) i/s -      6.206B in   4.997811s

We see that the reported performance of env == 'development' changes radically with this PR. In fact, env == 'development' is well optimized, as it should be and as it is in real code, and the PR lets that happen; before, benchmark-ips itself would prevent it from running at full speed due to internal polymorphism.

cc @chrisseaton

eregon commented 4 years ago

I pushed an extra commit to avoid caching of class_eval, which some interpreter might do (TruffleRuby currently caches eval but not class_eval).

This is ready for review.

kbrock commented 4 years ago

I wonder if it would change your numbers if you had 'development' as a constant.

DEV = 'development'.freeze

x.report("env == 'development' const") do
  env == DEV
end

But since that call may be optimized away, it may be moot.

nice submission

eregon commented 4 years ago

@kbrock The file uses # frozen_string_literal: true so it's the same as a constant.
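
This can be checked directly in CRuby: frozen string literals (whether via the magic comment or a literal `.freeze`) are deduplicated, so the comparison already behaves like comparing against a constant. A small demonstration:

```ruby
# CRuby interns frozen string literals: every `"development".freeze`
# (and every literal under `# frozen_string_literal: true`) resolves
# to the same frozen object, so no per-call allocation occurs.
a = "development".freeze
b = "development".freeze

p a.frozen?   # => true
p a.equal?(b) # => true: one shared object, just like a constant

DEV = "development".freeze
p DEV.equal?(a) # => true
```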

headius commented 4 years ago

Shouldn't the benchmark harness explicitly try to avoid inlining the blocks it calls? If the block inlines and does nothing different from iteration to iteration, it may be optimized away...and then the benchmark measures nothing. I'm not sure I agree that the benchmark harness should itself be hacked to optimize away, since that basically would defeat the measurement of what's happening within the block.

Put differently: the purpose of this tool is to benchmark what's happening within the block. If the block invocations can optimize across calls, then it's no longer benchmarking what's in the block...it's benchmarking the block within this particular call pattern.

I'm all for reducing the overhead of the IPS loops, and ideally trying to get them to optimize as well as can be done...but without optimizing the loop AND the benchmark bodies together, since that breaks the tool.

headius commented 4 years ago

Note that the blocks in IPS benchmarks also can take an iteration count you can use to localize the loop yourself. This achieves your goal of optimizing the entire loop + benchmark body away, if that's really what you want to measure.

Bottom line for me is whether this tool is for benchmarking "how fast can you run a loop that calls this body of code" or "how fast can you run this body of code". The former might reduce down to a loop that does nothing, as in this case. That doesn't tell anyone anything about the performance of that body of code in isolation.
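
The |times| form mentioned above can be sketched without the gem itself. This stand-in mimics how the harness distinguishes the two report forms by block arity (the `run` helper is invented for illustration, not the gem's API):

```ruby
# Simplified stand-in for the harness's inner loop. With no block
# parameter the harness dispatches once per iteration; with |times|
# the block runs its own loop, so block dispatch happens only once.
def run(times, &work)
  if work.arity == 1
    work.call(times)       # report { |times| ... } form
  else
    i = 0
    while i < times
      work.call            # report { ... } form: per-iteration dispatch
      i += 1
    end
  end
end

calls = 0
run(1_000) { calls += 1 }          # calls == 1000 either way,

inner = 0
run(1_000) do |times|              # but here the loop lives in the block,
  i = 0                            # so the harness pays one dispatch total
  while i < times
    inner += 1
    i += 1
  end
end
```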

chrisseaton commented 4 years ago

It's tricky, isn't it.

Maybe what we need is a more systematic way to avoid benchmarks optimising away, rather than taking advantage of the fact that something happens to not be optimised deeply today, because it might be optimised more deeply tomorrow, whether we help it or not. Perhaps a library of opaque sources and sinks. But then optimising compilers can for example defeat an opaque source by profiling that the value is in practice constant even when it comes from what appears to be a dynamic source.

For example, even if a block is not inlined as you're suggesting, two specialisations of it could still be compiled from the caller - one where it is requested to produce a value, which is used for the last iteration, and one for the iterations where the value isn't used and so it isn't requested to produce a value.

Or even a system call that always returns a constant value may be constant folded (the system call will still be performed, and there will be a guard but the return value edge will be made constant and subsequent computation that uses it will be optimised away.)

I'm not sure how you can ever win when all this is tractable and some of it already happens in practice!

The only real answer is probably that people must manually inspect their benchmarks to ensure that they are not being optimised away more than they wanted. But who's got the expertise to be doing that?

Or perhaps all benchmarks should use an out-of-process random number somehow, and should write to an out-of-process file handle somehow. They then can end-up dominated by IO though.

That's not really an argument in either direction in this debate - it's just a comment that the problem is probably deeper than whether one block is inlined or not.

enebo commented 4 years ago

My 2cents...

8fcafdc matches the intent of what benchmark-ips advertises. Each measure {} or entry of work should not share the same physical code and turn into something which partially measures polymorphic dispatch.

750a660 seems like you are working around a design decision. I say you may have to reconsider sharing of evals internally.

headius commented 4 years ago

@enebo My point is that the loop should not be part of the benchmark, and this patch makes it easier for the loop to be optimized as part of the benchmark. Yes, the loop time should not be included in results. By the same token, the overhead of dispatching to the body of code should not be included in measurements. But by making it easier to inline the benchmark loop plus the dispatch plus the benchmark body, this patch changes what ips is actually measuring. It's not measuring how many times this body of code can run in a second; it's measuring how well a statically optimizable loop calling this body of code can run.

@chrisseaton It is tricky indeed. My concern with this patch is that we're making it even easier for IPS to report totally useless results. In implementations that aggressively inline blocks, many small benchmarks will be seen as static code within a loop that accomplishes nothing. At that point, the benchmark harness is simply broken, because it's allowing the results of one iteration to affect subsequent iterations.

What we actually want here is to improve the scope of what's being measured...not allow the harness itself to optimize away.

  1. The cost of running the loop should be excluded.
  2. The cost of dispatching to the block should be excluded.
enebo commented 4 years ago

In the more generic discussion of optimizing the whole bench away... Is there a way to:

  1. print it out as invalid (so time minimal work and time actual work and if close point out it is not a relevant result)
  2. put some effect into the code which guarantees the work cannot completely optimize away

Personally, I like the notion of the former. If you run something which is basically 0 time you are doing it wrong (as a regular programmer anyways).
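
The "print it out as invalid" idea could look something like this sketch (the helper name and the 2x threshold are invented for illustration; real detection would need far more care):

```ruby
require 'benchmark'

# Time an empty loop as a baseline, then flag a measurement as suspect
# when the work is within noise of doing nothing at all. The 2x
# threshold is an arbitrary illustrative choice.
def likely_optimized_away?(iterations, threshold: 2.0, &work)
  baseline = Benchmark.realtime { iterations.times { } }
  measured = Benchmark.realtime { iterations.times(&work) }
  measured < baseline * threshold
end

p likely_optimized_away?(1_000_000) { }            # => true (empty work measures like the baseline)
p likely_optimized_away?(1_000_000) { "x" * 50 }   # likely false: real allocation per iteration
```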

chrisseaton commented 4 years ago

print it out as invalid (so time minimal work and time actual work and if close point out it is not relevant result)

I think it's possible - TruffleRuby is able to tell you that an expression has been constant folded. That could be exposed as an API or something?

put some effect into the code which guarantees the work cannot completely optimize away

I don't think so - because you add that effect but then later optimisations remove it again.

For example:

Can Truffle constant fold yet?
> let $zero = File.open('/dev/zero')
> $zero.readbyte.zero?
Yes! Truffle can constant fold this to true

So yes it can tell you it's a constant. No, you can't then stop it doing that without turning off most of the compiler.

enebo commented 4 years ago

@chrisseaton I was thinking of getting current time but I am not suggesting that...can TR really work around something like that?

chrisseaton commented 4 years ago

How would you pick a time value threshold though? Would be very hard. Maybe instruction counts if you know how many instructions your loop takes? But I think in practice things are usually partially optimised away and guards remain.

Yes that's a real interactive session - a counterexample:

> let $random = File.open('/dev/random')
> $random.readbyte.zero?
No :( Truffle can't constant fold that

headius commented 4 years ago

print it out as invalid

+1 this would be useful. Clearly if the IPS gets into the billions for a ten-second run, the benchmark is no longer accurately reflecting anything. At that point, either the benchmark and harness have been optimized away, or the loop + dispatch dominates.

I mean really...the TR "after" results above are 6B iterations in 5 seconds, or under 1ns per iteration. That's roughly running an iteration per CPU cycle. Is it meaningful if your goal is to benchmark String#== versus method_missing as in this case?

headius commented 4 years ago

@chrisseaton I am confused about your examples, and how they're relevant here. Surely TR can't prove that /dev/zero will always produce 0, so I'm confused what is being folded here.

chrisseaton commented 4 years ago

Surely TR can't prove that /dev/zero will always produce 0

It doesn't prove it - it speculates it. What I mean is if you try to stop your benchmark optimising away by using a dynamic value from a system call in the benchmark, TruffleRuby will make the system call, but will then optimise the rest of the benchmark with the value as a constant. So you'll get the benchmark optimised away and then just the system call remaining.

If you're benchmarking + and you do zero.readbyte + zero.readbyte you'll get the constant value 0 and you won't be benchmarking + any more because + will have been optimised away.
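
The concrete behavior being speculated on, for reference (assuming a Unix-like system where /dev/zero exists):

```ruby
# /dev/zero always yields zero bytes; a speculating JIT can profile
# this and constant-fold everything downstream of the read, leaving
# only the syscall and a guard on its result.
zero = File.open("/dev/zero", "rb")
p zero.readbyte                   # => 0
p zero.readbyte + zero.readbyte   # => 0 (the `+` a JIT could fold away)
zero.close
```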

headius commented 4 years ago

@chrisseaton I think I get your point about dynamic optimizations, but the readbyte call can't be eliminated; the IO stream may eventually return non-zero, so even if the zero? call speculates that it's being called on 0, the read must still happen and something must validate that it returned zero. That's where my confusion came from...there are clearly non-constant parts of that line of code that can't be eliminated.

Your point about + is clearer to me; since it aggressively speculates that those calls will continue returning 0, it can eliminate the actual + logic.

I'm not sure exactly how it's relevant here, but perhaps introducing some noise into the harness would help keep it from optimizing away.

chrisseaton commented 4 years ago

I'm not sure exactly how it's relevant here

Sorry yes it's not directly relevant to this issue - I was saying why we can't forget about this issue and instead easily add some kind of primitives to force values to be dynamic and always used, so they're always run even if in a loop.

pitr-ch commented 4 years ago

Let me share my point of view. I hope it helps to make things a bit clearer, not more complicated.

Without this PR the benchmark results are dependent on the number of code blocks measured since the more measurements are added the more polymorphic the compiled code is. I think that is a clear reason for this PR to be merged, since we always want a measurement to be independent of the other measurements in the benchmark (or in this case their number).

Without this PR, the user could get a radically different result, where env == 'development' is optimised away, if it was measured independently. A benchmarking harness should not report different numbers for independent measurements.

The previous behaviour was sometimes non-deterministically preventing inlining. However, there were no intentional steps taken to prevent the inlining. The user could always get the results for fully inlined blocks by running them separately, which is just confusing. As I see it this PR does not change anything in the aspect of inlining but fixes the non-determinism.

In this particular example the env == 'development' optimised away, in other cases it would just be faster thanks to different level of inlining.

The point @headius is making is relevant; however, I think it should be looked at separately.

Note: This does not favour TruffleRuby, you could already get the better numbers by running the measured blocks separately. This fixes the issue in the harness, which so far manifests only on TruffleRuby because of the inline cache going polymorphic. That is a common optimisation technique, nothing special from TruffleRuby. Other Ruby implementations just do not have it yet.

eregon commented 4 years ago

I think @pitr-ch summarizes this well. As the description of this PR explains and demonstrates, currently benchmark-ips reports inaccurate results (by more than 1 order of magnitude here) depending on whether there is one report block (all fine) or multiple (the problem) when running on a Ruby implementation which supports inlining block calls. This PR solves that, in a straightforward way.

The problem is that, without this PR, multiple report blocks cause polymorphism where the block is called, which makes report blocks affect each other. Of course, report blocks should be as independent as possible, and that is already the case when using report(String) or report { |times| i=0; while i<times; ... }, but not for report { ... }. So let's fix that. In real code, especially when using x.compare!, only one of the approaches would be used. So the benchmark harness should reflect that performance, not a mix of both blocks in the innermost loop.

Polymorphism there is actually problematic for other reasons too such as if there is a polymorphic inline cache for block calls (TruffleRuby has one), it might cause some blocks to be inlined and some blocks to not be, which would be very unfair to the report blocks not inlined.

There are two ways to solve this.

  1. This PR makes all block calls monomorphic, so report { ... } becomes like the 2 other report forms. Consistent and simple.
  2. Another way would be to make all block calls megamorphic, but that is:
    • Impossible for the other 2 report forms
    • Inconsistent with the other 2 report forms which can only be monomorphic
    • Would add significant overhead in the innermost loop, preventing fast blocks from being measured accurately.

So I don't think that second approach is reasonable.

There might be other ways by adding some intrinsic in all Ruby implementations to prevent inlining but that is out of scope of this PR. It would still have the drawbacks mentioned above (inconsistent with other report forms), just maybe slightly lower overhead.

I'm sure there are many other possible improvements to benchmark-ips. They are IMHO out of scope of this PR. If someone wants to provide an alternative solution they should feel free to open another PR, and we can always refine later, obviously.

But for now, I think this is the correct and straightforward solution to the problem I described. @evanphx Could you review this PR?

eregon commented 4 years ago

@enebo wrote:

8fcafdc matches the intent of what benchmark-ips advertises. Each measure {} or entry of work should not share the same physical code and turn into something which partially measures polymorphic dispatch.

Completely agreed.

750a660 seems like you are working around a design decision. I say you may have to reconsider sharing of evals internally.

It's to be sure there is a copy per report block. TruffleRuby inline caches eval (if it's the same (==) code String) but currently not class_eval. However, other implementations, or TruffleRuby in the future, might inline cache class_eval, so this is a way to make the code future-proof. If @evanphx objects to this I can remove that commit. But it seems safer to leave it in to ensure we get a copy per report block, no matter what gets cached.
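
The idea behind that commit can be sketched as follows (class name and structure are hypothetical; the actual commit differs): make each evaluated string textually unique, so even an implementation that caches same-string evals compiles a fresh copy per entry.

```ruby
# Hypothetical illustration: embedding a per-object tag makes each
# class_eval source string unique, defeating any same-string eval
# cache and guaranteeing one compiled copy of call_times per entry.
class UniqueEvalEntry
  def initialize(action)
    @action = action
    singleton_class.class_eval <<~RUBY, __FILE__, __LINE__ + 1
      # unique per entry: #{object_id}
      def call_times(times)
        i = 0
        while i < times
          @action.call
          i += 1
        end
      end
    RUBY
  end
end
```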

headius commented 4 years ago

I don't have any standing objections to making this change.

My main concern is that it will now be easier for some implementations to essentially optimize the entire benchmark away, giving people a skewed view of performance across implementations. This aggravates that problem by making it happen more easily and more consistently on implementations that can inline blocks.

In response to this work, I've started making it possible for JRuby to inline blocks, so we're edging toward having the same problem.

So the problem is a general one not directly related to 8fcafdc. I will concede it is at least more consistent, even if that consistency means there's a better chance for benchmarks to report useless results.

However I agree with @enebo that 750a660 is very much a hack around the way TruffleRuby optimizes evals.

In JRuby, we have experimented with caching evals in the past, and never put it into production for exactly this reason. Purely static evals are almost nonexistent in real-world code. Nearly all evals have different string content, and those few that use the exact same string content are evaluated in different contexts, making them dynamic in other ways (such as here, where different call targets are encountered).

I believe 750a660 should be pulled out and discussed separately.

chrisseaton commented 4 years ago

Purely static evals are almost nonexistent in real-world code.

Isn't this how ERB works by default? Unless by static you mean literally a string literal as the argument? The value of the parameter @src is in practice static enough to treat it as static in my experience.

https://github.com/ruby/ruby/blob/b0bf654c3164e9c6b5b38aa6fe23bd76d28a38d2/lib/erb.rb#L905

headius commented 4 years ago

@chrisseaton The ERB mode that evaluates the string every time is almost never used by any actual production code for obvious reasons. Instead, that code is usually evaluated into a method body and the method is called.

By "static" eval, I mean an eval that will produce the same result every time. That would require not only executing the same string but evaluating it in the same (or very similar) scope/frame/context with the same values. That case is fairly limited, perhaps only happening in metaprogramming cases that ultimately are only run once anyway.

It's certainly TruffleRuby's prerogative to aggressively cache same-string evals. It's a separate debate as to whether it's actually useful in real-world code. But I don't think libraries like benchmark-ips should include hacks like 750a660 just to work around deficiencies in the way TruffleRuby caches and optimizes those evals.

headius commented 4 years ago

As a counterpoint: the way ERB users avoid re-evaluating the same ERB string is to evaluate that content into a method body. I'd be more comfortable modifying the logic in benchmark-ips to do the same thing, rather than adding otherwise useless code to the eval just to trick TruffleRuby out of caching it.

eregon commented 4 years ago

@headius as I wrote above, TruffleRuby does not cache class_eval. But it's any Ruby implementation's right to cache whatever they like. I'd like to make this code future-proof.

As said before, I'll let @evanphx review that part. Please stop digressing in unrelated discussions in this PR.

headius commented 4 years ago

But it's any Ruby implementation's right to cache whatever they like.

It certainly is. But this modification is included because TruffleRuby has trouble with the polymorphic calls resulting from your cached eval, correct? Other implementations may not cache eval the same way (if they even do so in the future), so this hack only works around a deficiency in TruffleRuby's eval caching.

Please stop digressing in unrelated discussions in this PR.

I was replying to @chrisseaton's comment about ERB.

headius commented 4 years ago

TruffleRuby does not cache class_eval

I did read that, and I never mentioned class_eval.

eregon commented 4 years ago

Since that commit seems contentious and not strictly needed right now for well-known Ruby implementations, I removed it. The first commit is far more important.

In the future, if some implementation ever caches class_eval, we should probably include it again. The commit is saved in https://github.com/eregon/benchmark-ips/commit/750a660619146394df64d84a69648e34861df29b.

headius commented 4 years ago

@eregon I'd be fine having a discussion about eregon@750a660 in a separate PR.

It is indeed true that non-TruffleRuby impls may eventually choose to cache various forms of eval, and so there may be a valid need for an eval-cache-defeating tweak. That also seems to fit into the more general discussion about how to make benchmark-ips more accurate across all Ruby impls, some of which can actually optimize benchmark-ips's code away.

I just wanted to move that discussion out of this PR so that the original change could be merged. Thank you for pulling it out.

headius commented 4 years ago

After some offline discussion with @eregon I understand now that the eval change actually doesn't help any implementation, since it's already using class_eval and no implementation (including TruffleRuby) caches class_eval stuff currently.

So that reinforces my belief that it's best to pull it out of this PR, which clearly does impact current implementations and which I agree is at least good for consistency in the tool.

eregon commented 4 years ago

Rebased on master to run the CI.

nateberkopec commented 4 years ago

@eregon It looks like you pulled https://github.com/evanphx/benchmark-ips/commit/750a660619146394df64d84a69648e34861df29b from this PR, right?

eregon commented 4 years ago

Yes, I removed it as it was contentious and not needed yet for MRI/JRuby/TruffleRuby.