Unblock inlining and reduce interp_return by rewriting C methods in Ruby

k0kubun commented 1 year ago

Background

In CRuby, many methods are written in C. This contributes to the warmup speed and the performance of the interpreter, but this could sometimes be a problem for YJIT:

Having a C frame in between Ruby frames prevents inlining (apply callsite-specific optimization across different methods, and possibly skip pushing/popping a CFP in inline code)
Not only it incurs a setjmp cost on vm_exec for every Ruby call from C, but it also causes interp_return exits on leave insn.

Idea

Rewrite C methods that call Ruby methods/blocks frequently in Ruby if that makes YJIT faster and doesn't slow down the interpreter too much in headline benchmarks. Past successful examples: https://github.com/ruby/ruby/pull/3281, https://github.com/ruby/ruby/pull/6983

Ones with large "block calls from C" in https://github.com/Shopify/yjit-bench/pull/168, e.g. Array#each, might be promising targets.

k0kubun commented 1 year ago

Array#each

When I started working on this, rewriting Array#each resulted in increasing the number of side exits and also breaking "special variables". The following PRs are related to those issues:

Open PRs:
- Rewrite Array#each in Ruby
- YJIT: Consider a block ISEQ in block versioning
Closed PRs

The Ruby PR currently uses succ for better interpreter performance (and as a workaround for an Integer#+ override test). But since we don't have a cfunc codegen for succ yet, I'm benchmarking it with i = i.succ rewritten in i += 1.

microbenchmark

# --yjit-call-threshold=1
def bench(arr)
  i = 0
  arr.each do |a|
    i += a
  end
  i
end

bench([1, 1]) # JIT
Time.now # JIT

arr = Array.new(50000000)
arr.map! { 1 }

t = Time.new
bench(arr)
p(Time.new - t)

before: 1.211382153
after: 0.625836132

before/after is 1.94.

railsbench

master: ruby 3.3.0dev (2023-02-13T16:42:58Z master 86de48e9f6) +YJIT [x86_64-linux]
after: ruby 3.3.0dev (2023-02-13T21:49:12Z builtin-array-each 224c1f2f17) +YJIT [x86_64-linux]

----------  -----------  ----------  ----------  ----------  ------------  -------------
bench       master (ms)  stddev (%)  after (ms)  stddev (%)  master/after  after 1st itr
railsbench  1042.2       1.8         1046.9      1.7         1.00          0.99
----------  -----------  ----------  ----------  ----------  ------------  -------------

--yjit-stats before

master (https://github.com/shopify/ruby/commit/86de48e9f6)

invokeblock exit reasons:
      proc:      1,112 (59.4%)
    symbol:        759 (40.6%)

leave exit reasons:
        interp_return: 39,982,816 (98.0%)
    start_pc_non_zero:    822,565 ( 2.0%)
         se_interrupt:        400 ( 0.0%)

side_exit_count:        9,811,317
total_exit_count:      49,794,133
ratio_in_yjit:              93.1%

--yjit-stats after

Rewrite Array#each in Ruby + YJIT: Consider a block ISEQ in block versioning (https://github.com/Shopify/ruby/commit/224c1f2f17 w/ i += 1 modification)

invokeblock exit reasons:
    iseq_block_changed:    152,448 (98.7%)
                  proc:      1,264 ( 0.8%)
                symbol:        759 ( 0.5%)

leave exit reasons:
        interp_return: 38,833,171 (97.9%)
    start_pc_non_zero:    822,559 ( 2.1%)
         se_interrupt:        387 ( 0.0%)

side_exit_count:        9,914,001
total_exit_count:      48,747,172
ratio_in_yjit:              92.7%

stats without "YJIT: Consider a block ISEQ in block versioning"

``` invokeblock exit reasons: iseq_block_changed: 2,158,983 (99.9%) proc: 1,264 ( 0.1%) symbol: 759 ( 0.0%) leave exit reasons: interp_return: 40,690,343 (98.0%) start_pc_non_zero: 822,559 ( 2.0%) se_interrupt: 424 ( 0.0%) side_exit_count: 11,737,658 total_exit_count: 52,428,001 ratio_in_yjit: 91.2% ```

If you compare "--yjit-stats after" and "stats without YJIT: Consider a block ISEQ in block versioning", you can see https://github.com/ruby/ruby/pull/7247 contributes to significantly reducing iseq_block_changed side exits. However, it still leaves some iseq_block_changed side exits.

This reduces the amount of interp_return by 3%. Despite iseq_block_changed side exits, total_exit_count is reduced by 2% overall.

While we could still work on reducing iseq_block_changed further, it already seems like it's not going to make a significant difference in benchmarks like railsbench. The Ruby rewrite might not be useful enough until we have more optimizations unblocked by inlining, especially when liquid-render slows down on the interpreter (https://github.com/ruby/ruby/pull/6687).

cc: @maximecb

maximecb commented 1 year ago

Just to make sure I understand: you added a block iseq parameter to the context. This keeps track of the ISEQ of the block parameter to the function (if known). What is the iseq_block_changed exit? Is this related to methods being called with a different block than expected?

With respect to versioning on the input block ISEQ, I feel like maybe this isn't ideal. It seems like we're pushing versioning information from the caller into the callee in order to specialize it, but functions like Array#each have potentially a lot of callers, and so we have to special-case the versioning logic to handle that.

Ultimately, our goal is to inline Array#each into the caller. That would mean bringing blocks from the callee into the caller, which would only affect the version limit of the caller ISEQ. We don't have the capability to do real inlining yet, but maybe we can implement the logic to do pseudo-inlining, where we specialize the callee inside the caller, but we still push/pop frames?

k0kubun commented 1 year ago

you added a block iseq parameter to the context. This keeps track of the ISEQ of the block parameter to the function (if known). What is the iseq_block_changed exit? Is this related to methods being called with a different block than expected?

Correct. I haven't investigated when iseq_block_changed exits are taken with the current patch, but I suspect block_iseq is not set in those cases for some reason.

That would mean bringing blocks from the callee into the caller, which would only affect the version limit of the caller ISEQ.

I agree that it's better to associate the versioning limit to the caller of Array#each. I'll keep it in mind when I work on implementing a versioning limit to it.

Shopify / ruby