Optimization targets for `optcarrot`

maximecb commented 1 year ago

I think it could be fun/interesting to work on making optcarrot run a little faster, so I did some profiling of optcarrot with Kokubun's dynamic send patch.

Method used:

ruby -I harness-continuous --yjit benchmarks/optcarrot/benchmark.rb
sudo renice -20 -p 25011
sample 25011 30 -f output.prof
filtercalltree output.prof

Samples most present at top of stack:

Sort by top of stack, same collapsed (when >= 5):
        poll  (in libsystem_kernel.dylib)        25475
        rb_yarv_ary_entry_internal  (in ruby)        1030
        rb_ary_rotate  (in ruby)        904
        vm_exec_core  (in ruby)        791
        ary_ensure_room_for_push  (in ruby)        596
        rb_ary_push  (in ruby)        482
        rb_vm_exec  (in ruby)        356
        vm_call_symbol  (in ruby)        295
        ???  (in <unknown binary>)  [0x105d44b04]        266
        rb_vm_opt_send_without_block  (in ruby)        265
        ???  (in <unknown binary>)  [0x105d44cf8]        235
        _setjmp  (in libsystem_platform.dylib)        218
        invoke_block_from_c_bh  (in ruby)        214
        rb_yjit_fix_mod_fix  (in ruby)        203
        ???  (in <unknown binary>)  [0x105d75234]        185
        ???  (in <unknown binary>)  [0x105d78584]        173
        CALLER_SETUP_ARG  (in ruby)        167
        ???  (in <unknown binary>)  [0x105d882e4]        164
        vm_call_iseq_setup  (in ruby)        163
        vm_caller_setup_arg_splat  (in ruby)        163
        rb_ary_splice  (in ruby)        159
        rb_ary_aset  (in ruby)        155

The most surprising thing is that poll is at the top (???). That makes me wonder if the benchmark is doing some kind of I/O (for graphical output?).

A potentially easy optimization target is rb_yjit_fix_mod_fix. I've looked at it before. The logic is tricky, but we could potentially implement a fast path where the divisor is > 0.
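
To make the fast-path idea concrete, here is a rough Ruby model of it. This is a sketch, not the actual C code behind rb_yjit_fix_mod_fix: `floored_mod` is a hypothetical name, and `Integer#remainder` stands in for the truncating `%` the C side would use. When the divisor is positive, the truncated remainder only needs a fix-up if it came out negative.

```ruby
# Hypothetical helper: Ruby-style floored modulo built from a truncated
# remainder (Integer#remainder takes the sign of the dividend, like C's %).
def floored_mod(a, b)
  raise ZeroDivisionError, "divided by 0" if b.zero?

  r = a.remainder(b)
  if b > 0
    # Fast path: positive divisor. Only a negative remainder needs adjusting,
    # and when a >= 0 as well, r is already the final answer.
    r < 0 ? r + b : r
  else
    # General path: negative divisor. The result must be <= 0.
    r > 0 ? r + b : r
  end
end

puts floored_mod(-7, 3)
puts floored_mod(7, -3)
```

The result should agree with `Integer#%` for any non-zero divisor; the point is only that the `b > 0` branch needs a single sign check.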

Besides that, there are a fair number of array operations. We're still calling rb_ary_entry_internal in opt_aref. IIRC Jimmy tried to optimize this but couldn't get it to run any faster in YJIT because the logic is fairly convoluted. There might be a way to simplify things a bit on the CRuby side. We could also try speculating on whether the array is embedded or heap-allocated, using dispatch chains for some sites. The good thing about array optimization targets is that the gains should also translate elsewhere.
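
For anyone wanting to see the embedded vs heap distinction from the Ruby side, a small sketch follows. It assumes CRuby, where `ObjectSpace.dump` reports an `"embedded"` key for arrays stored inline in their object slot. That key is an implementation detail and the size cutoff varies between Ruby versions; `embedded_array?` is a hypothetical helper name.

```ruby
require "objspace"
require "json"

# Hypothetical helper: true when CRuby reports the array's elements as stored
# inline in the object slot rather than in a separate heap buffer.
def embedded_array?(ary)
  JSON.parse(ObjectSpace.dump(ary))["embedded"] == true
end

small = [1, 2, 3]             # short arrays are typically embedded
large = Array.new(10_000, 0)  # far too big to fit in an object slot

puts embedded_array?(small)
puts embedded_array?(large)
```

A dispatch chain specialized on this property would let each site skip the embedded/heap branch inside the array accessor, at the cost of a guard on the array's flags.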

Otherwise, there is not a lot to go on for optcarrot, except that everything we do to optimize the code in general (e.g. faster calls) will also help optcarrot.

maximecb commented 1 year ago

Looking at the disassembly, I see a lot of these for opt_getconstant_path:

  # gen_direct_jmp: fallthrough
  # Block: block in op@/Users/maximecb/src/github.com/Shopify/yjit-bench/benchmarks/optcarrot/lib/optcarrot/cpu.rb:959 
  # reg_temps: 00000001
  # Insn: 0003 opt_getconstant_path (stack_size: 1)
  # reg_temps: 00000001 -> 00000011
  0x111e29033: jmp 0x111e2b04b

@jhawthorn This is presumably because we do code patching for getconstant. If we could directly initialize constants in the JIT and avoid having to do invalidation and code patching, we could generate more linear code, which might be more cache-efficient and perform slightly better?

Seems we could also have a better fast path for opt_mult:

  # Insn: 0122 opt_mult (stack_size: 2)
  # call to Integer#*
  # save PC to CFP
  0x111eb56be: movabs rax, 0x7ff3e804c1e0
  0x111eb56c8: mov qword ptr [r13], rax
  # spill_temps: 00000011 -> 00000000
  0x111eb56cc: mov qword ptr [rbx - 0x18], rsi
  0x111eb56d0: mov qword ptr [rbx - 0x10], rdi
  # save SP to CFP
  0x111eb56d4: lea rbx, [rbx - 8]
  0x111eb56d8: mov qword ptr [r13 + 8], rbx
  # Integer#*
  0x111eb56dc: mov rdi, qword ptr [rbx - 0x10]
  0x111eb56e0: mov rsi, qword ptr [rbx - 8]
  0x111eb56e4: call 0x104ebe7e0
  # reg_temps: 00000000 -> 00000001
  0x111eb56e9: mov rsi, rax
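
As a rough model of what a better opt_mult fast path could look like, here is a sketch in Ruby. It is not YJIT's code: `FIXNUM_MIN`, `FIXNUM_MAX`, `fixnum?` and `mult_with_fast_path` are hypothetical names, and the 62-bit range assumes a 64-bit CRuby build. In generated machine code the range check on the product would be an overflow check after the multiply, with a side exit to the generic `Integer#*` call shown above.

```ruby
# Assumed fixnum range on 64-bit CRuby (values that fit in a tagged word).
FIXNUM_MAX = 2**62 - 1
FIXNUM_MIN = -(2**62)

# Hypothetical guard: is this value an Integer small enough to be a fixnum?
def fixnum?(n)
  n.is_a?(Integer) && n >= FIXNUM_MIN && n <= FIXNUM_MAX
end

# Hypothetical model: take the inline path when both operands and the product
# are fixnums, otherwise fall back to the generic multiply.
def mult_with_fast_path(a, b)
  if fixnum?(a) && fixnum?(b)
    product = a * b
    return [:fast, product] if fixnum?(product)
  end
  [:slow, a * b]
end

p mult_with_fast_path(6, 7)
p mult_with_fast_path(2**40, 2**40)
```

The fast path avoids saving PC and SP and spilling temps, which is most of what the disassembly above is doing around the call.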

Otherwise seeing a lot of RUBY_VM_CHECK_INTS(ec) when calling and returning. If we could avoid doing those for leaf methods or through a clever trick, it could make our code significantly more compact.

maximecb commented 1 year ago

Closing for now since optcarrot is already over 3x faster and likely to improve more when we land frame outlining.

Shopify / ruby

Optimization targets for `optcarrot` #527