Closed maximecb closed 1 year ago
Looking at the disassembly, I see a lot of these for opt_getconstant_path:
# gen_direct_jmp: fallthrough
# Block: block in op@/Users/maximecb/src/github.com/Shopify/yjit-bench/benchmarks/optcarrot/lib/optcarrot/cpu.rb:959
# reg_temps: 00000001
# Insn: 0003 opt_getconstant_path (stack_size: 1)
# reg_temps: 00000001 -> 00000011
0x111e29033: jmp 0x111e2b04b
@jhawthorn This is presumably because we do code patching for getconstant. If we could directly initialize constants in the JIT and avoid having to do invalidation and code patching, we could generate more linear code, which might be more cache-efficient and perform slightly better?
Seems we could also have a better fast path for opt_mult:
# Insn: 0122 opt_mult (stack_size: 2)
# call to Integer#*
# save PC to CFP
0x111eb56be: movabs rax, 0x7ff3e804c1e0
0x111eb56c8: mov qword ptr [r13], rax
# spill_temps: 00000011 -> 00000000
0x111eb56cc: mov qword ptr [rbx - 0x18], rsi
0x111eb56d0: mov qword ptr [rbx - 0x10], rdi
# save SP to CFP
0x111eb56d4: lea rbx, [rbx - 8]
0x111eb56d8: mov qword ptr [r13 + 8], rbx
# Integer#*
0x111eb56dc: mov rdi, qword ptr [rbx - 0x10]
0x111eb56e0: mov rsi, qword ptr [rbx - 8]
0x111eb56e4: call 0x104ebe7e0
# reg_temps: 00000000 -> 00000001
0x111eb56e9: mov rsi, rax
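For illustration, here is a minimal sketch of what an inline fast path for opt_mult could compute instead of calling Integer#*. It assumes CRuby's usual fixnum tagging (value = (n << 1) | 1) and models VALUEs as plain i64; the function names are hypothetical and this is not YJIT's actual codegen, only the arithmetic a fast path would need to emit, with a fallback to the existing call on overflow or on non-fixnum operands.

```rust
/// Tag bit CRuby sets on immediate integers (fixnums): value = (n << 1) | 1.
const FIXNUM_FLAG: i64 = 1;

/// Tag a native integer as a fixnum, or return None if it does not fit in 63 bits.
/// (Hypothetical helper for this sketch.)
fn int_to_fixnum(n: i64) -> Option<i64> {
    n.checked_mul(2).map(|v| v | FIXNUM_FLAG)
}

/// Untag a fixnum with an arithmetic shift (the sign is preserved).
fn fixnum_to_int(v: i64) -> i64 {
    v >> 1
}

/// Multiply two tagged values. Returns None when the slow path
/// (the call to Integer#*) has to be taken instead.
fn fixnum_mul_fast(a: i64, b: i64) -> Option<i64> {
    // Guard: both operands must carry the fixnum tag.
    if a & b & FIXNUM_FLAG == 0 {
        return None;
    }
    let x = a >> 1;          // untagged left operand
    let y = b - FIXNUM_FLAG; // right operand still shifted left by one: 2 * y_n
    // x * (2 * y_n) overflows i64 exactly when x * y_n does not fit in a fixnum.
    x.checked_mul(y).map(|p| p | FIXNUM_FLAG)
}
```

In machine code this would come down to a tag check, a shift, a subtract, an imul and a jump-on-overflow to a side exit, with no need to save PC/SP or spill temps on the common path.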
Otherwise seeing a lot of RUBY_VM_CHECK_INTS(ec) when calling and returning. If we could avoid doing those for leaf methods or through a clever trick, it could make our code significantly more compact.
Closing for now since optcarrot is already over 3x faster and likely to improve more when we land frame outlining.
I think it could be fun/interesting to work on making optcarrot run a little faster, so I did some profiling of optcarrot with Kokubun's dynamic send patch.
Method used:
Samples most present at top of stack:
The most surprising thing is that poll is at the top (???). That makes me wonder if the benchmark is doing some kind of I/O (for graphical output?).
A potentially easy optimization target is rb_yjit_fix_mod_fix. I've looked at it before. The logic is tricky, but we could potentially implement a fast path where the divisor is > 0.
Besides that, there is a fair amount of array operations. We're still calling rb_ary_entry_internal in opt_aref. IIRC Jimmy tried to optimize this but couldn't get it to run any faster in YJIT because the logic is fairly convoluted. There might be a way that we can simplify things a bit on the CRuby side. We could also try speculating that the array is embedded or heap using dispatch chains for some sites. The good thing about array optimization targets is that they should also translate elsewhere.
Otherwise, there is not a lot to go on for optcarrot, except that everything we do to generally optimize the code (e.g. faster calls) will also help optcarrot.
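To make the divisor > 0 idea for rb_yjit_fix_mod_fix concrete, here is a rough sketch on untagged integers. Ruby's Integer#% is a floored modulo (the result takes the sign of the divisor), while the hardware remainder truncates toward zero. With a positive divisor the two differ only when the remainder comes out negative, so a single conditional add is enough. The function name is hypothetical and this is not the existing implementation.

```rust
/// Floored modulo for the case where the divisor is known to be positive.
/// Returns None when the general (slow) path is needed.
fn fix_mod_fast(x: i64, y: i64) -> Option<i64> {
    if y <= 0 {
        // Zero and negative divisors are left to the existing logic.
        return None;
    }
    // Truncated remainder: the sign follows x, and |r| < y.
    // With y > 0 this cannot trap (no division by zero, no MIN % -1).
    let r = x % y;
    // A negative remainder is shifted into [0, y) to match Ruby semantics.
    Some(if r < 0 { r + y } else { r })
}
```

For example, -7 % 3 is 2 in Ruby, whereas the truncated remainder is -1; the fix-up add turns -1 into 2.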