Open tenderlove opened 1 month ago
Might make sense to look at the disasm and make sure that your optimization is getting applied, see what the generated code looks like.
I'm a bit surprised that unrolling doesn't help YJIT at all. Wondering if there is a big source of overhead somewhere else or something?
We have to encode 64 bit integers all the time (basically every field header requires it), so I tried unrolling the uint64 encoding logic to see what kind of performance gains we can get.
I tried 3 different ways of unrolling the loop:
Array#pack
optimizationThe code is below:
Click me for benchmark
```ruby # frozen_string_literal: true require "benchmark/ips" def uint64_enc1 x, buff while x != 0 byte = x & 0x7F x >>= 7 byte |= 0x80 if x > 0 buff << byte end end def uint64_enc2 x, buff if (x > 0) byte = (x >> 0) & 0x7F if (x >> 7) > 0 buff << (byte | 0x80) byte = (x >> 7) & 0x7F if (x >> 14) > 0 buff << (byte | 0x80) byte = (x >> 14) & 0x7F if (x >> 21) > 0 buff << (byte | 0x80) byte = (x >> 21) & 0x7F if (x >> 28) > 0 buff << (byte | 0x80) byte = (x >> 28) & 0x7F if (x >> 35) > 0 buff << (byte | 0x80) byte = (x >> 35) & 0x7F if (x >> 42) > 0 buff << (byte | 0x80) byte = (x >> 42) & 0x7F if (x >> 49) > 0 buff << (byte | 0x80) byte = (x >> 49) & 0x7F if (x >> 56) > 0 buff << (byte | 0x80) byte = (x >> 56) & 0x7F if (x >> 63) > 0 buff << (byte | 0x80) buff << 1 else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end end end def uint64_enc3 x, buff if (x > 0) if (x >> 7) > 0 buff << (((x >> 0) & 0x7F) | 0x80) if (x >> 14) > 0 buff << (((x >> 7) & 0x7F) | 0x80) if (x >> 21) > 0 buff << (((x >> 14) & 0x7F) | 0x80) if (x >> 28) > 0 buff << (((x >> 21) & 0x7F) | 0x80) if (x >> 35) > 0 buff << (((x >> 28) & 0x7F) | 0x80) if (x >> 42) > 0 buff << (((x >> 35) & 0x7F) | 0x80) if (x >> 49) > 0 buff << (((x >> 42) & 0x7F) | 0x80) if (x >> 56) > 0 buff << (((x >> 49) & 0x7F) | 0x80) if (x >> 63) > 0 buff << (((x >> 56) & 0x7F) | 0x80) buff << 1 else buff << (x >> 56) end else buff << (x >> 49) end else buff << (x >> 42) end else buff << (x >> 35) end else buff << (x >> 28) end else buff << (x >> 21) end else buff << (x >> 14) end else buff << (x >> 7) end else buff << x end end end def uint64_enc4 x, buff if (x > 0) if (x >> 7) > 0 byte0 = (((x >> 0) & 0x7F) | 0x80) if (x >> 14) > 0 byte1 = (((x >> 7) & 0x7F) | 0x80) if (x >> 21) > 0 byte2 = (((x >> 14) & 0x7F) | 0x80) if (x >> 28) > 0 byte3 = (((x >> 21) & 0x7F) | 0x80) if (x >> 35) > 0 byte4 = (((x >> 28) & 0x7F) | 0x80) if (x >> 42) > 0 byte5 = (((x >> 35) & 0x7F) | 0x80) if (x >> 49) > 0 byte6 = (((x >> 42) & 0x7F) | 0x80) if (x >> 56) > 0 byte7 = (((x >> 49) & 0x7F) | 0x80) if (x >> 63) > 0 byte8 = (((x >> 56) & 0x7F) | 0x80) buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, byte8, 1].pack("CCCCCCCCCC") else buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, (x >> 56)].pack("CCCCCCCCC") end else buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, (x >> 49)].pack("CCCCCCCC") end else buff << [byte0, byte1, byte2, byte3, byte4, byte5, (x >> 42)].pack("CCCCCCC") end else buff << [byte0, byte1, byte2, byte3, byte4, (x >> 35)].pack("CCCCCC") end else buff << [byte0, byte1, byte2, byte3, (x >> 28)].pack("CCCCC") end else buff << [byte0, byte1, byte2, (x >> 21)].pack("CCCC") end else buff << [byte0, byte1, (x >> 14)].pack("CCC") end else buff << [byte0, (x >> 7)].pack("CC") end else buff << [(x >> 0)].pack("C") end end end Benchmark.ips do |x| x.report("uint64_enc1 (loop)") { uint64_enc1(0xFFFF_FFFF, "".b) } x.report("uint64_enc2 (unroll)") { uint64_enc2(0xFFFF_FFFF, "".b) } x.report("uint64_enc3 (unroll better)") { uint64_enc3(0xFFFF_FFFF, "".b) } x.report("uint64_enc4 (Array#pack)") { uint64_enc4(0xFFFF_FFFF, "".b) } end ```Results are below.
Interpreter:
YJIT:
Unrolling the loop seems to help the interpreter, but doesn't seem to do much for YJIT. I'm a little bit surprised by how slow the
Array#pack
solution is on YJIT, I expected it to fare better since I merged https://github.com/ruby/ruby/pull/8734