Unrolling uint64 encoding

We have to encode 64 bit integers all the time (basically every field header requires it), so I tried unrolling the uint64 encoding logic to see what kind of performance gains we can get.

I tried 3 different ways of unrolling the loop:

Just a naive way translating the while loop to if statements
Same as 1, but eliminating intermediate local variable writes
Rather than appending to the buffer for each byte, use the Array#pack optimization

The code is below:

Click me for benchmark

```ruby # frozen_string_literal: true require "benchmark/ips" def uint64_enc1 x, buff while x != 0 byte = x & 0x7F x >>= 7 byte |= 0x80 if x > 0 buff << byte end end def uint64_enc2 x, buff if (x > 0) byte = (x >> 0) & 0x7F if (x >> 7) > 0 buff << (byte | 0x80) byte = (x >> 7) & 0x7F if (x >> 14) > 0 buff << (byte | 0x80) byte = (x >> 14) & 0x7F if (x >> 21) > 0 buff << (byte | 0x80) byte = (x >> 21) & 0x7F if (x >> 28) > 0 buff << (byte | 0x80) byte = (x >> 28) & 0x7F if (x >> 35) > 0 buff << (byte | 0x80) byte = (x >> 35) & 0x7F if (x >> 42) > 0 buff << (byte | 0x80) byte = (x >> 42) & 0x7F if (x >> 49) > 0 buff << (byte | 0x80) byte = (x >> 49) & 0x7F if (x >> 56) > 0 buff << (byte | 0x80) byte = (x >> 56) & 0x7F if (x >> 63) > 0 buff << (byte | 0x80) buff << 1 else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end end end def uint64_enc3 x, buff if (x > 0) if (x >> 7) > 0 buff << (((x >> 0) & 0x7F) | 0x80) if (x >> 14) > 0 buff << (((x >> 7) & 0x7F) | 0x80) if (x >> 21) > 0 buff << (((x >> 14) & 0x7F) | 0x80) if (x >> 28) > 0 buff << (((x >> 21) & 0x7F) | 0x80) if (x >> 35) > 0 buff << (((x >> 28) & 0x7F) | 0x80) if (x >> 42) > 0 buff << (((x >> 35) & 0x7F) | 0x80) if (x >> 49) > 0 buff << (((x >> 42) & 0x7F) | 0x80) if (x >> 56) > 0 buff << (((x >> 49) & 0x7F) | 0x80) if (x >> 63) > 0 buff << (((x >> 56) & 0x7F) | 0x80) buff << 1 else buff << (x >> 56) end else buff << (x >> 49) end else buff << (x >> 42) end else buff << (x >> 35) end else buff << (x >> 28) end else buff << (x >> 21) end else buff << (x >> 14) end else buff << (x >> 7) end else buff << x end end end def uint64_enc4 x, buff if (x > 0) if (x >> 7) > 0 byte0 = (((x >> 0) & 0x7F) | 0x80) if (x >> 14) > 0 byte1 = (((x >> 7) & 0x7F) | 0x80) if (x >> 21) > 0 byte2 = (((x >> 14) & 0x7F) | 0x80) if (x >> 28) > 0 byte3 = (((x >> 21) & 0x7F) | 0x80) if (x >> 35) > 0 byte4 = (((x >> 28) & 0x7F) | 0x80) if (x >> 42) > 0 byte5 = (((x >> 35) & 0x7F) | 0x80) if (x >> 49) > 0 byte6 = (((x >> 42) & 0x7F) | 0x80) if (x >> 56) > 0 byte7 = (((x >> 49) & 0x7F) | 0x80) if (x >> 63) > 0 byte8 = (((x >> 56) & 0x7F) | 0x80) buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, byte8, 1].pack("CCCCCCCCCC") else buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, (x >> 56)].pack("CCCCCCCCC") end else buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, (x >> 49)].pack("CCCCCCCC") end else buff << [byte0, byte1, byte2, byte3, byte4, byte5, (x >> 42)].pack("CCCCCCC") end else buff << [byte0, byte1, byte2, byte3, byte4, (x >> 35)].pack("CCCCCC") end else buff << [byte0, byte1, byte2, byte3, (x >> 28)].pack("CCCCC") end else buff << [byte0, byte1, byte2, (x >> 21)].pack("CCCC") end else buff << [byte0, byte1, (x >> 14)].pack("CCC") end else buff << [byte0, (x >> 7)].pack("CC") end else buff << [(x >> 0)].pack("C") end end end Benchmark.ips do |x| x.report("uint64_enc1 (loop)") { uint64_enc1(0xFFFF_FFFF, "".b) } x.report("uint64_enc2 (unroll)") { uint64_enc2(0xFFFF_FFFF, "".b) } x.report("uint64_enc3 (unroll better)") { uint64_enc3(0xFFFF_FFFF, "".b) } x.report("uint64_enc4 (Array#pack)") { uint64_enc4(0xFFFF_FFFF, "".b) } end ```

Results are below.

Interpreter:

$ ruby test.rb
ruby 3.4.0dev (2024-05-23T18:29:33Z specialize-array-p.. d32ef53305) [arm64-darwin23]
Warming up --------------------------------------
  uint64_enc1 (loop)   345.396k i/100ms
uint64_enc2 (unroll)   380.889k i/100ms
uint64_enc3 (unroll better)
                       415.911k i/100ms
uint64_enc4 (Array#pack)
                       319.319k i/100ms
Calculating -------------------------------------
  uint64_enc1 (loop)      3.413M (± 2.4%) i/s -     17.270M in   5.063010s
uint64_enc2 (unroll)      3.743M (± 1.2%) i/s -     19.044M in   5.088180s
uint64_enc3 (unroll better)
                          4.119M (± 0.6%) i/s -     20.796M in   5.048463s
uint64_enc4 (Array#pack)
                          3.144M (± 0.7%) i/s -     15.966M in   5.078173s

YJIT:

$ ruby --yjit test.rb
ruby 3.4.0dev (2024-05-23T18:29:33Z specialize-array-p.. d32ef53305) +YJIT [arm64-darwin23]
Warming up --------------------------------------
  uint64_enc1 (loop)     1.067M i/100ms
uint64_enc2 (unroll)     1.095M i/100ms
uint64_enc3 (unroll better)
                         1.089M i/100ms
uint64_enc4 (Array#pack)
                       640.782k i/100ms
Calculating -------------------------------------
  uint64_enc1 (loop)     11.533M (± 0.5%) i/s -     58.679M in   5.087999s
uint64_enc2 (unroll)     11.700M (± 0.6%) i/s -     59.114M in   5.052717s
uint64_enc3 (unroll better)
                         11.760M (± 0.5%) i/s -     59.869M in   5.090842s
uint64_enc4 (Array#pack)
                          6.785M (± 1.4%) i/s -     33.961M in   5.006533s

Unrolling the loop seems to help the interpreter, but doesn't seem to do much for YJIT. I'm a little bit surprised by how slow the Array#pack solution is on YJIT, I expected it to fare better since I merged https://github.com/ruby/ruby/pull/8734

Shopify / protoboeuf

Unrolling uint64 encoding #97