Shopify / protoboeuf

Experimenting with a protobuf implementation
MIT License
23 stars 2 forks source link

Unrolling uint64 encoding #97

Open tenderlove opened 1 month ago

tenderlove commented 1 month ago

We have to encode 64 bit integers all the time (basically every field header requires it), so I tried unrolling the uint64 encoding logic to see what kind of performance gains we can get.

I tried 3 different ways of unrolling the loop:

  1. Just a naive way translating the while loop to if statements
  2. Same as 1, but eliminating intermediate local variable writes
  3. Rather than appending to the buffer for each byte, use the Array#pack optimization

The code is below:

Click me for benchmark ```ruby # frozen_string_literal: true require "benchmark/ips" def uint64_enc1 x, buff while x != 0 byte = x & 0x7F x >>= 7 byte |= 0x80 if x > 0 buff << byte end end def uint64_enc2 x, buff if (x > 0) byte = (x >> 0) & 0x7F if (x >> 7) > 0 buff << (byte | 0x80) byte = (x >> 7) & 0x7F if (x >> 14) > 0 buff << (byte | 0x80) byte = (x >> 14) & 0x7F if (x >> 21) > 0 buff << (byte | 0x80) byte = (x >> 21) & 0x7F if (x >> 28) > 0 buff << (byte | 0x80) byte = (x >> 28) & 0x7F if (x >> 35) > 0 buff << (byte | 0x80) byte = (x >> 35) & 0x7F if (x >> 42) > 0 buff << (byte | 0x80) byte = (x >> 42) & 0x7F if (x >> 49) > 0 buff << (byte | 0x80) byte = (x >> 49) & 0x7F if (x >> 56) > 0 buff << (byte | 0x80) byte = (x >> 56) & 0x7F if (x >> 63) > 0 buff << (byte | 0x80) buff << 1 else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end else buff << byte end end end def uint64_enc3 x, buff if (x > 0) if (x >> 7) > 0 buff << (((x >> 0) & 0x7F) | 0x80) if (x >> 14) > 0 buff << (((x >> 7) & 0x7F) | 0x80) if (x >> 21) > 0 buff << (((x >> 14) & 0x7F) | 0x80) if (x >> 28) > 0 buff << (((x >> 21) & 0x7F) | 0x80) if (x >> 35) > 0 buff << (((x >> 28) & 0x7F) | 0x80) if (x >> 42) > 0 buff << (((x >> 35) & 0x7F) | 0x80) if (x >> 49) > 0 buff << (((x >> 42) & 0x7F) | 0x80) if (x >> 56) > 0 buff << (((x >> 49) & 0x7F) | 0x80) if (x >> 63) > 0 buff << (((x >> 56) & 0x7F) | 0x80) buff << 1 else buff << (x >> 56) end else buff << (x >> 49) end else buff << (x >> 42) end else buff << (x >> 35) end else buff << (x >> 28) end else buff << (x >> 21) end else buff << (x >> 14) end else buff << (x >> 7) end else buff << x end end end def uint64_enc4 x, buff if (x > 0) if (x >> 7) > 0 byte0 = (((x >> 0) & 0x7F) | 0x80) if (x >> 14) > 0 byte1 = (((x >> 7) & 0x7F) | 0x80) if (x >> 21) > 0 byte2 = (((x >> 14) & 0x7F) | 0x80) if (x >> 28) > 0 byte3 = (((x >> 21) & 0x7F) | 0x80) if (x >> 35) > 0 byte4 = (((x >> 28) & 0x7F) | 0x80) if (x >> 42) > 0 byte5 = (((x >> 35) & 0x7F) | 0x80) if (x >> 49) > 0 byte6 = (((x >> 42) & 0x7F) | 0x80) if (x >> 56) > 0 byte7 = (((x >> 49) & 0x7F) | 0x80) if (x >> 63) > 0 byte8 = (((x >> 56) & 0x7F) | 0x80) buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, byte8, 1].pack("CCCCCCCCCC") else buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, (x >> 56)].pack("CCCCCCCCC") end else buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, (x >> 49)].pack("CCCCCCCC") end else buff << [byte0, byte1, byte2, byte3, byte4, byte5, (x >> 42)].pack("CCCCCCC") end else buff << [byte0, byte1, byte2, byte3, byte4, (x >> 35)].pack("CCCCCC") end else buff << [byte0, byte1, byte2, byte3, (x >> 28)].pack("CCCCC") end else buff << [byte0, byte1, byte2, (x >> 21)].pack("CCCC") end else buff << [byte0, byte1, (x >> 14)].pack("CCC") end else buff << [byte0, (x >> 7)].pack("CC") end else buff << [(x >> 0)].pack("C") end end end Benchmark.ips do |x| x.report("uint64_enc1 (loop)") { uint64_enc1(0xFFFF_FFFF, "".b) } x.report("uint64_enc2 (unroll)") { uint64_enc2(0xFFFF_FFFF, "".b) } x.report("uint64_enc3 (unroll better)") { uint64_enc3(0xFFFF_FFFF, "".b) } x.report("uint64_enc4 (Array#pack)") { uint64_enc4(0xFFFF_FFFF, "".b) } end ```

Results are below.

Interpreter:

$ ruby test.rb
ruby 3.4.0dev (2024-05-23T18:29:33Z specialize-array-p.. d32ef53305) [arm64-darwin23]
Warming up --------------------------------------
  uint64_enc1 (loop)   345.396k i/100ms
uint64_enc2 (unroll)   380.889k i/100ms
uint64_enc3 (unroll better)
                       415.911k i/100ms
uint64_enc4 (Array#pack)
                       319.319k i/100ms
Calculating -------------------------------------
  uint64_enc1 (loop)      3.413M (± 2.4%) i/s -     17.270M in   5.063010s
uint64_enc2 (unroll)      3.743M (± 1.2%) i/s -     19.044M in   5.088180s
uint64_enc3 (unroll better)
                          4.119M (± 0.6%) i/s -     20.796M in   5.048463s
uint64_enc4 (Array#pack)
                          3.144M (± 0.7%) i/s -     15.966M in   5.078173s

YJIT:

$ ruby --yjit test.rb
ruby 3.4.0dev (2024-05-23T18:29:33Z specialize-array-p.. d32ef53305) +YJIT [arm64-darwin23]
Warming up --------------------------------------
  uint64_enc1 (loop)     1.067M i/100ms
uint64_enc2 (unroll)     1.095M i/100ms
uint64_enc3 (unroll better)
                         1.089M i/100ms
uint64_enc4 (Array#pack)
                       640.782k i/100ms
Calculating -------------------------------------
  uint64_enc1 (loop)     11.533M (± 0.5%) i/s -     58.679M in   5.087999s
uint64_enc2 (unroll)     11.700M (± 0.6%) i/s -     59.114M in   5.052717s
uint64_enc3 (unroll better)
                         11.760M (± 0.5%) i/s -     59.869M in   5.090842s
uint64_enc4 (Array#pack)
                          6.785M (± 1.4%) i/s -     33.961M in   5.006533s

Unrolling the loop seems to help the interpreter, but doesn't seem to do much for YJIT. I'm a little bit surprised by how slow the Array#pack solution is on YJIT, I expected it to fare better since I merged https://github.com/ruby/ruby/pull/8734

maximecb commented 1 month ago

Might make sense to look at the disasm and make sure that your optimization is getting applied, see what the generated code looks like.

I'm a bit surprised that unrolling doesn't help YJIT at all. Wondering if there is a big source of overhead somewhere else or something?