astei / varint-writing-showdown

How fast can you write a Google Protocol Buffer (signed) VarInt?
MIT License

Precompute varint lengths, 64 bit values #1

Open richardstartin opened 3 years ago

richardstartin commented 3 years ago

Hi, I read your blog post. It was interesting, so thanks for writing it. The function called "startin" in your blog is a bit different from what's in my post: my implementation precomputes the lengths and puts them in a lookup table indexed by the number of leading zeros, so it doesn't do the division during the encoding:

    // Number of continuation bytes, indexed by Long.numberOfLeadingZeros
    // of the value; precomputing this keeps the division by 7 off the
    // encoding hot path.
    private static final int[] VAR_INT_LENGTHS = new int[65];

    static {
        for (int i = 0; i <= 64; ++i) {
            VAR_INT_LENGTHS[i] = (63 - i) / 7;
        }
    }

    int varIntLength(long value) {
        return VAR_INT_LENGTHS[Long.numberOfLeadingZeros(value)];
    }

    private void writeVarInt(int offset, long value) {
        int length = varIntLength(value);
        // write 'length' bytes with the continuation bit set...
        for (int i = 0; i < length; ++i) {
            buffer[offset + i] = ((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        // ...then the final byte without it
        buffer[offset + length] = (byte) value;
    }

I think you will find different results if you consider long inputs too, where, as you note, manual unrolling gets ugly quickly, and you will start to fall foul of inlining heuristics as code size increases. I found the approach outlined in my blog post (with the lookup table) starts beating the most obvious implementation at 3 nonzero bytes in the input, and at 60 bytes it is very friendly to C2's inlining policies.

Obviously, varint encoding makes less sense as the inputs get larger, so optimising for longer inputs is questionable, but there are a lot of cases where the encoder doesn't have a choice and must handle unpredictable inputs.
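For context, the "most obvious implementation" referred to above is presumably the textbook write loop, which tests for remaining bits on every iteration. A minimal sketch (my own illustration, not code from this repo; the class and method names are made up):

```java
public class ObviousVarInt {
    // Textbook varint writer: emit 7 bits at a time, setting the
    // continuation bit (0x80) while higher bits remain.
    static int writeVarInt(byte[] buffer, int offset, long value) {
        while ((value & ~0x7FL) != 0) {
            buffer[offset++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        buffer[offset++] = (byte) value;
        return offset; // position just past the encoded value
    }
}
```

The branch inside the loop is what the lookup-table version removes: once the length is known up front, the loop bound is fixed and every byte but the last unconditionally gets the continuation bit.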

jasonk000 commented 1 year ago


Stumbled on this via the mentioned Kafka ticket; adding a note:

I benchmarked a lookup-table approach and a bitshift-based replacement for the DIV against the prior implementation found here. The bitshift and LUT approaches were very close to each other, and both much faster than the divide-based implementation (which is what this repo currently uses).

https://github.com/apache/kafka/pull/11721.
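For readers curious what a bitshift-based length computation can look like, here is a sketch of the technique (my own illustration, not necessarily the exact code in that PR). It replaces the division by 7 with a multiply-and-shift, using the fact that `(n * 37) >>> 8 == n / 7` for all `n` in `[0, 63]`:

```java
public class ShiftVarIntLength {
    // Total encoded size in bytes (1..10) of a 64-bit value as an
    // unsigned varint, with no division. log2 is the index of the
    // highest set bit; (value | 1) keeps value == 0 well-defined.
    static int varIntSize(long value) {
        int log2 = 63 ^ Long.numberOfLeadingZeros(value | 1);
        return ((log2 * 37) >>> 8) + 1;
    }

    // Division-based reference, for comparison.
    static int refSize(long value) {
        return ((63 ^ Long.numberOfLeadingZeros(value | 1)) / 7) + 1;
    }
}
```

Note this returns the total encoded size, whereas the `varIntLength` table earlier in the thread returns the number of continuation bytes (one less).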