WebAssembly / design

WebAssembly Design Documents

measure performance of other variable-length integer encoding schemes #601

Closed lukewagner closed 8 years ago

lukewagner commented 8 years ago

As suggested by this HN comment, we should measure the comparative decode-time differences between LEB128 and the described PrefixVarint. Some of the advantages won't apply since we're mostly dealing with 1- or 2-byte varints, but, still, having the tag bits in the LSB might allow an optimized decoder to reduce branching.

mbebenita commented 8 years ago

@tterribe points out that we could reverse the prefix bits so that bsr could be used to quickly find the length of the remaining bytes.
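
For illustration only (not from the thread): one way to read that suggestion is to mirror the tag into the most significant bits, so the first byte is 1xxxxxxx for 1 byte, 01xxxxxx for 2 bytes, ..., 00000000 for 9 bytes, and a single clz/bsr on the first byte yields the length. A minimal sketch, assuming a GCC/Clang builtin:

static inline size_t
msb_prefix_length(const uint8_t* p)
{
  /* Put the tag byte at the top of a 32-bit word with a guard bit just below
     it, so the count is well defined even for *p == 0 (the 9-byte case). */
  uint32_t v = ((uint32_t)*p << 24) | (1u << 23);
  return 1 + __builtin_clz(v);  /* compiles to bsr/lzcnt on x86 */
}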

qwertie commented 8 years ago

@lukewagner Now that you mention it, wouldn't this encoding be nice?

0xxxxxxx 7 bits for the number
10xxxxxx xxxxxxxx 14 bits for the number
110xxxxx xxxxxxxx xxxxxxxx 21 bits for the number
1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx 28 bits for the number
1111nnnn xxxxxxxx .... n+2 total bytes, with (n+1)*8 bits for the number

This is a combination of two different encoding styles, depending on the top 4 bits, and I'm just proposing it on the guess that an optimized implementation of this would be simpler than the PrefixVarInt. It doesn't have the nice property of "uses exactly the same # of bits as LEB128" - instead it uses

| LEB128 holds up to   | This encoding holds up to |
| -------------------- | ------------------------- |
| 28 bits in 4 bytes   | 28 bits in 4 bytes        |
| 35 bits in 5 bytes   | 32 bits in 5 bytes        |
| 42 bits in 6 bytes   | 40 bits in 6 bytes        |
| 49 bits in 7 bytes   | 48 bits in 7 bytes        |
| 56 bits in 8 bytes   | 56 bits in 8 bytes        |
| 63 bits in 9 bytes   | 64 bits in 9 bytes        |
| 70 bits in 10 bytes  | 72 bits in 10 bytes       |
| ...                  | ...                       |
| 119 bits in 17 bytes | 128 bits in 17 bytes      |
| 126 bits in 18 bytes | N/A                       |
| 133 bits in 19 bytes | N/A                       |

Note: whether to use sign-extension or zigzag is an orthogonal decision.

P.S. if one wants to reduce the number of possible encodings for numbers <= 28 bits, use (n+4)*8 instead of (n+1)*8 in the formula above, or else error out if n<=3. This still doesn't force a unique encoding of each number, though.
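
A rough decoder sketch for the hybrid scheme above, just to make the dispatch concrete. The byte order here is an assumption, not part of the proposal: the prefix forms are taken as big-endian, the 1111nnnn form's payload as little-endian, and only payloads up to 64 bits are handled.

static uint64_t
hybrid_decode(const uint8_t* p, size_t* len_out)
{
  uint8_t b0 = p[0];
  if (b0 < 0x80) { *len_out = 1; return b0; }  /* 0xxxxxxx: 7 bits */
  if (b0 < 0xC0) { *len_out = 2; return ((uint64_t)(b0 & 0x3F) << 8) | p[1]; }  /* 14 bits */
  if (b0 < 0xE0) { *len_out = 3; return ((uint64_t)(b0 & 0x1F) << 16) | ((uint64_t)p[1] << 8) | p[2]; }  /* 21 bits */
  if (b0 < 0xF0) { *len_out = 4; return ((uint64_t)(b0 & 0x0F) << 24) | ((uint64_t)p[1] << 16) | ((uint64_t)p[2] << 8) | p[3]; }  /* 28 bits */
  /* 1111nnnn: n+1 payload bytes follow, n+2 bytes total; this sketch reads at
     most 8 payload bytes even though the scheme allows up to 16. */
  size_t payload = (size_t)(b0 & 0x0F) + 1;
  size_t readable = payload > 8 ? 8 : payload;
  uint64_t value = 0;
  for (size_t i = 0; i < readable; i++)
    value |= (uint64_t)p[1 + i] << (8 * i);
  *len_out = payload + 1;
  return value;
}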

lukewagner commented 8 years ago

Good input, thanks! I wasn't intending to drive this to resolution now (anyone else feel free to though), just filing this as something to ask and measure before MVP.

qwertie commented 8 years ago

@mbebenita Yay for solutions I don't think of because after 15-40 years, C++/C#/Java still don't (officially) support them.

stoklund commented 8 years ago

I ran some performance numbers for a modified version of the prefix encoded variable length integers described here.

TL;DR: Decoding PrefixVarint is about 3x faster than decoding LEB128.

The prefix encoding I used limits the integers to 64 bits so that the length of the encoded integer can be determined from the first byte:

xxxxxxx1  7 bits in 1 byte
xxxxxx10 14 bits in 2 bytes
xxxxx100 21 bits in 3 bytes
xxxx1000 28 bits in 4 bytes
xxx10000 35 bits in 5 bytes
xx100000 42 bits in 6 bytes
x1000000 49 bits in 7 bytes
10000000 56 bits in 8 bytes
00000000 64 bits in 9 bytes

This has the advantage that the total encoded length is limited to 9 bytes maximum, and it can be computed by a branch-free function:

static inline size_t
prefix_length(const uint8_t* p)
{
  /* OR in bit 8 so the count is well defined (and equals 8) when *p == 0,
     the 9-byte case. */
  return 1 + count_trailing_zeros_32(*p | 0x100);
}

Decoding the integer has a single highly predictable branch:

static inline uint64_t
prefix_decode(const uint8_t* p, size_t length)
{
  if (length < 9) {
    /* Load 8 bytes, shift off the bytes beyond the encoding, then shift off
       the length tag bits. */
    size_t unused = 64 - 8 * length;
    return unaligned_load_64(p) << unused >> (unused + length);
  } else {
    /* 9-byte case: the first byte is all tag; the next 8 bytes hold the value. */
    return unaligned_load_64(p + 1);
  }
}

I compared against LLVM's LEB128 implementation, which is pretty straightforward.
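
For reference, a straightforward unsigned LEB128 decoder of the kind being compared against looks roughly like this; a sketch in the spirit of LLVM's decodeULEB128, not its exact code:

static inline uint64_t
leb128_decode(const uint8_t* p, size_t* length_out)
{
  uint64_t value = 0;
  unsigned shift = 0;
  const uint8_t* start = p;
  uint8_t byte;
  do {
    byte = *p++;
    value |= (uint64_t)(byte & 0x7F) << shift;  /* 7 payload bits per byte */
    shift += 7;
  } while (byte & 0x80);                        /* top bit is the continuation flag */
  *length_out = p - start;
  return value;
}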

Both decoders depend on the branch predictor for good performance, LEB128 more so than PrefixVarint, so it makes a difference how the numbers to be decoded are distributed. I used log-uniform random 64-bit numbers, which means that log2(X) is uniformly distributed over the interval [0;64]. This choice was inspired by Benford's law, but something more realistic for WebAssembly would be desirable.
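
Purely as a sketch of the idea (this is not the code from the test program linked below; the helper name is invented, and it needs <stdint.h> and <stdlib.h>): a log-uniform draw can be made by picking a bit length uniformly and filling that many random bits.

static uint64_t
log_uniform_u64(void)
{
  unsigned bits = 1 + rand() % 64;                        /* bit length, uniform in [1, 64] */
  uint64_t x = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
  if (bits < 64)
    x &= ((uint64_t)1 << bits) - 1;                       /* keep only the low 'bits' bits */
  return x | ((uint64_t)1 << (bits - 1));                 /* force exactly 'bits' significant bits */
}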

Since WebAssembly currently restricts its LEB128 numbers to 32 bits, I also measured with 32-bit log-uniform numbers. In this range, the PrefixVarint decoder is essentially branch-free since the single branch is 100% predictable.

I measured on a Broadwell laptop (i7-5557U @ 3.10GHz) and a Nehalem iMac (i7-870 @ 2.93GHz). The table below shows how much faster PrefixVarint decodes 100,000 integers compared to LEB128.

| Log-uniform bits | 16   | 32   | 48   | 56   | 64   |
| ---------------- | ---- | ---- | ---- | ---- | ---- |
| Broadwell        | 1.5x | 2.2x | 2.5x | 2.7x | 2.4x |
| Nehalem          | 1.8x | 2.8x | 3.4x | 3.7x | 3.1x |

LEB128 performs better when the range of the log-uniform numbers is reduced because the decoder loop has fewer turns and fewer branch mispredictions. PrefixVarint is only affected by branch mispredictions when the log-uniform numbers become larger than 56 bits, so it compares most favorably at 56-bit log-uniform numbers.

Test program: varint.zip

haberman commented 8 years ago

@stoklund Thanks for posting these! I'm a member of the Protocol Buffers team at Google, so I have some background and interest in all of this. I've been a fan of PrefixVarint for a while. A few comments:

revanistT3-zz commented 8 years ago

Sorry, noob here. Does this make wasm generic between 32/64 bit, or help? It seems like you guys throw around solutions sometimes, but I can't keep track.

binji commented 8 years ago

Here are some real-world data on LEB lengths, using the examples from https://github.com/WebAssembly/build-suite:

Please note that these are generated via Binaryen's asm2wasm, so are not representative of what will be produced by the LLVM backend (which won't lower 64-bit operations to 32-bit operations).

Also note that the 5-byte values are slightly more common than required, since spidermonkify.py (which produced these binaries) always writes non-canonical LEBs for the section and function body lengths. (See the last example, BananaBread as produced by sexpr-wasm-prototype, which canonicalizes all LEBs by default.)

hello_iostream
1 bytes: 74922 (92.385%)
2 bytes: 4051 (4.995%)
3 bytes: 1266 (1.561%)
4 bytes: 35 (0.043%)
5 bytes: 824 (1.016%)
total lebs: 81098
total size of module: 198573
hello_world
1 bytes: 5965 (94.174%)
2 bytes: 292 (4.610%)
3 bytes: 16 (0.253%)
4 bytes: 6 (0.095%)
5 bytes: 55 (0.868%)
total lebs: 6334
total size of module: 14456
AngryBots:
1 bytes: 4736409 (90.077%)
2 bytes: 243305 (4.627%)
3 bytes: 169135 (3.217%)
4 bytes: 63328 (1.204%)
5 bytes: 46020 (0.875%)
total lebs: 5258197
total size of module: 12027763
BananaBread:
1 bytes: 863599 (90.742%)
2 bytes: 44986 (4.727%)
3 bytes: 38192 (4.013%)
4 bytes: 736 (0.077%)
5 bytes: 4199 (0.441%)
total lebs: 951712
total size of module: 2401314
BananaBread w/ canonical LEB128s:
1 bytes: 864941 (90.883%)
2 bytes: 47220 (4.962%)
3 bytes: 38199 (4.014%)
4 bytes: 738 (0.078%)
5 bytes: 614 (0.065%)
total lebs: 951712
total size of module: 2389229
stoklund commented 8 years ago

Thanks, @binji! With such a heavy bias towards the 1-byte integers, I think we should also consider something like the SQLite variable-length integers. It might compress better.

stoklund commented 8 years ago

@haberman code at https://github.com/stoklund/varint.

mbebenita commented 8 years ago

Out of curiosity, the number of significant bits of each value encoded with LEB128 in AngryBots:

| Name                     | Count  | Size     | % Total | Avg. Size |
| ------------------------ | ------ | -------- | ------- | --------- |
| writeLEB128 .. < n < 2^1 | 475020 | 463.887K | 3.804   | 1.0000    |
| writeLEB128 .. < n < 2^2 | 464983 | 454.085K | 3.723   | 1.0000    |
| writeLEB128 .. < n < 2^3 | 582355 | 568.706K | 4.663   | 1.0000    |
| writeLEB128 .. < n < 2^4 | 428868 | 418.816K | 3.434   | 1.0000    |
| writeLEB128 .. < n < 2^5 | 245084 | 239.340K | 1.962   | 1.0000    |
| writeLEB128 .. < n < 2^6 | 150840 | 147.305K | 1.208   | 1.0000    |
| writeLEB128 .. < n < 2^7 | 97489  | 95.204K  | 0.781   | 1.0000    |
| writeLEB128 .. < n < 2^8 | 41921  | 81.877K  | 0.671   | 2.0000    |
| writeLEB128 >= 2^8       | 183019 | 446.903K | 3.664   | 2.5004    |

% Total is % of file size.

binji commented 8 years ago

@mbebenita strange, the sum of your counts is much less than my counts for the number of LEB128s in AngryBots. Are you including the signed i32.consts too?

lukewagner commented 8 years ago

Note: the opcode-table-with-fixed-immediates change would likely significantly change the distribution of indices and, I expect, produce a more even distribution among the byte lengths.

mbebenita commented 8 years ago

@binji I instrumented this function: https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h#L107

It looks like Binaryen doesn't use LEB128 for i32.const, which is why the counts are probably off - https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h#L785

stoklund commented 8 years ago

I updated my benchmark to read a test vector from a file instead of generating random numbers. I used

wasm-as -d bb.wast -o /dev/null 2>&1 | awk '/writeLEB128/ { print $2 }'

to generate 465936 integers with a more realistic distribution. These numbers encode to 1.059 bytes/integer.

With such a heavy bias to single-byte encoded values, the PrefixVarint decoder is actually faster with an extra branch:

void prefix_decode(const uint8_t *in, uint64_t *out, size_t count) {
  while (count-- > 0) {
    if (LIKELY(*in & 1)) {
      /* Low tag bit set: 1-byte encoding, value in the upper 7 bits. */
      *out++ = *in++ >> 1;
    } else {
      size_t length = prefix_length(in);
      *out++ = prefix_get(in, length);
      in += length;
    }
  }
}

This makes me sad, but what can you do?

I made a similar change to my LEB128 decoder for fairness. The PrefixVarint decoder is still 1.11x faster with this data:

Read 465936 integers from /Users/jolesen/build/bb.leb128.
LEB128: 1.059 bytes/integer.
LEB128: 4.609e-04 secs.
PrefixVarint: 1.059 bytes/integer.
PrefixVarint: 4.152e-04 secs.
T(LEB128) / T(PrefixVarint) = 1.110.

Updated code in my repo.

stoklund commented 8 years ago

A modified version of the SQLite encoding also performs well on the BananaBread numbers:

Read 465936 integers from /Users/jolesen/build/bb.leb128.
LEB128: 1.059 bytes/integer.
LEB128: 4.593e-04 secs.
PrefixVarint: 1.059 bytes/integer.
PrefixVarint: 4.066e-04 secs.
leSQLite: 1.056 bytes/integer.
leSQLite: 3.733e-04 secs.
T(LEB128) / T(PrefixVarint) = 1.130.
T(LEB128) / T(leSQLite) = 1.230.

ghost commented 8 years ago

FWIW, if you make prefix varint big-endian and special-case the one-byte encoding, then you can remove the shift (and on x86 the bit test) from the fast path, so you have simply `if ((int8_t)*in >= 0) *out++ = *in++;`. For the remaining cases you gain a byte-reverse operation and not much more (a shift index gets multiplied by 7 rather than 8, which might not suit the lea opcode, and lea seems to be faster than other arithmetic).

I know that's a micro-optimisation, but it brings the LEB128 and prefix fast paths to parity.
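
A sketch of that fast path, as my illustration rather than the commenter's code. The layout is assumed to be UTF-8-like (the number of leading one bits in the first byte gives the number of continuation bytes), and the slow path below is a plain byte loop only to keep the sketch self-contained, whereas the comment above envisions an unaligned load plus byte reverse:

static void
msb_prefix_decode(const uint8_t* in, uint64_t* out, size_t count)
{
  while (count-- > 0) {
    if ((int8_t)*in >= 0) {
      *out++ = *in++;                     /* one-byte value: no shift, no bit test */
    } else {
      /* Count leading ones to get the number of continuation bytes, then
         accumulate big-endian payload bytes. */
      unsigned extra = __builtin_clz(~((uint32_t)*in << 24));
      uint64_t value = *in++ & (0x7Fu >> extra);
      for (unsigned i = 0; i < extra; i++)
        value = (value << 8) | *in++;
      *out++ = value;
    }
  }
}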

flagxor commented 8 years ago

Implementors have discussed and concluded that while alternative encodings might offer improved size, the intention to support layer 1 (allowing outer wrapping format specific compression), and the degree to which LEBs are well known, outweighs this advantage for MVP. Closing.

chicoxyzzy commented 7 years ago

@haberman do you have any implementation/examples of Protocol Buffers decoder in wasm? This is something I'm interested in

chmike commented 6 years ago

I know WebAssembly picked LEB128, but I was just curious why you didn't benchmark PrefixVarint with the tag ones in the most significant bits?

The difference is that with the one-byte encoding (values < 128), which is also the most frequent case, no shift is required.

qwertie commented 2 years ago

@chmike I would've liked to see that too.

The whole process of MVP's design was really disheartening. @stoklund says "look, this solution is 2x-3x as fast as LEB128!" and the final one-sentence rationale doesn't even mention performance, instead making a nonsensical reference to "improved size" (PrefixVarInt and LEB128 are the same size) and "layer 1" (never mind that layer 1 by its very nature cannot make decoding performance better).

It's a lot like my text format proposal that was closed with no rationale. I would have been happy to work full time pro bono on that text format, but oh well.

chmike commented 2 years ago

The only argument in favor of LEB128 is that it allows locating the end of a value from a random access point; that is impossible with prefix varints. Given an a priori choice between slower and impossible, one may lean toward slower to avoid the impossible.

It makes sense for some applications, like UTF-8 encoding, to have picked LEB128 encoding, because random access is very likely to happen. Would one use random access with WebAssembly? What might be the use cases?

dead-claudia commented 2 years ago

Edit: fix a link in the details and expand a bit more on some things. Edit 2: fix bugs in the linked code in the algorithm description

I just wanted to chime in and note that this could be significantly accelerated through the use of bit field manipulation and some limited overfetching to parallelize the loads and stores. I'm not aware of any implementations that do this, though. But in any case, I'm not persuaded that the PrefixVarint format has nearly the performance benefits claimed, not when you can overfetch and avoid the load -> load dependency bottleneck entirely, even if the data doesn't fall neatly onto a single cache line. (Here's an archive.org link, as the repo appears to have since been deleted.) Unless the data is all somehow in L1 cache (not something you can rely on), most of the time spent is going to be waiting on memory loads and stores, and so there's little the PrefixVarint format offers beyond eliminating that final load, as I've done here.

The general intuition here for my algorithm is to treat values post-normalization as essentially a SWAR of 1-bit values and to leverage the implicit modulo you get from ignoring extra bits to both be able to do more things in parallel and get away with overfetching and overstoring to further enhance this:

This is a common trick used in SIMD processing, actually.

Here's the general algorithm itself:

> I'd use [`PDEP`](https://www.felixcloutier.com/x86/pdep) and [`PEXT`](https://www.felixcloutier.com/x86/pext) instructions ([introduced by Intel in 2013 and AMD in 2015](https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#BMI2_(Bit_Manipulation_Instruction_Set_2))) to accelerate it further except...well...AMD pre-Zen 3 [microcoded them and thus left both of them with rather high latency and similarly low throughput](https://www.agner.org/optimize/instruction_tables.pdf). On all Intel's stuff as well as on AMD Zen 3 it's a single cycle operation, and if I used that, I could cut the operations here by roughly half. MIPS has similar instructions, but I don't know their performance. Also, ARM doesn't have anything like this at all, and so you'd have to use this fallback anyways. (Luckily ARM *does* have `UBFX` and `BFI`, so this can be paired with `ORR` + barrel shifts to do it in just 2 steps each. It also has a few niceties noted later as well.)
>
> These are all for 32-bit integers, but it'd be fairly straightforward to adapt for 64-bit integers. ([Here's an (untested) example 32-bit decode implementation for 64-bit architectures.](https://godbolt.org/z/djGsMc9da) Considering the code size, it does seem fairly reasonable. And 64-bit ARM is pretty similar in code size, too.)
>
> This also takes advantage of the fact LEB128 is stored little-endian, so I don't need to byte-swap anywhere, but of course working around that is as simple as a `bswap` + possible shift.
>
> Note: you should consider implementing the size dispatching as a `switch` to allow the processor to optimize dispatching, as I did in my 32-bit integer decode implementation for 64-bit architectures.
>
> And one final note: none of this is actually tested, so it might be subtly incorrect. But you get the drift.

- Unsigned encode, for a given input source:
  - Encoded size: ceil(log2(max(n, 1)) / 7)
    - On 64-bit architectures, if enough extra not-later-read space is available to simply write a single 64-bit value, use 8 for the writable size to save a step. This can be used instead of the above calculation for a mild speedup, though note that this only works for 64-bit as 32-bit doesn't gain anything out of it.
    - If not enough space is available to write past the end, the writable size is just the encoded size.
  - On 32-bit architectures:
    - Set high to bits 28-31 of source, zero-extended
    - Set low to bits 0-27 of source
    - Set result to the following, produced via bitwise OR for overlapping regions:
      - Bit 0-6: bits 0-6 of low
      - Bit 7: literal 1
      - Bit 8-14: bits 7-13 of low
      - Bit 15: literal 1
      - Bit 16-22: bits 14-20 of low
      - Bit 23: literal 1
      - Bit 24-30: bits 21-27 of low
      - Bit 31: literal 1
    - Clear the high bit of the one-indexed `size`-th byte of result, to indicate the end, if high is zero
      - This can be done in a couple short operations via `result & ~((high == 0) << ((size - 1) * 8 - 1))` ([compiles to 4 instructions on 32-bit ARM](https://godbolt.org/z/8TMMjnsjc))
      - The condition is just to do this only if there's only at most 4 bytes to write (as in, it takes more than 32 bits to represent)
    - If writable size is 1: store8(i, low)
    - If writable size is 2: store16_le(i, low)
    - If writable size is 3: store16_le(i, low)
    - If writable size is 3: store8(i, unsigned(low) >> 16)
    - If writable size is 4: store32_le(i, low)
    - If writable size is 5: store32_le(i, low) then store8(i+4, high)
  - On 64-bit architectures:
    - Set result to the following, produced via bitwise OR for overlapping regions:
      - Bit 0-6: bits 0-6 of source
      - Bit 7: literal 1
      - Bit 8-14: bits 7-13 of source
      - Bit 15: literal 1
      - Bit 16-22: bits 14-20 of source
      - Bit 23: literal 1
      - Bit 24-30: bits 21-27 of source
      - Bit 31: literal 1
      - Bit 32-35: bits 28-31 of source
      - Bit 36-39: literal 0
      - Remainder can be anything as it's not used and masked/shifted away
    - Clear the high bit of the one-indexed `size`-th byte of result, to indicate the end
      - This can be done very simply as `result & ~(1 << ((size - 1) * 8 - 1))` (which compiles to [5 instructions on AArch64](https://godbolt.org/z/vr5Y166b1) and [4 on x86-64](https://godbolt.org/z/WKr1c7raG))
      - The condition is just to do this only if there's only at most 4 bytes to write (as in, it takes more than 32 bits to represent)
    - If size is 1: store8(i, target)
    - If size is 2: store16_le(i, target)
    - If size is 3: store16_le(i, target)
    - If size is 3: store8(i, unsigned(target) >> 16)
    - If size is 4: store32_le(i, target) and return
    - If size is 5: store32_le(i, target), store8(i+4, unsigned(target) >> 32)
    - If size is 8: store64_le(i, target)
- Unsigned decode 32-bit:
  - readable_size = number of readable bytes ahead
  - Set high and low to initially be both 0 (high can even be 8-bit, but low must be pointer-size)
  - If readable_size is 1: set low to load8(i)
  - If readable_size is 2: set low to load16_le(i)
  - If readable_size is 3: set low to load16_le(i) | load8(i+2) << 16
  - If readable_size is 4: set low to load32_le(i)
  - If readable_size is 5-7 on 64-bit architectures or if readable_size >= 5 on 32-bit architectures:
    - Set low to load32_le(i)
    - Set high to load8(i+4)
  - If readable_size >= 5 on 64-bit architectures: set low to load64_le(i)
  - Set result to the following, produced via bitwise OR for overlapping regions:
    - Bits 0-6: bits 0-6 of low
    - Bits 7-13: bits 8-14 of low
    - Bits 14-20: bits 16-22 of low
    - Bits 21-27: bits 24-30 of low
    - 64-bit only: Bits 28-31: bits 32-35 of low (will be all clear if only 32 bits loaded)
    - Bits 28-31: bits 0-6 of high (will be all clear unless loaded)
    - Note: bits above what's set here in the result are freely ignored and thus can be filled with whatever's easiest to produce.
  - If bit 7 of low is 0, return result & 0x0000007F
  - If bit 15 of low is 0, return result & 0x00003FFF
  - If bit 23 of low is 0, return result & 0x001FFFFF
  - If bit 31 of low is 0, return result & 0x0FFFFFFF
  - If bit 7 of high or (on 64-bit) 47 of low is 0, return result
  - Fail
- Signed encode/decode would look similar, just with a couple changes:
  - The size calculation would be slightly different for encode as it'd need to account for negatives, the high value would need to be sign-extended on extraction, and the low (for 32-bit) and target (for 64-bit) values would additionally need sign-extended from its size if high is zero.
  - For decoding, the returns would all need sign-extended. Luckily, it's as simple as `signed(result << size) >> size` for x86, and ARM has a convenient `SBFX` that can *replace* the bitwise AND (it'd be specifically `SBFX{cond} Rd, Rn, #0, #size` *instead* of the bitwise AND) for the immediates here, and so it'd be zero cost. (A compiler would be wise to use `UBFX` with the same operands for the unsigned variant.)
chmike commented 2 years ago

You should benchmark it.

I’m working with the Go programming language and have explored many different variable-length integer encodings. I also experimented with various implementations. The most efficient code is available here: https://github.com/chmike/varint. This code minimizes the number of machine instructions to execute. I provide benchmarks comparing it with the LEB128 encoding provided in the Go standard library.

Through my experiments I have seen that any bit fiddling makes it less efficient. I thus found that postfix encoding is less efficient, which was unexpected, as I thought that writing the 64-bit value in little-endian order would be faster.

LEB128 encoding is faster for small values because the function is inlined. But even though my function is not inlined, my encoding function beats LEB128 when the values have more than 14 significant bits. LEB128 decoding is significantly worse.

My code could beat LEB128 if the encoding of small values (up to 14 significant bits) could be inlined. That is currently not possible with the Go compiler, but it may change in the future.

Note that I only benchmark the encoding and decoding operations. There are other operations around them that also affect the overall performance, like testing the return value or handling the panic (Go’s exceptions).

Note also that my benchmarks may have been advantaged by branch prediction. A better test would use a realistic random distribution of value sizes.

dead-claudia commented 2 years ago

Note that the scatter and gather steps in my above algorithm are fairly special-cased to LEB128, but only because the tag and data bits are in the same positions in each byte, so I can move the value bits into place without depending on the length. For PrefixVarint, moving bits into place is a simple shift (2 shifts with an OR if the whole value exceeds pointer width), but I would have to do a `__builtin_ffs(first_byte | 0x100)` to know how far. I suspect it'll still be somewhat faster, but not by nearly as much, due to all the data parallelism in the LEB128 algorithm.

@chmike If I can find time, I might be able to give you hard numbers for 32- and 64-bit encode and decode. I find it unlikely that the 6-μop 18-cycle PEXT in AMD processors all the way up through Zen 2 would give me much of a speed-up, but I may try anyway if I can (as I'm running a Zen 2 processor that has the slow microcoded version). I won't have ARM numbers for you, though. At least for now, here's a code comparison between the two formats for x86-64 and AArch64.

Had AMD implemented PDEP/PEXT in hardware from the start like Intel did, both entire `result = ...` steps in the decode could easily be folded into a set of conditional moves to determine the mask and a final `_pext_u32(low | high & 0x7FULL << 28, mask)` to get the result, fusing several parallel ands, ors, and shifts that'd take around 8-10 cycles to complete (if I'm reading the generated assembly and comparing it to Agner Fog's chart correctly). That would be followed by some moderate branching into a series of branch-free conditional moves, followed by a simple single-cycle `MOV r64, imm64` + single 3-cycle `pdep rax, r64` that I can then immediately return. (It's the scale of the branching that makes me wonder if I can actually get away with it even on Zen 2.)
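
To make the PEXT idea concrete, here is a minimal sketch (my own illustration, not the algorithm above) of how a single _pext_u32 gathers the four 7-bit payload groups out of a little-endian 32-bit load; it requires BMI2, e.g. compile with -mbmi2:

#include <stdint.h>
#include <immintrin.h>

static inline uint32_t
leb128_gather4(uint32_t word)   /* 'word' = 4 LEB128 bytes loaded little-endian */
{
  /* Keep bits 0-6, 8-14, 16-22 and 24-30 and pack them contiguously,
     yielding up to 28 payload bits in one instruction. */
  return _pext_u32(word, 0x7F7F7F7Fu);
}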