Closed JAicewizard closed 4 years ago
Can you elaborate on the goal of this?
Do you have quantitative evidence of varint code needing a fast path?
If `n < 128`, no actual operation or calculation is required, so adding a single condition to handle all relatively small numbers seems trivial. For encoding there already is a fast path for `n == 0`, but this should be extendable to `n < 128`.
Both showed up in my benchmarks, and adding the fast-path for both improved performance by ~5%
Interesting! Would you mind sending a PR? If you have useful benchmarks, feel free to add them too.
Hmm, I tried to benchmark my program with/without the special cases, but they perform pretty much the same at the moment. I think that is because all the numbers I encode are now constants and get inlined, so the compiler can optimize the encoding away anyway. In the benchmarks where I initially added fast paths I could get up to 20% improvement.
Right now I am measuring 12.3 ns for 10 fast-path encodes and 14.8 ns for the unoptimized encode for ints under 64, where everything falls under the encoding fast path. So the fast path can improve performance by 17% where it applies. I am also getting better performance for cases that fall outside the fast path (13.5 ns vs 14.8 ns), which I found interesting.
On another note, I found that for all numbers this implementation's output differs from the golang version written by Google. For the numbers specified in this document it also doesn't match, even though it is linked in the readme. Is this known?
> On another note, I found that for all numbers this implementation's output differs from the golang version written by Google. For the numbers specified in this document it also doesn't match, even though it is linked in the readme. Is this known?
Where do you see the mismatch? I added a simple interactive encoder, and it gives me the same result as the Google document:
```
1
fixed: 1          encoded (unsigned): 00000001           encoded (signed): 00000010
300
fixed: 100101100  encoded (unsigned): 10101100 00000010  encoded (signed): 11011000 00000100
```
Oops!!! I am so sorry if this took a lot of your time. I tried to make it reproducible in the Rust playground, and while transforming the method into a separate function with the explicit type `u64`, I realized that Rust had decided 150 was a signed integer instead of unsigned.
So here are the re-run numbers on my machine for encoding. All functions are within the same crate and module (again 10 runs per "iteration" to make per-iteration overhead irrelevant):
| range | original (ns) | fast-path (ns) | port of golang (ns) |
|---|---|---|---|
| 0x0–0x80 | 14.8 | 12.3 | 12.4 |
| 0x80–0xff | 18.4 | 18.5 | 14.8 |
| 0x0–0xff | 15.8 | 14.6 | 13.1 |
| 0x0–0x80 | 14.9 | 12.3 | 12.4 |
| 0x80–0xff | 18.9 | 18.8 | 14.8 |
| 0x0–0xff | 16.8 | 15.3 | 13.4 |
As you can see, the golang port is not faster between 0x0 and 0x80 as long as the fast path is applied, but for numbers between 0x80 and 0xff it is faster. Another thing you can see is that forcefully inlining doesn't make a difference. But Cargo doesn't inline across crates, even for very small functions, so calling the encode function from your crate gives worse performance because the call is not inlined:

between 0x0 and 0x80: 17.7, between 0x80 and 0xff: 22.4, between 0x0 and 0xff: 20.4
So even just adding an `#[inline]` attribute would improve performance.
Also some benchmarks for the size function (again per 10):
cross-crate: 13.3 in-crate: 7.5 modified*: 6.6
As you can see, adding `#[inline]` to that function could also drastically improve performance.
I haven't done any benchmarks for decoding, but if you are interested I can do those too; I'm currently off from school because of the global health crisis...
*modified version: basically the Go encoding but without writing the data.
Go port of varint encoding (`u` is the value to encode, `dst` the output buffer):

```rust
// Emit 7 bits per byte, least-significant group first,
// with the high bit set on every byte except the last.
let mut i = 0;
while u >= 0x80 {
    dst[i] = u as u8 | 0x80;
    u >>= 7;
    i += 1;
}
dst[i] = u as u8;
i + 1
```
Thank you for the code sample, it makes it clear what you had in mind. I pushed commit c742e40; let me know if it's correct and achieves the desired benchmark performance. The tests pass, and you can pull branch varint-optimize to check it out.
I would also recommend adding `#[inline]` attributes to allow inlining (not force it), and removing the outer if-statement, since it doesn't do anything: it first checks whether the value equals zero and then whether it is less than 0x80, and both branches run (almost) the same code. It doesn't improve performance and makes the code look unclear.
I would recommend just this:
```rust
let mut i = 0;
while u >= 0x80 {
    dst[i] = u as u8 | 0x80;
    u >>= 7;
    i += 1;
}
dst[i] = u as u8;
i + 1
```
Also, you have constants at the top of the file; note that `0x80 == 0b10000000`, a.k.a. MSB, so use that if you prefer.
Thanks, I overlooked this. I've also used your proposed algorithm for signed integers now.
That all looks very good to me, I guess I should close the issue?
I did some benchmarking for decoding but didn't find any improvements from changing the decoding function.
For varint decoding there is no fast path for values under 128 (which fit within the 7 bits), and for encoding there is a fast path for 0. This could easily be extended to under 128, and the current code still needs to set 2 variables (`n` and `i`) even though they are only used when `n > 0`.