Closed JAicewizard closed 4 years ago
Can you elaborate on the goal of this?
Do you have quantitative evidence of varint code needing a fast path?
If `n < 128`, no actual operation or calculation is required, so adding a single condition to handle all relatively small numbers seems trivial. For encoding there already is a fast path for `n == 0`, but this should be extendable to `n < 128`.
Both showed up in my benchmarks, and adding the fast-path for both improved performance by ~5%
Interesting! Would you mind sending a PR? If you have useful benchmarks, feel free to add them too.
Hmm, I tried to benchmark my program with/without the special cases, but they perform pretty much the same at the moment. I think that is because all the numbers I encode are now constants and get inlined, so the compiler can optimize the encoding away anyway. In the benchmarks where I initially added fast paths I could get up to 20% improvement.
Right now I am measuring 12.3 ns for 10 fast-path encodes and 14.8 ns for the unoptimized encode for ints under 64, where everything falls under the encoding fast path. So the fast path can improve performance by 17% where it applies. I am also getting better performance for cases that fall outside the fast path (13.5 ns vs 14.8 ns), which I found interesting.
On another note, I found that for all numbers this implementation's output differs from the golang version written by Google. For the numbers specified in this document it also doesn't match, even though it is linked in the readme. Is this known?
> On another note, I found that for all numbers this implementation's output differs from the golang version written by Google. For the numbers specified in this document it also doesn't match, even though it is linked in the readme. Is this known?
Where do you see the mismatch? I added a simple interactive encoder, and it gives me the same result as the Google document:
```
1
fixed: 1          encoded (unsigned): 00000001           encoded (signed): 00000010
300
fixed: 100101100  encoded (unsigned): 10101100 00000010  encoded (signed): 11011000 00000100
```
Oops!!! I am so sorry if this took a lot of your time. I tried to make it reproducible in the Rust playground, and while transforming the method into a separate function with the explicit type `u64`, I realized that Rust had decided 150 was a signed integer instead of unsigned.
So here are the re-run numbers on my machine for encoding. All functions are within the same crate and module (again 10 runs per "iteration" to make per-iteration overhead irrelevant):
| range | original (ns) | fast-path (ns) | port of golang (ns) |
|---|---|---|---|
| 0x0–0x80 | 14.8 | 12.3 | 12.4 |
| 0x80–0xff | 18.4 | 18.5 | 14.8 |
| 0x0–0xff | 15.8 | 14.6 | 13.1 |
| 0x0–0x80 | 14.9 | 12.3 | 12.4 |
| 0x80–0xff | 18.9 | 18.8 | 14.8 |
| 0x0–0xff | 16.8 | 15.3 | 13.4 |
As you can see, the golang port is not faster between 0x0 and 0x80 as long as the fast path is applied, but for numbers between 0x80 and 0xff it is faster. Another thing you can see is that forcefully inlining doesn't make a difference. But Cargo doesn't inline across crates, even for very small functions, so calling the encode function from your crate gives worse performance because the call is not inlined:

between 0x0 and 0x80: 17.7, between 0x80 and 0xff: 22.4, between 0x0 and 0xff: 20.4
So even just adding an `#[inline]` attribute would improve performance.
Also some benchmarks for the size function (again per 10):
cross-crate: 13.3 in-crate: 7.5 modified*: 6.6
As you can see, adding `#[inline]` to that function could also drastically improve performance.
I haven't done any benchmarks for decoding, but if you are interested I can do those too; I'm currently off from school because of the global health crisis...
*modified version: basically the Go encoding but without writing the data.
Go port of varint encoding (`u` is the value to encode, `dst` the output buffer):

```rust
// Emit 7 bits per byte, least-significant group first,
// with the high bit set on every byte except the last.
let mut i = 0;
while u >= 0x80 {
    dst[i] = u as u8 | 0x80;
    u >>= 7;
    i += 1;
}
dst[i] = u as u8;
i + 1
```
Thank you for the code sample, it makes it clear what you had in mind. I pushed commit c742e40; let me know if it's correct and achieves the desired benchmark performance. The tests pass, and you can pull branch varint-optimize to check it out.
I would also recommend adding `#[inline]` attributes to allow inlining (not force it), and removing the outer if-statement, since it doesn't do anything: it first checks whether the value equals zero and then whether it is less than 0x80, and both branches run (almost) the same code. It doesn't improve performance and makes the code look unclear.
I would recommend just this:
```rust
let mut i = 0;
while u >= 0x80 {
    dst[i] = u as u8 | 0x80;
    u >>= 7;
    i += 1;
}
dst[i] = u as u8;
i + 1
```
Also, you have constants at the top of the file; note that `0x80 == 0b10000000`, a.k.a. MSB, so use that if you prefer.
Thanks, I overlooked this. I've also used your proposed algorithm for signed integers now.
That all looks very good to me, I guess I should close the issue?
I did some benchmarking for decoding but didn't find any improvements from changing the decoding function.
For varint decoding there is no fast path for values under 128 (which fit within the 7 bits), and for encoding there is a fast path for 0. This could easily be extended to under 128, and the current code still needs to set 2 variables (`n` and `i`) even though they are only used when `n > 0`.