Closed: caibear closed this pull request 8 months ago.
Benchmarks for those who are interested. The previous version of bitcode isn't shown here, but it has speed similar to bincode, size similar to the new bitcode, and compressed size 20% worse than bincode.
Format | Compression | Size (bytes) | Serialize (ns) | Deserialize (ns)
---|---|---|---|---
bincode | (none) | 49.1 | 35 | 115
bincode | lz4 | 16.1 | 86 | 115
bincode | deflate-fast | 13.1 | 166 | 176
bincode | deflate-best | 8.9 | 3708 | 141
bincode | zstd-0 | 12.4 | 172 | 146
bincode | zstd-22 | 8.5 | 32312 | 133
bincode-varint | (none) | 22.3 | 36 | 116
bincode-varint | lz4 | 10.8 | 71 | 119
bincode-varint | deflate-fast | 10.1 | 146 | 165
bincode-varint | deflate-best | 8.0 | 2664 | 153
bincode-varint | zstd-0 | 8.2 | 123 | 131
bincode-varint | zstd-22 | 7.8 | 23939 | 136
bitcode | (none) | 16.9 | 29 | 104
bitcode | lz4 | 9.8 | 42 | 108
bitcode | deflate-fast | 8.3 | 86 | 137
bitcode | deflate-best | 6.8 | 1661 | 125
bitcode | zstd-0 | 7.1 | 58 | 115
bitcode | zstd-22 | 6.2 | 24713 | 115
bitcode-derive | (none) | 16.9 | 10 | 12
bitcode-derive | lz4 | 9.7 | 25 | 17
bitcode-derive | deflate-fast | 8.3 | 68 | 45
bitcode-derive | deflate-best | 6.8 | 1597 | 33
bitcode-derive | zstd-0 | 7.1 | 40 | 24
bitcode-derive | zstd-22 | 6.2 | 24905 | 25
> doesn't require any hints (determines them at runtime)
This is remarkable, is there any particular part of the branch you could point me to so I could see how it's done?
Side question, do you see any benefits in hinting on top of runtime determined?
Hope it's okay to ask the questions in your PR :) thanks for making this library
> doesn't require any hints (determines them at runtime)
>
> This is remarkable, is there any particular part of the branch you could point me to so I could see how it's done?
https://github.com/SoftbearStudios/bitcode/blob/5bdc22ba943d0ba8de092a763327b8167656611f/src/pack.rs
https://github.com/SoftbearStudios/bitcode/blob/5bdc22ba943d0ba8de092a763327b8167656611f/src/pack_ints.rs
https://github.com/SoftbearStudios/bitcode/blob/5bdc22ba943d0ba8de092a763327b8167656611f/src/f32.rs (albeit this one requires the aid of a compression algorithm like deflate/lz4/zstd to do anything)
> Side question, do you see any benefits in hinting on top of runtime determined?
After adding hints to bitcode, I didn't use them as much as I thought I would because it was tedious. The types of "packing" I'm using in this new version are designed to quickly determine if they're applicable and pack the data. I'm probably not going to add manual hints back because most people don't benefit from them (me included). Also, this new version prioritizes working with general-purpose compression, which some of the old hints got in the way of (e.g. `#[bitcode_hint(ascii)]` made characters 7 bits, which confused byte-wise compression algorithms).
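For a rough idea of what runtime-determined packing can look like, here is a minimal sketch only, not bitcode's actual pack.rs/pack_ints.rs code (pack_u32s is a made-up name): scan the values once, pick the smallest width that fits, and write a one-byte tag so the decoder knows the layout.

```rust
// Hypothetical sketch of runtime-determined integer packing.
// Not bitcode's implementation; it only illustrates the idea.
fn pack_u32s(values: &[u32], out: &mut Vec<u8>) {
    let max = values.iter().copied().max().unwrap_or(0);
    if max <= u8::MAX as u32 {
        out.push(1); // tag: each value stored in 1 byte
        out.extend(values.iter().map(|&v| v as u8));
    } else if max <= u16::MAX as u32 {
        out.push(2); // tag: each value stored in 2 bytes
        for &v in values {
            out.extend_from_slice(&(v as u16).to_le_bytes());
        }
    } else {
        out.push(4); // tag: each value stored in 4 bytes
        for &v in values {
            out.extend_from_slice(&v.to_le_bytes());
        }
    }
}
```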
> Hope it's okay to ask the questions in your PR :) thanks for making this library
Yeah! I made this PR public before being finished to see what people think about it.
Very impressive!!!
I just released `bitcode = "=0.6.0-alpha.1"`, which has the core features done.
Good work! I read through the code and have a few thoughts:
- Looks like enums with more than 256 variants are not supported; will they be supported in the full release?
Not unless multiple people complain about it, and even then it would probably have suboptimal performance. I've been avoiding it because calculating a histogram of an unbounded number of elements would require a HashMap.
- This definitely is a huge improvement over 0.5 for large messages with a lot of repeated patterns. But I'm guessing it won't do as well as 0.5 for small messages. It would be nice to measure how it compares against 0.5 in terms of speed and size for messages that have few long Vecs but might have deep structures.
You are right that speed is worse on small messages, however I don't see this as a problem since small messages are probably not a bottleneck anyway. The main overhead is allocating a Vec for each field, but you can keep an EncodeBuffer around to reuse allocations. On the size front, small messages might be bigger uncompressed, but smaller if you have cross-message compression, since 0.5 doesn't compress well.
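As a sketch of the allocation-reuse idea (assuming the Buffer type shown later in this thread; the Update struct and encode_updates function are made-up names):

```rust
use bitcode::{Buffer, Encode};

#[derive(Encode)]
struct Update {
    x: f32,
    y: f32,
}

// Reusing one Buffer across calls avoids reallocating the per-field Vecs.
fn encode_updates(updates: &[Update]) -> Vec<Vec<u8>> {
    let mut buffer = Buffer::default();
    updates
        .iter()
        .map(|u| buffer.encode(u).to_vec()) // encode returns borrowed bytes
        .collect()
}
```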
If it's worse than 0.5 for this use case, would it make sense to keep 0.5 around and release this in a new crate?
I thought about this, but I don't want to maintain two crates. Someone can always fork 0.5 and maintain it if they want.
- Would it make sense to optimize the case where the len of a vec type is always 0? If a vec type always has a len of 0, don't bother storing the elements. The same goes for enum variants (I didn't understand the code enough to know whether it already does this). This is beneficial for deprecated data.
I considered this (including integers that are always 0), but it allows for a zip bomb. Currently deserialization memory usage is proportional to input size. (Edit: actually enum variants do this, but only with the derive macro).
- I'm personally interested in this crate but want it to support schema evolution. Opened a ticket: [FR: allow schema evolution (in or after 0.6) #21].
See my comment.
- I have a hard time imagining how recursive data can be supported. If it's hard (in terms of effort or ser/de speed), I think you might not want to support it at all.
One way would be to have a max recursion depth N and have N nested encoders.
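For concreteness, a hypothetical self-referential type of the kind being discussed (not from the crate); with a maximum depth N, each level of children would get its own encoder:

```rust
// Hypothetical recursive type. Each nesting level of `children` needs
// another encoder, which is why a fixed max depth N bounds the number
// of nested encoders.
struct Node {
    value: u32,
    children: Vec<Node>,
}
```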
- I think enums with more than 256 variants can be easily supported with a compile-time IF branch
Yes, certainly for derive (although this would require some code duplication). The tricky part is serde, where you don't know the total number of variants ahead of time (I suppose it could always encode them as u32 or something but it's not elegant).
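A rough sketch of what such a compile-time branch could look like in derive-generated code (hypothetical, not the actual macro output): the variant count is a constant, so the dead branch is optimized away.

```rust
// Hypothetical shape of generated code for an enum with 300 variants.
const VARIANT_COUNT: usize = 300;

fn encode_variant_index(index: usize, out: &mut Vec<u8>) {
    if VARIANT_COUNT <= 256 {
        // Enums with up to 256 variants fit their discriminant in one byte.
        out.push(index as u8);
    } else {
        // Larger enums fall back to a wider fixed-size index.
        out.extend_from_slice(&(index as u32).to_le_bytes());
    }
}
```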
- For 0.5 vs 0.6 on small messages, is there a benchmark to measure this?
No, because they're both called bitcode and they don't like to be imported at the same time. My comments about compressed size of small messages are based on us testing both versions on our projects and eyeballing the results. Feel free to do your own benchmarks and share the results here!
- For a vec with only zero length, I think a zip bomb can still be avoided. I'll give an example:
```rust
struct A {
    ... // a lot of nested structures
}
struct B(Vec<Inner>);
fn serialize(b: &B) {...}
```

If I understand correctly, if b.0.len() is always 0, we would still have a lot of data storing A's definition, which is unnecessary. We still store b.0.len() as 0, which takes some space, so a zip bomb is not possible.
Consider `Vec<Vec<u8>>` where all the inner Vecs are 0 length.
- IMO, a zip bomb should not be a consideration in the encoding spec, but in the implementation. Interesting read on how capnproto handles zip bombs: https://capnproto.org/encoding.html#amplification-attack
Yeah this isn't a bad idea, but I didn't really want to worry about this for the initial version. We should revisit this once 0.6 is released.
> Consider `Vec<Vec<u8>>` where all the inner Vecs are 0 length.

Let's use `Vec<Vec<A>>` as an example. This still isn't a problem, say it's `[[],[],[]]`. We store 3, 0, 0, 0, but we shouldn't store anything about A. Maybe I'm misunderstanding it and it's the current behavior?
Oh, nothing is currently stored for A even though the encoders exist. I thought you meant the 0s wouldn't get stored because all the Vecs are empty.
I'm curious how you're testing auto vectorisation. Are you manually enabling cpu features with rustflags, or using something like target native?
the alpha version on benchmarks looks nuts :)
> I'm curious how you're testing auto vectorisation. Are you manually enabling cpu features with rustflags, or using something like target native?
Just using target-cpu=native in .cargo/config.toml. x86_64 always has SSE2, so even without target-cpu=native there is still some autovectorization.
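For reference, one common way to set that flag (standard Cargo config, not specific to bitcode):

```toml
# .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]
```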
The autovectorization mainly improves encode_vectored by making it require about half as many mov instructions. LLVM can replace scalar load + scalar store with scalar load + vectored store since encode_vectored chunks 64 of each field at a time. Loads are scalar because the `A`s aren't adjacent in memory when encoding `Vec<(A, B)>`, but the output looks like `[A, A, A, ...]`, so a vectored store can be used.
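A simplified sketch of the idea (not bitcode's encode_vectored; the function name and tuple type are made up): each inner loop touches one field of a chunk, so that field's output is contiguous and the stores can be vectorized even though the loads from the tuples are strided.

```rust
// Hypothetical field-at-a-time encoding of &[(u32, u8)] in chunks of 64.
// Writing all the `a`s, then all the `b`s, produces [A, A, A, ...] output
// that LLVM can store with vector instructions.
fn encode_pairs(pairs: &[(u32, u8)], out_a: &mut Vec<u8>, out_b: &mut Vec<u8>) {
    for chunk in pairs.chunks(64) {
        for &(a, _) in chunk {
            out_a.extend_from_slice(&a.to_le_bytes());
        }
        for &(_, b) in chunk {
            out_b.push(b);
        }
    }
}
```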
> the alpha version on benchmarks looks nuts :)
Thanks :)
I also plan to utilize autovectorization in decode with something like decode_vectored_in_place but it's currently limited by the inability to decode enums in place with MaybeUninit. Hopefully Rust will add a way to initialize the variant of an uninitialized enum.
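To illustrate the limitation (my reading of it, with a made-up struct): a struct can be initialized in place by writing each field through a raw pointer into a MaybeUninit slot, but there's no stable way to pick and initialize an enum variant the same way.

```rust
use std::mem::MaybeUninit;
use std::ptr::addr_of_mut;

struct Pair {
    a: u32,
    b: u64,
}

// Fields of a struct can be initialized in place one at a time; the caller
// can assume_init once every field has been written. Enums have no stable
// equivalent for selecting a variant inside a MaybeUninit<E>.
fn init_pair_in_place(slot: &mut MaybeUninit<Pair>, a: u32, b: u64) {
    let p = slot.as_mut_ptr();
    unsafe {
        addr_of_mut!((*p).a).write(a);
        addr_of_mut!((*p).b).write(b);
    }
}
```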
I'm interested in hearing opinions about making `bitcode::encode`/`bitcode::decode` use a thread-local `bitcode::Buffer`. This is how we internally use bitcode, so I'm considering upstreaming it.
```rust
thread_local! {
    // One Buffer per thread, reused across calls to avoid reallocating.
    static BUFFER: std::cell::RefCell<Buffer> = Default::default();
}

pub fn encode<T: Encode + ?Sized>(t: &T) -> Vec<u8> {
    BUFFER.with(|b| b.borrow_mut().encode(t).to_vec())
}

pub fn decode<'a, T: Decode<'a>>(bytes: &'a [u8]) -> Result<T, Error> {
    BUFFER.with(|b| b.borrow_mut().decode(bytes))
}
```
Pros:
Cons:
You could always opt out by creating a new buffer for each encode/decode call.
I just released `bitcode = "=0.6.0-beta.1"`.
- `bitcode::Buffer`
- `bitcode::serialize` and `bitcode::deserialize` by 30-40%

@caibear https://docs.rs/bitcode/0.5.0/bitcode/struct.Buffer.html#method.deserialize

So buffer.deserialize() is now removed? I'm guessing if I want to deserialize with a reusable buffer, I have to use encode/decode?
Yes this was removed. Currently you have to use Encode/Decode if you want to reuse allocations.
`Buffer::{serialize, deserialize}` might be reimplemented in a future version, but doing so is non-trivial.
Note: Saving allocations is an optimization that's usually 10% faster on large messages and 50% faster on small messages. If you care about speed, you probably want to use Encode/Decode anyway because they're usually 2-5x faster.
@caibear Ahh I see. I'm using hecs (an ECS) and other libs and would need to derive Encode/Decode on them as well, so I'll stick with 0.5.0 in the meantime.
Have you benchmarked 0.6 with allocations against 0.5 without allocations for your use case? 0.6 with allocations might be faster if your messages are large enough.
@caibear Wow, 0.6 is really that much better, huh? I have not benchmarked just yet. What do you mean by "large enough" in terms of size? My game sends position updates 30 times a sec and the average size is ~3-5 kilobytes.
0.6 is generally faster/smaller than 0.5 across all benchmarks. The question here is if the gain in speed outweighs the additional allocations. I just benchmarked deserializing 5kb of messages and 0.6 is 30% faster. I don't know your exact structs, but this should be a good baseline.
On a side note: I also benchmarked 0.6 derive and it's 7x faster than 0.6 serde.
I rewrote the entire library (docs). It's been tested/fuzzed, but it still needs some work before release.
New features:
- `&str`
Alpha release:
Beta release:
Full release:
- `bitcode::Buffer`
- `Send + Sync`
- `#[bitcode(with_serde)]` (can only do `bitcode::serialize` right now)
- `#![forbid(unsafe_code)]` feature flag (serde only and slightly slower)
- `&[u8]`