SoftbearStudios / bitcode

A binary encoder/decoder for Rust
https://crates.io/crates/bitcode/
MIT License

Bitcode rewrite #19

Closed caibear closed 6 months ago

caibear commented 6 months ago

I rewrote the entire library (docs). It's been tested/fuzzed, but it still needs some work before release.

```toml
# If you want to try it early
bitcode = "=0.6.0-beta.1"
```

New features:

Alpha release:

Beta release:

Full release:

caibear commented 6 months ago

Benchmarks for those who are interested. The previous version of bitcode isn't shown here, but its speed is similar to bincode, its size is similar to the new bitcode, and its compressed size is about 20% worse than bincode.

| Format | Compression | Size (bytes) | Serialize (ns) | Deserialize (ns) |
|---|---|---|---|---|
| bincode | (none) | 49.1 | 35 | 115 |
| bincode | lz4 | 16.1 | 86 | 115 |
| bincode | deflate-fast | 13.1 | 166 | 176 |
| bincode | deflate-best | 8.9 | 3708 | 141 |
| bincode | zstd-0 | 12.4 | 172 | 146 |
| bincode | zstd-22 | 8.5 | 32312 | 133 |
| bincode-varint | (none) | 22.3 | 36 | 116 |
| bincode-varint | lz4 | 10.8 | 71 | 119 |
| bincode-varint | deflate-fast | 10.1 | 146 | 165 |
| bincode-varint | deflate-best | 8.0 | 2664 | 153 |
| bincode-varint | zstd-0 | 8.2 | 123 | 131 |
| bincode-varint | zstd-22 | 7.8 | 23939 | 136 |
| bitcode | (none) | 16.9 | 29 | 104 |
| bitcode | lz4 | 9.8 | 42 | 108 |
| bitcode | deflate-fast | 8.3 | 86 | 137 |
| bitcode | deflate-best | 6.8 | 1661 | 125 |
| bitcode | zstd-0 | 7.1 | 58 | 115 |
| bitcode | zstd-22 | 6.2 | 24713 | 115 |
| bitcode-derive | (none) | 16.9 | 10 | 12 |
| bitcode-derive | lz4 | 9.7 | 25 | 17 |
| bitcode-derive | deflate-fast | 8.3 | 68 | 45 |
| bitcode-derive | deflate-best | 6.8 | 1597 | 33 |
| bitcode-derive | zstd-0 | 7.1 | 40 | 24 |
| bitcode-derive | zstd-22 | 6.2 | 24905 | 25 |
tbillington commented 6 months ago

> doesn't require any hints (determines them at runtime)

This is remarkable! Is there any particular part of the branch you could point me to so I could see how it's done?

Side question: do you see any benefit in hinting on top of what's determined at runtime?

Hope it's okay to ask the questions in your PR :) thanks for making this library

caibear commented 6 months ago

> doesn't require any hints (determines them at runtime)
>
> This is remarkable! Is there any particular part of the branch you could point me to so I could see how it's done?

https://github.com/SoftbearStudios/bitcode/blob/5bdc22ba943d0ba8de092a763327b8167656611f/src/pack.rs
https://github.com/SoftbearStudios/bitcode/blob/5bdc22ba943d0ba8de092a763327b8167656611f/src/pack_ints.rs
https://github.com/SoftbearStudios/bitcode/blob/5bdc22ba943d0ba8de092a763327b8167656611f/src/f32.rs (albeit this one requires the aid of a compression algorithm like deflate/lz4/zstd to do anything)

> Side question: do you see any benefit in hinting on top of what's determined at runtime?

After adding hints to bitcode, I didn't use them as much as I thought I would because it was tedious. The kinds of "packing" I'm using in this new version are designed to quickly determine if they're applicable and then pack the data. I'm probably not going to add manual hints back because most people don't benefit from them (me included). Also, this new version prioritizes working well with general-purpose compression, which some of the old hints got in the way of (e.g. #[bitcode_hint(ascii)] made characters 7 bits wide, which confused byte-wise compression algorithms).
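For context, a 0.5-style manual hint looked roughly like the sketch below; the struct and field are made up for illustration, only the #[bitcode_hint(ascii)] attribute comes from the old API:

```rust
// Hypothetical 0.5-style usage, for illustration only.
#[derive(bitcode::Encode, bitcode::Decode)]
struct ChatMessage {
    // Packed each char into 7 bits: smaller uncompressed, but it misaligned
    // bytes and confused byte-wise compressors like deflate/lz4/zstd.
    #[bitcode_hint(ascii)]
    text: String,
}
```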

> Hope it's okay to ask the questions in your PR :) thanks for making this library

Yeah! I made this PR public before it was finished to see what people think about it.

LevitatingBusinessMan commented 6 months ago

Very impressive!!!

caibear commented 6 months ago

I just released bitcode = "=0.6.0-alpha.1" which has the core features done.

forrestli74 commented 6 months ago

Good work! I read through the code and have a few thoughts:

  1. Looks like enums with more than 256 variants are not supported; will they be supported in the full release?
  2. This definitely is a huge improvement over 0.5 for large messages with a lot of repeated patterns, but I'm guessing it won't do as well as 0.5 for small messages. It would be nice to measure how it compares against 0.5 in terms of speed and size for messages that have few long Vecs but might have deep structures. If it's worse than 0.5 for this use case, would it make sense to keep 0.5 around and release this in a new crate?
  3. Would it make sense to optimize the case where the len of a Vec type is always 0? If a Vec type always has a len of 0, don't bother storing the elements. The same goes for enum variants (I didn't understand the code well enough to know whether it already does this). This is beneficial for deprecated data.
  4. I'm personally interested in this crate but want it to support schema evolution. Opened a ticket: [#21].
  5. I have a hard time imagining how recursive data can be supported. If it's hard (in terms of effort or ser/de speed), I think you might not want to support it at all.
caibear commented 6 months ago
> 1. Looks like enums with more than 256 variants are not supported; will they be supported in the full release?

Not unless multiple people complain about it, and even then it would probably have suboptimal performance. I've been avoiding it because calculating a histogram over an unbounded number of variants would require a HashMap.
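A rough sketch of that tradeoff (not the crate's actual code): with at most 256 variants the tag histogram fits in a fixed array, while an unbounded variant count forces a heap-allocated map.

```rust
use std::collections::HashMap;

// Tag histogram when variant tags are known to fit in a u8.
fn histogram_bounded(tags: &[u8]) -> [u32; 256] {
    let mut counts = [0u32; 256];
    for &t in tags {
        counts[t as usize] += 1;
    }
    counts
}

// With an unbounded number of variants, a HashMap is needed instead.
fn histogram_unbounded(tags: &[u32]) -> HashMap<u32, u32> {
    let mut counts = HashMap::new();
    for &t in tags {
        *counts.entry(t).or_insert(0) += 1;
    }
    counts
}
```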

> 2. This definitely is a huge improvement over 0.5 for large messages with a lot of repeated patterns, but I'm guessing it won't do as well as 0.5 for small messages. It would be nice to measure how it compares against 0.5 in terms of speed and size for messages that have few long Vecs but might have deep structures.

You are right that speed is worse on small messages, but I don't see this as a problem since small messages are probably not a bottleneck anyway. The main overhead is allocating a Vec for each field, but you can keep an EncodeBuffer around to reuse allocations. On the size front, small messages might be bigger uncompressed, but smaller if you have cross-message compression, since 0.5 doesn't compress well.
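A minimal sketch of reusing allocations across messages, assuming the Buffer API shown in the thread_local snippet later in this thread (the helper function is hypothetical):

```rust
use bitcode::{Buffer, Encode};

// Reuse one Buffer so its internal per-field Vecs are allocated once
// and recycled across calls instead of reallocated per message.
fn encode_all<T: Encode>(messages: &[T]) -> Vec<Vec<u8>> {
    let mut buffer = Buffer::default();
    messages
        .iter()
        .map(|m| buffer.encode(m).to_vec())
        .collect()
}
```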

> If it's worse than 0.5 for this use case, would it make sense to keep 0.5 around and release this in a new crate?

I thought about this, but I don't want to maintain two crates. Someone can always fork 0.5 and maintain it if they want.

> 3. Would it make sense to optimize the case where the len of a Vec type is always 0? If a Vec type always has a len of 0, don't bother storing the elements. The same goes for enum variants (I didn't understand the code well enough to know whether it already does this). This is beneficial for deprecated data.

I considered this (including integers that are always 0), but it allows for a zip bomb. Currently, deserialization memory usage is proportional to input size. (Edit: actually, enum variants do this, but only with the derive macro.)

> 4. I'm personally interested in this crate but want it to support schema evolution. Opened a ticket: FR: allow schema evolution (in or after 0.6) #21.

See my comment.

> 5. I have a hard time imagining how recursive data can be supported. If it's hard (in terms of effort or ser/de speed), I think you might not want to support it at all.

One way would be to have a max recursion depth N and have N nested encoders.
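For example, a type like the one below is what's at issue; under that scheme, each level of nesting up to N would get its own encoder (illustrative only, recursion isn't supported yet):

```rust
// A recursive type: encoding it needs some bound on nesting depth.
enum Tree {
    Leaf(u32),
    Node(Box<Tree>, Box<Tree>),
}
```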

forrestli74 commented 6 months ago
  1. I think enums with more than 256 variants can easily be supported with a compile-time if branch.
  2. For 0.5 vs 0.6 on small messages, is there a benchmark to measure this?
  3. For Vecs that only ever have zero length, I think a zip bomb can still be avoided. I'll give an example:

     ```rust
     struct A {
         // ... a lot of nested structures
     }
     struct B(Vec<A>);
     fn serialize(b: &B) { /* ... */ }
     ```

     If I understand correctly, if b.0.len() is always 0, we would still have a lot of data storing A's definition, which is unnecessary. We still store b.0.len() as 0, which takes some space, so a zip bomb is not possible.

  4. IMO, zip bombs should not be a consideration in the encoding spec, but in the implementation. Interesting read on how Cap'n Proto handles them: https://capnproto.org/encoding.html#amplification-attack
caibear commented 6 months ago
> 1. I think enums with more than 256 variants can easily be supported with a compile-time if branch.

Yes, certainly for derive (although this would require some code duplication). The tricky part is serde, where you don't know the total number of variants ahead of time (I suppose it could always encode them as a u32 or something, but it's not elegant).
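A sketch of the compile-time branch idea for the derive (the constant and function are illustrative, not the macro's actual output):

```rust
// The derive macro knows the variant count statically, so it can pick
// the tag width at compile time.
const NUM_VARIANTS: usize = 300;

fn encode_variant_tag(tag: usize, out: &mut Vec<u8>) {
    if NUM_VARIANTS <= 256 {
        // Current case: every tag fits in a single byte.
        out.push(tag as u8);
    } else {
        // Fallback for >256 variants, e.g. a fixed u32 tag.
        out.extend_from_slice(&(tag as u32).to_le_bytes());
    }
}
```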

> 2. For 0.5 vs 0.6 on small messages, is there a benchmark to measure this?

No, because they're both called bitcode and they don't like to be imported at the same time. My comments about the compressed size of small messages are based on us testing both versions on our projects and eyeballing the results. Feel free to do your own benchmarks and share the results here!

> 3. For Vecs that only ever have zero length, I think a zip bomb can still be avoided. I'll give an example:
>
> ```rust
> struct A {
>     // ... a lot of nested structures
> }
> struct B(Vec<A>);
> fn serialize(b: &B) { /* ... */ }
> ```
>
> If I understand correctly, if b.0.len() is always 0, we would still have a lot of data storing A's definition, which is unnecessary. We still store b.0.len() as 0, which takes some space, so a zip bomb is not possible.

Consider Vec<Vec<u8>> where all the inner Vecs are 0 length.

> 4. IMO, zip bombs should not be a consideration in the encoding spec, but in the implementation. Interesting read on how Cap'n Proto handles them: https://capnproto.org/encoding.html#amplification-attack

Yeah, this isn't a bad idea, but I didn't really want to worry about it for the initial version. We should revisit this once 0.6 is released.

forrestli74 commented 6 months ago

> Consider Vec<Vec<u8>> where all the inner Vecs are 0 length.

Let's use Vec<Vec<A>> as an example. This still isn't a problem; say it's [[],[],[]]. We store 3, 0, 0, 0, but we shouldn't store anything about A. Maybe I'm misunderstanding it and this is the current behavior?

caibear commented 6 months ago

> Consider Vec<Vec<u8>> where all the inner Vecs are 0 length.
>
> Let's use Vec<Vec<A>> as an example. This still isn't a problem; say it's [[],[],[]]. We store 3, 0, 0, 0, but we shouldn't store anything about A. Maybe I'm misunderstanding it and this is the current behavior?

Oh, nothing is currently stored for A even though the encoders exist. I thought you meant the 0s wouldn't get stored because all the Vecs are empty.

tbillington commented 6 months ago

I'm curious how you're testing auto-vectorisation. Are you manually enabling CPU features with rustflags, or using something like target-cpu=native?

the alpha version on benchmarks looks nuts :)

caibear commented 6 months ago

> I'm curious how you're testing auto-vectorisation. Are you manually enabling CPU features with rustflags, or using something like target-cpu=native?

Just target-cpu=native in .cargo/config.toml. x86_64 always has SSE2, so even without target-cpu=native there is still some autovectorization.
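For reference, that setup usually looks something like this (a typical .cargo/config.toml, not specific to bitcode):

```toml
[build]
rustflags = ["-C", "target-cpu=native"]
```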

The autovectorization mainly improves encode_vectored by roughly halving the number of mov instructions it needs. LLVM can replace scalar load + scalar store with scalar load + vectored store, since encode_vectored processes 64 of each field at a time. The loads stay scalar because the As aren't adjacent in memory when encoding Vec<(A, B)>, but the output looks like [A, A, A, ...], so a vectored store can be used.
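A minimal sketch of that access pattern (not bitcode's actual code): fields are read one tuple at a time, but each field is written to its own contiguous buffer, which is what lets LLVM vectorize the stores.

```rust
// Encoding a chunk of (A, B) tuples field-by-field into separate outputs.
fn encode_chunk(input: &[(u32, u8)], out_a: &mut Vec<u32>, out_b: &mut Vec<u8>) {
    for &(a, b) in input {
        out_a.push(a); // scalar loads: A values aren't adjacent in the input...
        out_b.push(b); // ...but the outputs are [A, A, ...] and [B, B, ...], so stores can vectorize
    }
}
```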

> the alpha version on benchmarks looks nuts :)

Thanks :)

caibear commented 6 months ago

I also plan to utilize autovectorization in decode with something like decode_vectored_in_place, but it's currently limited by the inability to decode enums in place with MaybeUninit. Hopefully Rust will add a way to initialize the variant of an uninitialized enum.
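A sketch of the in-place idea for a primitive type (a hypothetical function, not from the crate); this works for plain data like u32, but there's no stable way to set the variant of an uninitialized enum, which is the limitation above:

```rust
use std::mem::MaybeUninit;

// Decode little-endian u32s directly into uninitialized output slots.
fn decode_u32s_in_place(input: &[u8], out: &mut [MaybeUninit<u32>]) {
    for (chunk, slot) in input.chunks_exact(4).zip(out.iter_mut()) {
        slot.write(u32::from_le_bytes(chunk.try_into().unwrap()));
    }
}
```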

caibear commented 6 months ago

I'm interested in hearing opinions about making bitcode::encode/bitcode::decode use a thread-local bitcode::Buffer. This is how we use bitcode internally, so I'm considering upstreaming it.

```rust
thread_local! {
    static BUFFER: std::cell::RefCell<Buffer> = Default::default();
}

pub fn encode<T: Encode + ?Sized>(t: &T) -> Vec<u8> {
    BUFFER.with(|b| b.borrow_mut().encode(t).to_vec())
}

pub fn decode<'a, T: Decode<'a> + ?Sized>(bytes: &'a [u8]) -> Result<T, Error> {
    BUFFER.with(|b| b.borrow_mut().decode(bytes))
}
```

Pros:

Cons:

You could always opt out by creating a new buffer for each encode/decode call.

caibear commented 6 months ago

I just released bitcode = "=0.6.0-beta.1".

jestarray commented 4 months ago

@caibear https://docs.rs/bitcode/0.5.0/bitcode/struct.Buffer.html#method.deserialize So buffer.deserialize() is now removed? I'm guessing if I want to deserialize with a reusable buffer, I have to use encode/decode?

caibear commented 4 months ago

> @caibear https://docs.rs/bitcode/0.5.0/bitcode/struct.Buffer.html#method.deserialize So buffer.deserialize() is now removed? I'm guessing if I want to deserialize with a reusable buffer, I have to use encode/decode?

Yes, this was removed. Currently you have to use Encode/Decode if you want to reuse allocations. Buffer::{serialize, deserialize} might be reimplemented in a future version, but doing so is non-trivial.

Note: Saving allocations is an optimization that's usually 10% faster on large messages and 50% faster on small messages. If you care about speed, you probably want to use Encode/Decode anyway because they're usually 2-5x faster.

jestarray commented 4 months ago

@caibear Ahh I see. I'm using hecs (an ECS) and other libs and would need to derive Encode/Decode on them as well, so I'll stick with 0.5.0 in the meantime.

caibear commented 4 months ago

> @caibear Ahh I see. I'm using hecs (an ECS) and other libs and would need to derive Encode/Decode on them as well, so I'll stick with 0.5.0 in the meantime.

Have you benchmarked 0.6 with allocations against 0.5 without allocations for your use case? 0.6 with allocations might be faster if your messages are large enough.

jestarray commented 4 months ago

@caibear Wow, 0.6 is really that much better, huh? I haven't benchmarked it just yet. What do you mean by "large enough" in terms of size? My game sends position updates 30 times a second and the average size is ~3-5 kilobytes.

caibear commented 4 months ago

> @caibear Wow, 0.6 is really that much better, huh? I haven't benchmarked it just yet. What do you mean by "large enough" in terms of size? My game sends position updates 30 times a second and the average size is ~3-5 kilobytes.

0.6 is generally faster/smaller than 0.5 across all benchmarks. The question here is whether the gain in speed outweighs the additional allocations. I just benchmarked deserializing 5 KB of messages and 0.6 is 30% faster. I don't know your exact structs, but this should be a good baseline.

On a side note: I also benchmarked 0.6 derive and it's 7x faster than 0.6 serde.
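For anyone comparing, a minimal sketch of the derive-based usage being benchmarked, assuming the 0.6 top-level encode/decode functions shown earlier in this thread (the struct is illustrative):

```rust
use bitcode::{Decode, Encode};

// Native derive (no serde): the faster path mentioned above.
#[derive(Encode, Decode, Debug, PartialEq)]
struct Position {
    x: f32,
    y: f32,
}

fn roundtrip(p: &Position) -> Position {
    let bytes = bitcode::encode(p);
    bitcode::decode(&bytes).expect("bytes were just produced by encode")
}
```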