SoftbearStudios / bitcode

A binary encoder/decoder for Rust
https://crates.io/crates/bitcode/
MIT License

Bitcode rewrite #19

Closed caibear closed 6 months ago

caibear commented 6 months ago

I rewrote the entire library (docs). It's been tested/fuzzed, but it still needs some work before release.

```toml
# If you want to try it early
bitcode = "=0.6.0-beta.1"
```

New features:

Alpha release:

Beta release:

Full release:

caibear commented 6 months ago

Benchmarks for those who are interested. The previous version of bitcode isn't shown here, but its speed is similar to bincode, its size is similar to the new bitcode, and its compressed size is about 20% worse than bincode.

| Format | Compression | Size (bytes) | Serialize (ns) | Deserialize (ns) |
|---|---|---|---|---|
| bincode | (none) | 49.1 | 35 | 115 |
| bincode | lz4 | 16.1 | 86 | 115 |
| bincode | deflate-fast | 13.1 | 166 | 176 |
| bincode | deflate-best | 8.9 | 3708 | 141 |
| bincode | zstd-0 | 12.4 | 172 | 146 |
| bincode | zstd-22 | 8.5 | 32312 | 133 |
| bincode-varint | (none) | 22.3 | 36 | 116 |
| bincode-varint | lz4 | 10.8 | 71 | 119 |
| bincode-varint | deflate-fast | 10.1 | 146 | 165 |
| bincode-varint | deflate-best | 8.0 | 2664 | 153 |
| bincode-varint | zstd-0 | 8.2 | 123 | 131 |
| bincode-varint | zstd-22 | 7.8 | 23939 | 136 |
| bitcode | (none) | 16.9 | 29 | 104 |
| bitcode | lz4 | 9.8 | 42 | 108 |
| bitcode | deflate-fast | 8.3 | 86 | 137 |
| bitcode | deflate-best | 6.8 | 1661 | 125 |
| bitcode | zstd-0 | 7.1 | 58 | 115 |
| bitcode | zstd-22 | 6.2 | 24713 | 115 |
| bitcode-derive | (none) | 16.9 | 10 | 12 |
| bitcode-derive | lz4 | 9.7 | 25 | 17 |
| bitcode-derive | deflate-fast | 8.3 | 68 | 45 |
| bitcode-derive | deflate-best | 6.8 | 1597 | 33 |
| bitcode-derive | zstd-0 | 7.1 | 40 | 24 |
| bitcode-derive | zstd-22 | 6.2 | 24905 | 25 |
tbillington commented 6 months ago

> doesn't require any hints (determines them at runtime)

This is remarkable! Is there any particular part of the branch you could point me to so I could see how it's done?

Side question: do you see any benefit in hinting on top of what's determined at runtime?

Hope it's okay to ask the questions in your PR :) thanks for making this library

caibear commented 6 months ago

> doesn't require any hints (determines them at runtime)
>
> This is remarkable! Is there any particular part of the branch you could point me to so I could see how it's done?

https://github.com/SoftbearStudios/bitcode/blob/5bdc22ba943d0ba8de092a763327b8167656611f/src/pack.rs
https://github.com/SoftbearStudios/bitcode/blob/5bdc22ba943d0ba8de092a763327b8167656611f/src/pack_ints.rs
https://github.com/SoftbearStudios/bitcode/blob/5bdc22ba943d0ba8de092a763327b8167656611f/src/f32.rs (albeit this one requires the aid of a compression algorithm like deflate/lz4/zstd to do anything)

> Side question: do you see any benefit in hinting on top of what's determined at runtime?

After adding hints to bitcode, I didn't use them as much as I thought I would because it was tedious. The kinds of "packing" I'm using in this new version are designed to quickly determine if they're applicable and then pack the data. I'm probably not going to add manual hints back because most people don't benefit from them (me included). Also, this new version prioritizes working well with general-purpose compression, which some of the old hints got in the way of (e.g. #[bitcode_hint(ascii)] made characters 7 bits wide, which confused byte-wise compression algorithms).
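For context, a 0.5-style manual hint looked roughly like the sketch below; the struct and field are made up for illustration, only the #[bitcode_hint(ascii)] attribute comes from the old API:

```rust
// Hypothetical 0.5-style usage, for illustration only.
#[derive(bitcode::Encode, bitcode::Decode)]
struct ChatMessage {
    // Packed each char into 7 bits: smaller uncompressed, but it misaligned
    // bytes and confused byte-wise compressors like deflate/lz4/zstd.
    #[bitcode_hint(ascii)]
    text: String,
}
```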

> Hope it's okay to ask the questions in your PR :) thanks for making this library

Yeah! I made this PR public before it was finished to see what people think about it.

LevitatingBusinessMan commented 6 months ago

Very impressive!!!

caibear commented 6 months ago

I just released bitcode = "=0.6.0-alpha.1" which has the core features done.

forrestli74 commented 6 months ago

Good work! I read through the code and have a few thoughts:

  1. Looks like enums with more than 256 variants are not supported; will they be supported in the full release?
  2. This definitely is a huge improvement over 0.5 for large messages with a lot of repeated patterns, but I'm guessing it won't do as well as 0.5 for small messages. It would be nice to measure how it compares against 0.5 in terms of speed and size for messages that have few long Vecs but might have deep structures. If it's worse than 0.5 for this use case, would it make sense to keep 0.5 around and release this in a new crate?
  3. Would it make sense to optimize the case where the len of a Vec type is always 0? If a Vec type always has a len of 0, don't bother storing the elements. The same goes for enum variants (I didn't understand the code well enough to know whether it already does this). This is beneficial for deprecated data.
  4. I'm personally interested in this crate but want it to support schema evolution. Opened a ticket: [#21].
  5. I have a hard time imagining how recursive data can be supported. If it's hard (in terms of effort or ser/de speed), I think you might not want to support it at all.
caibear commented 6 months ago
> 1. Looks like enums with more than 256 variants are not supported; will they be supported in the full release?

Not unless multiple people complain about it, and even then it would probably have suboptimal performance. I've been avoiding it because calculating a histogram over an unbounded number of variants would require a HashMap.
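A rough sketch of that tradeoff (not the crate's actual code): with at most 256 variants the tag histogram fits in a fixed array, while an unbounded variant count forces a heap-allocated map.

```rust
use std::collections::HashMap;

// Tag histogram when variant tags are known to fit in a u8.
fn histogram_bounded(tags: &[u8]) -> [u32; 256] {
    let mut counts = [0u32; 256];
    for &t in tags {
        counts[t as usize] += 1;
    }
    counts
}

// With an unbounded number of variants, a HashMap is needed instead.
fn histogram_unbounded(tags: &[u32]) -> HashMap<u32, u32> {
    let mut counts = HashMap::new();
    for &t in tags {
        *counts.entry(t).or_insert(0) += 1;
    }
    counts
}
```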

> 2. This definitely is a huge improvement over 0.5 for large messages with a lot of repeated patterns, but I'm guessing it won't do as well as 0.5 for small messages. It would be nice to measure how it compares against 0.5 in terms of speed and size for messages that have few long Vecs but might have deep structures.

You are right that speed is worse on small messages, but I don't see this as a problem since small messages are probably not a bottleneck anyway. The main overhead is allocating a Vec for each field, but you can keep an EncodeBuffer around to reuse allocations. On the size front, small messages might be bigger uncompressed, but smaller if you have cross-message compression, since 0.5 doesn't compress well.
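A minimal sketch of reusing allocations across messages, assuming the Buffer API shown in the thread_local snippet later in this thread (the helper function is hypothetical):

```rust
use bitcode::{Buffer, Encode};

// Reuse one Buffer so its internal per-field Vecs are allocated once
// and recycled across calls instead of reallocated per message.
fn encode_all<T: Encode>(messages: &[T]) -> Vec<Vec<u8>> {
    let mut buffer = Buffer::default();
    messages
        .iter()
        .map(|m| buffer.encode(m).to_vec())
        .collect()
}
```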

> If it's worse than 0.5 for this use case, would it make sense to keep 0.5 around and release this in a new crate?

I thought about this, but I don't want to maintain two crates. Someone can always fork 0.5 and maintain it if they want.

> 3. Would it make sense to optimize the case where the len of a Vec type is always 0? If a Vec type always has a len of 0, don't bother storing the elements. The same goes for enum variants (I didn't understand the code well enough to know whether it already does this). This is beneficial for deprecated data.

I considered this (including integers that are always 0), but it allows for a zip bomb. Currently, deserialization memory usage is proportional to input size. (Edit: actually, enum variants do this, but only with the derive macro.)

> 4. I'm personally interested in this crate but want it to support schema evolution. Opened a ticket: FR: allow schema evolution (in or after 0.6) #21.

See my comment.

> 5. I have a hard time imagining how recursive data can be supported. If it's hard (in terms of effort or ser/de speed), I think you might not want to support it at all.

One way would be to have a max recursion depth N and have N nested encoders.
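For example, a type like the one below is what's at issue; under that scheme, each level of nesting up to N would get its own encoder (illustrative only, recursion isn't supported yet):

```rust
// A recursive type: encoding it needs some bound on nesting depth.
enum Tree {
    Leaf(u32),
    Node(Box<Tree>, Box<Tree>),
}
```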

forrestli74 commented 6 months ago
  1. I think enums with more than 256 variants can easily be supported with a compile-time if branch.
  2. For 0.5 vs 0.6 on small messages, is there a benchmark to measure this?
  3. For Vecs that only ever have zero length, I think a zip bomb can still be avoided. I'll give an example:

     ```rust
     struct A {
         // ... a lot of nested structures
     }
     struct B(Vec<A>);
     fn serialize(b: &B) { /* ... */ }
     ```

     If I understand correctly, if b.0.len() is always 0, we would still have a lot of data storing A's definition, which is unnecessary. We still store b.0.len() as 0, which takes some space, so a zip bomb is not possible.

  4. IMO, zip bombs should not be a consideration in the encoding spec, but in the implementation. Interesting read on how Cap'n Proto handles them: https://capnproto.org/encoding.html#amplification-attack
caibear commented 6 months ago
> 1. I think enums with more than 256 variants can easily be supported with a compile-time if branch.

Yes, certainly for derive (although this would require some code duplication). The tricky part is serde, where you don't know the total number of variants ahead of time (I suppose it could always encode them as a u32 or something, but it's not elegant).
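A sketch of the compile-time branch idea for the derive (the constant and function are illustrative, not the macro's actual output):

```rust
// The derive macro knows the variant count statically, so it can pick
// the tag width at compile time.
const NUM_VARIANTS: usize = 300;

fn encode_variant_tag(tag: usize, out: &mut Vec<u8>) {
    if NUM_VARIANTS <= 256 {
        // Current case: every tag fits in a single byte.
        out.push(tag as u8);
    } else {
        // Fallback for >256 variants, e.g. a fixed u32 tag.
        out.extend_from_slice(&(tag as u32).to_le_bytes());
    }
}
```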

> 2. For 0.5 vs 0.6 on small messages, is there a benchmark to measure this?

No, because they're both called bitcode and they don't like to be imported at the same time. My comments about the compressed size of small messages are based on us testing both versions on our projects and eyeballing the results. Feel free to do your own benchmarks and share the results here!

> 3. For Vecs that only ever have zero length, I think a zip bomb can still be avoided. I'll give an example:
>
> ```rust
> struct A {
>     // ... a lot of nested structures
> }
> struct B(Vec<A>);
> fn serialize(b: &B) { /* ... */ }
> ```
>
> If I understand correctly, if b.0.len() is always 0, we would still have a lot of data storing A's definition, which is unnecessary. We still store b.0.len() as 0, which takes some space, so a zip bomb is not possible.

Consider Vec<Vec<u8>> where all the inner Vecs are 0 length.

> 4. IMO, zip bombs should not be a consideration in the encoding spec, but in the implementation. Interesting read on how Cap'n Proto handles them: https://capnproto.org/encoding.html#amplification-attack

Yeah, this isn't a bad idea, but I didn't really want to worry about it for the initial version. We should revisit this once 0.6 is released.

forrestli74 commented 6 months ago

> Consider Vec<Vec<u8>> where all the inner Vecs are 0 length.

Let's use Vec<Vec<A>> as an example. This still isn't a problem; say it's [[],[],[]]. We store 3, 0, 0, 0, but we shouldn't store anything about A. Maybe I'm misunderstanding it and this is the current behavior?

caibear commented 6 months ago

> Consider Vec<Vec<u8>> where all the inner Vecs are 0 length.
>
> Let's use Vec<Vec<A>> as an example. This still isn't a problem; say it's [[],[],[]]. We store 3, 0, 0, 0, but we shouldn't store anything about A. Maybe I'm misunderstanding it and this is the current behavior?

Oh, nothing is currently stored for A even though the encoders exist. I thought you meant the 0s wouldn't get stored because all the Vecs are empty.

tbillington commented 6 months ago

I'm curious how you're testing auto-vectorisation. Are you manually enabling CPU features with rustflags, or using something like target-cpu=native?

the alpha version on benchmarks looks nuts :)

caibear commented 6 months ago

> I'm curious how you're testing auto-vectorisation. Are you manually enabling CPU features with rustflags, or using something like target-cpu=native?

Just target-cpu=native in .cargo/config.toml. x86_64 always has SSE2, so even without target-cpu=native there is still some autovectorization.
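For reference, that setup usually looks something like this (a typical .cargo/config.toml, not specific to bitcode):

```toml
[build]
rustflags = ["-C", "target-cpu=native"]
```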

The autovectorization mainly improves encode_vectored by roughly halving the number of mov instructions it needs. LLVM can replace scalar load + scalar store with scalar load + vectored store, since encode_vectored processes 64 of each field at a time. The loads stay scalar because the As aren't adjacent in memory when encoding Vec<(A, B)>, but the output looks like [A, A, A, ...], so a vectored store can be used.
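A minimal sketch of that access pattern (not bitcode's actual code): fields are read one tuple at a time, but each field is written to its own contiguous buffer, which is what lets LLVM vectorize the stores.

```rust
// Encoding a chunk of (A, B) tuples field-by-field into separate outputs.
fn encode_chunk(input: &[(u32, u8)], out_a: &mut Vec<u32>, out_b: &mut Vec<u8>) {
    for &(a, b) in input {
        out_a.push(a); // scalar loads: A values aren't adjacent in the input...
        out_b.push(b); // ...but the outputs are [A, A, ...] and [B, B, ...], so stores can vectorize
    }
}
```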

> the alpha version on benchmarks looks nuts :)

Thanks :)

caibear commented 6 months ago

I also plan to utilize autovectorization in decode with something like decode_vectored_in_place, but it's currently limited by the inability to decode enums in place with MaybeUninit. Hopefully Rust will add a way to initialize the variant of an uninitialized enum.
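A sketch of the in-place idea for a primitive type (a hypothetical function, not from the crate); this works for plain data like u32, but there's no stable way to set the variant of an uninitialized enum, which is the limitation above:

```rust
use std::mem::MaybeUninit;

// Decode little-endian u32s directly into uninitialized output slots.
fn decode_u32s_in_place(input: &[u8], out: &mut [MaybeUninit<u32>]) {
    for (chunk, slot) in input.chunks_exact(4).zip(out.iter_mut()) {
        slot.write(u32::from_le_bytes(chunk.try_into().unwrap()));
    }
}
```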

caibear commented 6 months ago

I'm interested in hearing opinions about making bitcode::encode/bitcode::decode use a thread-local bitcode::Buffer. This is how we use bitcode internally, so I'm considering upstreaming it.

```rust
thread_local! {
    static BUFFER: std::cell::RefCell<Buffer> = Default::default();
}

pub fn encode<T: Encode + ?Sized>(t: &T) -> Vec<u8> {
    BUFFER.with(|b| b.borrow_mut().encode(t).to_vec())
}

pub fn decode<'a, T: Decode<'a> + ?Sized>(bytes: &'a [u8]) -> Result<T, Error> {
    BUFFER.with(|b| b.borrow_mut().decode(bytes))
}
```

Pros:

Cons:

You could always opt out by creating a new buffer for each encode/decode call.

caibear commented 6 months ago

I just released bitcode = "=0.6.0-beta.1".

jestarray commented 4 months ago

@caibear https://docs.rs/bitcode/0.5.0/bitcode/struct.Buffer.html#method.deserialize So buffer.deserialize() is now removed? I'm guessing if I want to deserialize with a reusable buffer, I have to use encode/decode?

caibear commented 4 months ago

> @caibear https://docs.rs/bitcode/0.5.0/bitcode/struct.Buffer.html#method.deserialize So buffer.deserialize() is now removed? I'm guessing if I want to deserialize with a reusable buffer, I have to use encode/decode?

Yes, this was removed. Currently you have to use Encode/Decode if you want to reuse allocations. Buffer::{serialize, deserialize} might be reimplemented in a future version, but doing so is non-trivial.

Note: Saving allocations is an optimization that's usually 10% faster on large messages and 50% faster on small messages. If you care about speed, you probably want to use Encode/Decode anyway because they're usually 2-5x faster.

jestarray commented 4 months ago

@caibear Ahh I see. I'm using hecs (an ECS) and other libs and would need to derive Encode/Decode on them as well, so I'll stick with 0.5.0 in the meantime.

caibear commented 4 months ago

> @caibear Ahh I see. I'm using hecs (an ECS) and other libs and would need to derive Encode/Decode on them as well, so I'll stick with 0.5.0 in the meantime.

Have you benchmarked 0.6 with allocations against 0.5 without allocations for your use case? 0.6 with allocations might be faster if your messages are large enough.

jestarray commented 4 months ago

@caibear Wow, 0.6 is really that much better, huh? I haven't benchmarked it just yet. What do you mean by "large enough" in terms of size? My game sends position updates 30 times a second and the average size is ~3-5 kilobytes.

caibear commented 4 months ago

> @caibear Wow, 0.6 is really that much better, huh? I haven't benchmarked it just yet. What do you mean by "large enough" in terms of size? My game sends position updates 30 times a second and the average size is ~3-5 kilobytes.

0.6 is generally faster/smaller than 0.5 across all benchmarks. The question here is whether the gain in speed outweighs the additional allocations. I just benchmarked deserializing 5 KB of messages and 0.6 is 30% faster. I don't know your exact structs, but this should be a good baseline.

On a side note: I also benchmarked 0.6 derive and it's 7x faster than 0.6 serde.
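For anyone comparing, a minimal sketch of the derive-based usage being benchmarked, assuming the 0.6 top-level encode/decode functions shown earlier in this thread (the struct is illustrative):

```rust
use bitcode::{Decode, Encode};

// Native derive (no serde): the faster path mentioned above.
#[derive(Encode, Decode, Debug, PartialEq)]
struct Position {
    x: f32,
    y: f32,
}

fn roundtrip(p: &Position) -> Position {
    let bytes = bitcode::encode(p);
    bitcode::decode(&bytes).expect("bytes were just produced by encode")
}
```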