SoftbearStudios / bitcode

A binary encoder/decoder for Rust
https://crates.io/crates/bitcode/
MIT License
331 stars 17 forks source link

Is it possible to encode recursive types? #35

Open artegoser opened 2 weeks ago

artegoser commented 2 weeks ago

image

image

image

In serde version there is error: type changed

finnbear commented 2 weeks ago

Recursive types are not implemented (edit: for Encode) and would be tricky to implement based on how bitcode@0.6 encodes nested values. You can use bitcode@0.5 instead, which works very differently and has a #[bitcode(recursive)] attribute for Encode.

Edit: for serde, see https://github.com/SoftbearStudios/bitcode/issues/35#issuecomment-2323156232

artegoser commented 2 weeks ago

Thx. So, is this feature not planned? Or will it be possible again someday?

finnbear commented 2 weeks ago

Not currently planned, but the documentation could be more clear about this so I'll leave the issue open.

caibear commented 1 week ago

Can use 0.6 with serde if you remove serde untagged from your enums. Serde untagged breaks most binary formats even if some don't complain right away.

artegoser commented 1 week ago

Yes, it works, only the purpose in reducing the amount of data is not fulfilled. In this case enum identifier is written, which significantly increases the file size. I compared with dlhn, however it cannot decode this. The size becomes larger than my daletpack format, which does not compress well. Is it possible to implement serde(untagged) support in your case?

finnbear commented 1 week ago

Is it possible to implement serde(untagged) support in your case?

No, because #[serde(untagged)] works by trying all possible variants until one deserializes. No version of bitcode is self-describing, so it's possible that the wrong variant would deserialize without an error but the data would be corrupt.

Consider the case of #[serde(untagged)] enum { Big(u16), Small(u8) }. If you serialized Small(5) with serde_json, it would deserialize as Big(5). In bitcode, #[serde(untagged)] enum { Int(usize), Str(String) } could have the worse problem that bytes from a string encoding could be interpreted as an integer. Data corruption is not something we tolerate in bitcode, although most likely the entire message would fail to deserialize.

Making bitcode self-describing (per-instance, to support #[serde(untagged)]) would probably cost much more data than the enum variants that #[serde(untagged)] tries to omit.

The #[serde(untagged)] feature is better for textual formats like JSON, where different types have disjoint prefixes like [ or {.

The size becomes larger than my daletpack format

If you can provide specific details (schema + encoding) of how you can outperform bitcode on size, that would be interesting :eyes:

artegoser commented 1 week ago

If you can provide specific details (schema + encoding) of how you can outperform bitcode on size, that would be interesting 👀

This format is specifically designed for my schema, but it doesn't compress well (I'm not very knowledgeable about how to optimize this), so I'm looking for solutions that are already out there. It's not for all data types, so it won't help you.

For my format, I had the idea of exposing each type (if it's an enum) and writing a separate identifier for each one.

For example

enum Enum {
  First(Other),
  Second(Other)
}

enum Other {
  String(String),
  Num(u8)
}

Enum in this case has 4 binary identifiers for each type

Enum::FirstString Enum::FirstNum

Enum::SecondString Enum::SecondNum

I don't know how procedurally realizable this is, but it should work. I don't know if this structure compresses well either.

I partially implemented this in the daletpack scheme (but still eventually would like to do it for all types and procedurally)

caibear commented 1 week ago

After running your benchmark it looks like various formats take ~300 bytes. I would recommend testing on a larger dataset if you want to accurately benchmark compression. Compression algorithms are generally inefficient in terms of size and speed on input smaller than tens of kilobytes.

Also zlib is just a wrapper of deflate.

DontBreakAlex commented 1 day ago

Hi ! I have a few types that contain serde_json::Valuess, and I also get a panic with type changed. I checked and it seems that serde_json is not using #[serde(untagged)]. What's going on ?

finnbear commented 1 day ago

I recently added a warning to the README about this; serde_json::Value has a custom Serialize implementation that is very similar to an untagged enum.