artegoser opened 2 weeks ago
Recursive types are not implemented (edit: for `Encode`) and would be tricky to implement based on how bitcode@0.6 encodes nested values. You can use bitcode@0.5 instead, which works very differently and has a `#[bitcode(recursive)]` attribute for `Encode`.

Edit: for `serde`, see https://github.com/SoftbearStudios/bitcode/issues/35#issuecomment-2323156232
Thx. So, is this feature not planned? Or will it be possible again someday?
Not currently planned, but the documentation could be clearer about this, so I'll leave the issue open.
You can use 0.6 with serde if you remove `#[serde(untagged)]` from your enums. `serde(untagged)` breaks most binary formats, even if some don't complain right away.
Yes, it works, but the goal of reducing the amount of data is not met: an enum identifier is written for each value, which significantly increases the file size. I compared with dlhn, but it cannot decode this. The size becomes larger than with my daletpack format, which does not compress well. Is it possible to implement serde(untagged) support in your case?
> Is it possible to implement serde(untagged) support in your case?
No, because `#[serde(untagged)]` works by trying all possible variants until one deserializes. No version of bitcode is self-describing, so it's possible that the wrong variant would deserialize without an error and the data would be corrupt.

Consider the case of `#[serde(untagged)] enum { Big(u16), Small(u8) }`. If you serialized `Small(5)` with serde_json, it would deserialize as `Big(5)`. In bitcode, `#[serde(untagged)] enum { Int(usize), Str(String) }` could have the worse problem that bytes from a string encoding could be interpreted as an integer. Data corruption is not something we tolerate in bitcode, although most likely the entire message would fail to deserialize.
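To make the failure mode concrete, here is a minimal std-only Rust sketch. The length-prefixed layout below is hypothetical (it is not bitcode's actual wire format); it only illustrates how, without a tag, a string encoding can silently decode as an integer:

```rust
// Hypothetical layout: 8-byte little-endian length, then the string bytes.
// NOT bitcode's real wire format, just an illustration.
fn encode_str(s: &str) -> Vec<u8> {
    let mut out = (s.len() as u64).to_le_bytes().to_vec();
    out.extend_from_slice(s.as_bytes());
    out
}

// An untagged decoder would try the Int variant first and "succeed"
// whenever at least 8 bytes are present, so no error can be raised.
fn try_decode_int(bytes: &[u8]) -> Option<u64> {
    bytes.get(..8).map(|b| u64::from_le_bytes(b.try_into().unwrap()))
}

fn main() {
    let encoded = encode_str("hello");
    // The string's length prefix decodes as a perfectly valid integer:
    assert_eq!(try_decode_int(&encoded), Some(5));
}
```

Here `Str("hello")` decodes "successfully" as `Int(5)`, because the length prefix is indistinguishable from an integer payload.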
Making bitcode self-describing (per-instance, to support `#[serde(untagged)]`) would probably cost much more data than the enum variants that `#[serde(untagged)]` tries to omit.

The `#[serde(untagged)]` feature is better suited to textual formats like JSON, where different types have disjoint prefixes like `[` or `{`.
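The disjoint-prefix point can be sketched in a few lines: in JSON the first non-whitespace character already distinguishes the possible types, which is exactly what a non-self-describing binary format lacks. A toy classifier (not serde's actual dispatch logic):

```rust
// Classify a JSON value by its first non-whitespace character.
// This is why untagged enums are cheap in JSON: the prefix is the tag.
fn json_kind(input: &str) -> &'static str {
    match input.trim_start().chars().next() {
        Some('[') => "array",
        Some('{') => "object",
        Some('"') => "string",
        Some(c) if c == '-' || c.is_ascii_digit() => "number",
        Some('t') | Some('f') => "bool",
        Some('n') => "null",
        _ => "invalid",
    }
}

fn main() {
    assert_eq!(json_kind("[1, 2]"), "array");
    assert_eq!(json_kind("{\"a\": 1}"), "object");
    assert_eq!(json_kind("5"), "number");
}
```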
> The size becomes larger than my daletpack format
If you can provide specific details (schema + encoding) of how you can outperform bitcode on size, that would be interesting :eyes:
> If you can provide specific details (schema + encoding) of how you can outperform bitcode on size, that would be interesting 👀
This format is specifically designed for my schema, but it doesn't compress well (I'm not very knowledgeable about how to optimize this), so I'm looking for solutions that are already out there. It's not for all data types, so it won't help you.
For my format, I had the idea of expanding each nested type (if it's an enum) and writing a single identifier for each combination.
For example:

```rust
enum Enum {
    First(Other),
    Second(Other),
}

enum Other {
    String(String),
    Num(u8),
}
```
`Enum` in this case has 4 binary identifiers, one for each combination:

- `Enum::FirstString`
- `Enum::FirstNum`
- `Enum::SecondString`
- `Enum::SecondNum`
I don't know how feasible this is to implement procedurally, but it should work. I don't know if this structure compresses well either.

I partially implemented this in the daletpack scheme (but I would still eventually like to do it for all types, procedurally).
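A hand-written encoder for the flattened-identifier idea might look like the sketch below, reusing the `Enum`/`Other` example above. This is a hypothetical illustration, not daletpack's actual layout; it writes one tag byte per (outer, inner) combination instead of two separate tags:

```rust
enum Other {
    String(String),
    Num(u8),
}

enum Enum {
    First(Other),
    Second(Other),
}

// One tag byte covers both enum levels:
// 0 = First(String), 1 = First(Num), 2 = Second(String), 3 = Second(Num)
fn encode(v: &Enum) -> Vec<u8> {
    let (tag, payload): (u8, Vec<u8>) = match v {
        Enum::First(Other::String(s)) => (0, s.as_bytes().to_vec()),
        Enum::First(Other::Num(n)) => (1, vec![*n]),
        Enum::Second(Other::String(s)) => (2, s.as_bytes().to_vec()),
        Enum::Second(Other::Num(n)) => (3, vec![*n]),
    };
    let mut out = vec![tag];
    out.extend_from_slice(&payload);
    out
}

fn main() {
    // Tag and payload fit in a single combined identifier plus the data.
    assert_eq!(encode(&Enum::First(Other::Num(7))), vec![1, 7]);
}
```

A derive macro would have to do this recursively for arbitrary nesting, which is where it gets tricky.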
After running your benchmark it looks like various formats take ~300 bytes. I would recommend testing on a larger dataset if you want to accurately benchmark compression. Compression algorithms are generally inefficient in terms of size and speed on input smaller than tens of kilobytes.
Also, zlib is just a thin wrapper around deflate (a small header plus a checksum).
Hi! I have a few types that contain `serde_json::Value`s, and I also get a panic with `type changed`. I checked and it seems that serde_json is not using `#[serde(untagged)]`. What's going on?
I recently added a warning to the README about this; `serde_json::Value` has a custom `Serialize` implementation that is very similar to an untagged enum.
With the serde version there is an error: `type changed`