birkenfeld / serde-pickle

Rust (de)serialization for the Python pickle format.
Apache License 2.0

How to deserialize enums that fail with error "decoding error: enums must be tuples"? #8

Closed L0g4n closed 4 years ago

L0g4n commented 4 years ago

I am currently trying to make my deserialization process more robust, i.e. it should also handle a second pickle format in which one data field is missing.

Thus, my index data structure, which is the main entry point for the deserialization, looks like this:

#[derive(Debug, Deserialize, Clone)]
pub struct RpaIdx(BTreeMap<String, Vec<RpaEntryNew>>);

As you can imagine, RpaEntryNew is an enum with two variants: one has three fields, and the other lacks the prefix field:

#[derive(Debug, Deserialize, Clone, PartialEq)]
pub enum RpaEntryNew {
    V2(RpaEntryv2),
    V3(RpaEntryv3),
}
#[derive(Debug, Deserialize, PartialEq, Clone)]
pub struct RpaEntryv3 {
    offset: IntLen,
    len: IntLen,
    prefix: String,
}

#[derive(Debug, Deserialize, PartialEq, Clone)]
pub struct RpaEntryv2 {
    offset: IntLen,
    len: IntLen,
}

So, when I try to decode the whole thing, it fails as expected, since I do not know how to tell the library which variant to decode to:

// FIXME: fix 'decoding error, invalid length 2' by expecting this error and then try the other struct
let deserialized_indices: RpaIdx = serde_pickle::from_slice(&decoded_bytes)?; // FIXME: tell serde somehow to first decode the first variant of the enum, if that fails, the second one

The error message is the one in the title.

Do you happen to know how to make this possible? Basically, I want to try decoding the first variant of the enum and, if that fails, fall back to the other one.

birkenfeld commented 4 years ago

I think you should use #[serde(untagged)] for the enum to do that. See https://serde.rs/container-attrs.html for details.
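To illustrate what that attribute asks for, here is a minimal stdlib-only sketch of the "try each variant in declaration order, keep the first that matches" behaviour. It is not serde or serde-pickle code: `Value`, `try_v2`, `try_v3`, and `decode_entry` are hypothetical names, `IntLen` is assumed to be `u64`, and each entry is assumed to arrive as a plain sequence of two or three items with no variant name attached.

```rust
// Stdlib-only sketch; NOT serde or serde-pickle code.
// `Value` is a hypothetical stand-in for already-decoded pickle items.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(u64),
    Str(String),
}

// `IntLen` from the issue is assumed to be u64 here.
#[derive(Debug, Clone, PartialEq)]
struct RpaEntryV2 {
    offset: u64,
    len: u64,
}

#[derive(Debug, Clone, PartialEq)]
struct RpaEntryV3 {
    offset: u64,
    len: u64,
    prefix: String,
}

#[derive(Debug, Clone, PartialEq)]
enum RpaEntry {
    V2(RpaEntryV2),
    V3(RpaEntryV3),
}

// Matches only a sequence of exactly two integers.
fn try_v2(items: &[Value]) -> Option<RpaEntryV2> {
    match items {
        [Value::Int(offset), Value::Int(len)] => Some(RpaEntryV2 {
            offset: *offset,
            len: *len,
        }),
        _ => None,
    }
}

// Matches only a sequence of two integers followed by a string.
fn try_v3(items: &[Value]) -> Option<RpaEntryV3> {
    match items {
        [Value::Int(offset), Value::Int(len), Value::Str(prefix)] => Some(RpaEntryV3 {
            offset: *offset,
            len: *len,
            prefix: prefix.clone(),
        }),
        _ => None,
    }
}

// Try the variants in declaration order and keep the first that fits.
fn decode_entry(items: &[Value]) -> Result<RpaEntry, String> {
    if let Some(v2) = try_v2(items) {
        return Ok(RpaEntry::V2(v2));
    }
    if let Some(v3) = try_v3(items) {
        return Ok(RpaEntry::V3(v3));
    }
    Err(format!(
        "no variant matched a sequence of length {}",
        items.len()
    ))
}
```

Calling `decode_entry` with two integers yields the `V2` variant, and with two integers plus a string the `V3` variant. No variant name appears in the input, which is why the default externally tagged representation cannot match such data.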

L0g4n commented 4 years ago

Thanks, that worked. Next time, I'mma first check the serde docs.

Palladinium commented 4 years ago

I disagree with the suggestion to simply use #[serde(untagged)], since serde_pickle handles the default externally-tagged enum representation differently from other serde libraries I've personally encountered.

For example, I took serde_yaml and serde_json and used their to_string functions, while for serde_pickle I pickled with pickle protocol 3, unpickled the result in Python 3, and passed it to print().

#[derive(Serialize)]
enum Foo {
    Struct { a: i32 },
    NewType(i32),
    Tuple(i32, u32),
    Unit,
}

fn values() -> Vec<Foo> {
    vec![
        Foo::Struct { a: 1 },
        Foo::NewType(2),
        Foo::Tuple(3, 4),
        Foo::Unit,
    ]
}

JSON

[{"Struct":{"a":1}},{"NewType":2},{"Tuple":[3,4]},"Unit"]

YAML

---
- Struct:
    a: 1
- NewType: 2
- Tuple:
    - 3
    - 4
- Unit

serde-pickle

[('Struct', {'a': 1}), ('NewType', 2), ('Tuple', [3, 4]), ('Unit',)]

What I'd expect serde-pickle to do

I think serde-pickle should follow the example of serde_yaml and serde_json, and (de)serialize like:

[{'Struct':{'a':1}},{'NewType':2},{'Tuple':[3,4]},'Unit']

Furthermore, if you use a Python YAML or JSON library to load the above YAML or JSON examples, it produces the same output I'd expect serde-pickle to emit. (This is actually my use case: a Python script and a Rust program exchange data that is also distributed as YAML files.)
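To make the two shapes concrete, here is a small hand-written sketch that renders the same `Foo` values as Python literals both ways: tuple-style, as in the serde-pickle output quoted above, and dict-style, as in the JSON/YAML-like form. It is not library code; `parts`, `as_tuple`, and `as_dict` are hypothetical helpers that only mirror the outputs shown in this comment.

```rust
// Hand-written rendering for illustration; NOT serde or serde-pickle code.
enum Foo {
    Struct { a: i32 },
    NewType(i32),
    Tuple(i32, u32),
    Unit,
}

// Splits a value into its variant name and, if it has one,
// its payload written as a Python literal.
fn parts(v: &Foo) -> (&'static str, Option<String>) {
    match v {
        Foo::Struct { a } => ("Struct", Some(format!("{{'a': {}}}", a))),
        Foo::NewType(n) => ("NewType", Some(n.to_string())),
        Foo::Tuple(x, y) => ("Tuple", Some(format!("[{}, {}]", x, y))),
        Foo::Unit => ("Unit", None),
    }
}

// Tuple-style tagging: (name, payload), or a 1-tuple for unit variants.
fn as_tuple(v: &Foo) -> String {
    match parts(v) {
        (name, Some(payload)) => format!("('{}', {})", name, payload),
        (name, None) => format!("('{}',)", name),
    }
}

// Dict-style tagging: {name: payload}, or a bare string for unit variants.
fn as_dict(v: &Foo) -> String {
    match parts(v) {
        (name, Some(payload)) => format!("{{'{}': {}}}", name, payload),
        (name, None) => format!("'{}'", name),
    }
}
```

Both forms carry the variant name next to the payload; they differ only in the container used for the tag.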

birkenfeld commented 4 years ago

Just for clarification, @Palladinium's comment isn't applicable here (but it is a legitimate issue, see #9). Whether serde-pickle uses dicts or tuples for the externally tagged enum representation, it would in no case work with @L0g4n's data without using #[serde(untagged)].

L0g4n commented 4 years ago

@birkenfeld I already thought of that; this seems more like an internal consistency issue for serde libraries.