birkenfeld / serde-pickle

Rust (de)serialization for the Python pickle format.
Apache License 2.0

Functionality to deserialize large `Int` fields into `Value::Int` #30

Open tntokum opened 2 months ago

tntokum commented 2 months ago

Hi, I'm currently using serde_pickle to unpack some Python pickles, and it's working great for the most part! However, I'd like to directly deserialize fields of arbitrary-sized integers into the Value enum; currently, this doesn't seem possible (please correct me if I'm wrong). It seems like serde_json handles this sort of deserialization pretty well, and I'd like to be able to do the same with your library.

Use case: I have some data containing a lot of fields, a few of which are multi-precision integers. I want to deserialize my data directly into a struct representation, with the field corresponding to the big number containing a Value. Right now, I'm forced to deserialize the entire thing into a Value as soon as I run into the number, which loses most of the serde ergonomics and adds a lot of manual parsing. Here's a small Rust stub to show rather than tell:

use serde::Deserialize;
use serde_pickle::HashableValue;
use serde_pickle::Value;
use std::collections::BTreeMap;
use std::io::BufReader;

#[derive(Deserialize, Debug, PartialEq)]
struct Data {
    hello: serde_pickle::Value,
}

fn main() {
    let pickle_raw = b"\x80\x04\x95\x16\x00\x00\x00\x00\x00\x00\x00}\x94\x8c\x05hello\x94\x8a\x08\xd5\xe5\xb5\x05|\xe3\xc6\x01s.";
    let pickle_long_raw = b"\x80\x04\x95 \x00\x00\x00\x00\x00\x00\x00}\x94\x8c\x05hello\x94\x8a\x12\x95\xec\x1bE \x95\xec\xcb\xf0t\x05>\x81=\x1a\xc0\xb2\x0es.";

    let data_value =
        serde_pickle::value_from_reader(BufReader::new(&pickle_raw[..]), Default::default())
            .unwrap();
    assert!(
        data_value
            == Value::Dict(BTreeMap::from_iter(vec![(
                HashableValue::String("hello".to_string()),
                Value::I64(128039761237894613)
            )]))
    );
    let data_value =
        serde_pickle::value_from_reader(BufReader::new(&pickle_long_raw[..]), Default::default())
            .unwrap();
    assert!(
        data_value
            == Value::Dict(BTreeMap::from_iter(vec![(
                HashableValue::String("hello".to_string()),
                Value::Int(
                    "1280397612378946138756897587658765876587669"
                        .parse()
                        .unwrap()
                )
            )]))
    );

    let data: Data =
        serde_pickle::from_reader(BufReader::new(&pickle_raw[..]), Default::default()).unwrap();
    assert!(
        data == Data {
            hello: Value::I64(128039761237894613)
        }
    );

    // The thing I really don't want to do, but it doesn't panic
    let data = serde_pickle::value_from_reader(BufReader::new(&pickle_long_raw[..]), Default::default()).unwrap();

    // This panics, but the analogous call in serde_json succeeds
    let data: Data =
        serde_pickle::from_reader(BufReader::new(&pickle_long_raw[..]), Default::default())
            .unwrap();
    println!("data: {:?}", data);
}

Do you have any thoughts on this, or whether you'd support adding this functionality? I've been looking at the Deserializer myself to start getting this going. Thank you!
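For context on why the two pickles in the stub end up as different `Value` variants: both integers are stored with the pickle `LONG1` opcode (`\x8a`), which is followed by a one-byte length and then that many bytes of little-endian two's complement data. The first payload is 8 bytes long and fits in an `i64`; the second is 18 bytes long and does not. Below is a minimal stdlib-only sketch of that size check. The helper name `long1_to_i64` is hypothetical and this is not serde_pickle's actual code; it also assumes the payload is minimally encoded, so anything longer than 8 bytes is treated as too large.

```rust
/// Hypothetical helper (not part of serde_pickle): decode a LONG1 payload
/// (little-endian two's complement) into an i64, or return None when the
/// payload is longer than 8 bytes and would need a big integer type.
fn long1_to_i64(payload: &[u8]) -> Option<i64> {
    if payload.len() > 8 {
        return None;
    }
    // Sign-extend: pad with 0xff when the top bit of the last byte is set.
    let negative = payload.last().map_or(false, |b| b & 0x80 != 0);
    let mut buf = if negative { [0xffu8; 8] } else { [0u8; 8] };
    buf[..payload.len()].copy_from_slice(payload);
    Some(i64::from_le_bytes(buf))
}

fn main() {
    // Payload bytes copied from `pickle_raw` (after `\x8a\x08`).
    let small = b"\xd5\xe5\xb5\x05|\xe3\xc6\x01";
    // Payload bytes copied from `pickle_long_raw` (after `\x8a\x12`).
    let large = b"\x95\xec\x1bE \x95\xec\xcb\xf0t\x05>\x81=\x1a\xc0\xb2\x0e";

    println!("{:?}", long1_to_i64(small)); // Some(128039761237894613)
    println!("{:?}", long1_to_i64(large)); // None: 18 bytes, too big for i64
}
```

The `None` case corresponds to the value that shows up as `Value::Int` in the stub above, which is the variant that currently cannot be deserialized into a `Value` struct field.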

birkenfeld commented 2 months ago

I would have thought that this works 😬 thanks for the report, I'll have a look where the problem lies!

birkenfeld commented 1 month ago

Hmm, I can see that serde_json is doing some black magic here. I don't understand it yet, will need some more time...

tntokum commented 1 month ago

Thanks for looking! Yeah, I'm seeing the same thing in serde_json. The arbitrary precision flag makes it look like there are basically 4 ways of parsing ints -- extra weird.

No worries if it's a bit out of reach for now, I'm working with the Value enum and it's getting the job done 😀