birkenfeld / serde-pickle

Rust (de)serialization for the Python pickle format.
Apache License 2.0

Deserialization 3.5x slower than Python pickle, 4x slower than serde_json #14

Open · naktinis opened this issue 3 years ago

naktinis commented 3 years ago

I set up a simple benchmark with a 67MB pickle and measured deserialization speed in 7 scenarios.

| library | call | time |
| --- | --- | --- |
| Python `pickle` | `pickle.load` | 341 ms |
| Python `json` | `json.load` | 397 ms |
| `serde_json` | `from_str` | 327 ms |
| `bincode` | `from_slice` | 314 ms |
| `py-marshal` | `marshal_load` | 691 ms |
| `serde-pickle` | `from_reader` | 1250 ms |
| `serde-pickle` | `from_slice` | 1310 ms |
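
The ratios in the title follow from these numbers; a quick arithmetic check (rounded to one decimal place):

```python
# Slowdown of serde-pickle's from_reader relative to the two baselines above.
pickle_py, json_rs, serde_pickle = 341, 327, 1250  # milliseconds, from the table

print(round(serde_pickle / pickle_py, 1))  # vs Python pickle -> 3.7
print(round(serde_pickle / json_rs, 1))    # vs serde_json   -> 3.8
```

So the measured gap is about 3.7x / 3.8x, roughly the "3.5x / 4x" in the title.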

Is this known behavior? Is there any hope of it improving in the foreseeable future? I'm sharing my setup below so you can point out anything I missed.

Data

>>> import random, string
>>> data = [''.join(random.sample(string.ascii_letters, 32)) for _ in range(2_000_000)]

Python load

>>> import time, pickle, marshal, json
>>> marshal.dump(data, open('test.marshal', 'wb'))
>>> pickle.dump(data, open('test.pickle', 'wb'))
>>> json.dump(data, open('test.json', 'w'))
>>> t = time.time(); _ = pickle.load(open('test.pickle', 'rb')); print(f'{time.time() - t:.3f}s')
0.341s
>>> t = time.time(); _ = json.load(open('test.json', 'rb')); print(f'{time.time() - t:.3f}s')
0.397s
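
The table has no entry for Python's own marshal.load, even though test.marshal is written above. As a hedged, self-contained sketch (smaller list and the hypothetical file name test_small.marshal, so it runs quickly), it can be timed the same way:

```python
# Sketch: time Python's built-in marshal.load on a freshly written file.
# Uses a 10k-string list instead of the 2M-string benchmark data.
import marshal, random, string, time

data = [''.join(random.sample(string.ascii_letters, 32)) for _ in range(10_000)]
with open('test_small.marshal', 'wb') as f:
    marshal.dump(data, f)

t = time.time()
with open('test_small.marshal', 'rb') as f:
    loaded = marshal.load(f)
print(f'{time.time() - t:.3f}s')
```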

Rust load

// Imports assumed by the snippets below (crate paths per the dependency
// list; the py_marshal `read`/`Obj` paths are as used in load_marshal):
use std::fs::File;
use std::io::{BufReader, Read};
use std::sync::{Arc, RwLock};
use std::time;
use py_marshal::{read, Obj};
use serde_json as json;
use serde_pickle as pickle;

pub fn load_pickle(path: &str) -> pickle::Value {
    let file = BufReader::new(File::open(path).unwrap());
    pickle::from_reader(file).expect("couldn't load pickle")
}

pub fn load_pickle_slice(path: &str) -> pickle::Value {
    let mut bytes = Vec::new();
    File::open(path).unwrap().read_to_end(&mut bytes).unwrap();
    pickle::from_slice(&bytes).expect("couldn't load pickle")
}

pub fn load_marshal(path: &str) -> Result<Arc<RwLock<Vec<Obj>>>, &'static str> {
    let file = BufReader::new(File::open(path).unwrap());
    match read::marshal_load(file) {
        Ok(obj) => Ok(obj.extract_list().unwrap()),
        Err(_) => Err("error_load"),
    }
}

pub fn load_json(path: &str) -> json::Value {
    let mut s = String::new();
    File::open(path).unwrap().read_to_string(&mut s).unwrap();
    serde_json::from_str(&s).expect("couldn't load json")
}

pub fn load_bincode<T>(path: &str) -> T
    where T: serde::de::DeserializeOwned
{
    let file = BufReader::new(File::open(path).unwrap());
    bincode::deserialize_from(file).unwrap()
}

fn main() {
    println!("Loading pickle...");
    let timer = time::Instant::now();
    let data = load_pickle("test.pickle");
    println!("Load completed in {:.2?}", timer.elapsed());

    println!("Loading pickle slice...");
    let timer = time::Instant::now();
    let data = load_pickle_slice("test.pickle");
    println!("Load completed in {:.2?}", timer.elapsed());

    println!("Loading marshal...");
    let timer = time::Instant::now();
    let data = load_marshal("test.marshal").unwrap();
    println!("Load completed in {:.2?}", timer.elapsed());

    println!("Loading JSON...");
    let timer = time::Instant::now();
    let data = load_json("test.json");
    println!("Load completed in {:.2?}", timer.elapsed());

    println!("Loading Bincode...");
    let timer = time::Instant::now();
    let data: Vec<String> = load_bincode("test.bincode");
    println!("Load completed in {:.2?}", timer.elapsed());
}

Dependencies

[dependencies]
serde-pickle = "0.6"
bincode = "1.3"
serde_json = "1.0"
py-marshal = { git = "https://github.com/sollyucko/py-marshal" }
serde = { version = "1.0", features = ["derive"] }
birkenfeld commented 3 years ago

Thanks for the report; I can more or less reproduce the results. (Please include all of the code next time, though; it makes things much easier.)

This crate hasn't been optimized for speed (yet), so it's not surprising that it doesn't outperform Python's pickle module. Comparisons across different formats are always harder to reason about, since each format does a different amount of work per value.

In any case, I can't spend much time on this at present. PRs are welcome, and I expect there are some easy wins achievable with basic profiling.