birkenfeld / serde-pickle

Rust (de)serialization for the Python pickle format.
Apache License 2.0
185 stars 27 forks source link

A few changes to support complex pickle files #22

Open rvolgers opened 2 years ago

rvolgers commented 2 years ago

I was trying to load some pretty non-Rust-friendly pickle files with this crate (using the Value API) and, while I've mostly come to the conclusion that this crate is not really built for that, I've made some improvements which might be worth upstreaming. Consider this a request for comments on whether any of this is useful. I can split it up or make changes as needed.

This adds options which enable loading pickle files with recursive data structures (through an option to return a None when recursing instead of causing an error) and to support custom subclasses of dict, list, and set.

Custom subclasses of dict, list, and set are detected by looking for dict, list or set operations performed on an object. These objects are returned as a tuple (object_attributes, superclass_data) or, if you set an option, the object attributes will be discarded and only the superclass data will be returned. This option makes it possible to mostly treat (e.g.) subclasses of list as just lists, with the caveat that if the list is empty the object's attributes may be returned in the form of a dict instead, as we have no way to detect that an object is a list if no list operations are performed on it.

Each commit should be mostly self-contained, except that the first commit (Decode collections.defaultdict as dict) is reverted and superseded by the commit after that, since the second commit contains a different, more general solution to the same problem. I kept it around in case the more tricky solution is rejected. (Edit: this was true before a number of followup bugfix commits. Instead of rewriting history I've left them as-is, but I can fix that if wanted.)

birkenfeld commented 2 years ago

Thanks for the PR! I'll have to delay the complete review for a bit, but I like the idea of compatibility with more complex pickles.

One nitpick from first glance: the version bump should be in the minor version, since new APIs are added. Also, I've enabled the CI, and there seems to be a failing test now, which means that the default behavior has changed incompatibly?

rvolgers commented 2 years ago

Hey, thanks for taking a look!

I don't like that test, it checks that something fails which now succeeds. Should people be relying on failure as part of the API? The code already has a "do the best we can" attitude, so I would say no, but let me know what you think.

The version bump is just a quick hack. I noticed that cargo will fail to use a source override sometimes if the override and the crates.io one have the same version.

There is one thing which I was not quite sure about, so I want to make sure it's clear. The new code tries to detect the type of an object by looking at the operations performed on it. But if it e.g. a subclass of dict but the dict is empty, there are no dict operations performed, and thus the new code will not detect it as a dict subclass. This leads to the object being returned as a "normal" object, meaning its attributes are returned. The user of this library cannot tell the difference if unwrap_subclassed_data_structures is used.

This seems like a bit of a fundamental problem with the unwrap_subclassed_data_structures setting, so maybe it should be removed.

I actually have a followup PR which allows to specify the way to handle various types ahead of time in the options passed to the library. This avoids the ambiguity of auto-detection, but there are more complications and I'm not sure it's worth it.

landaire commented 1 year ago

This PR's been open for a while and while I haven't tested these changes, support for recursive structures would be very useful for a data format I'm parsing.

Edit: I actually just found that there must be a bug somewhere with the memo ref count. I took @rvolgers branch and did some experimenting -- I no longer get errors relating to recursive objects when I disable this condition (always keep the item in the memo): https://github.com/birkenfeld/serde-pickle/blob/13c0cd60239302a4efd0e806a6fb174f0667de2e/src/de.rs#L603