cloudpipe / cloudpickle

Extended pickling support for Python objects
Other
1.64k stars 167 forks source link

Random class_tracker_id for dynamic class #510

Open gmcatsf opened 1 year ago

gmcatsf commented 1 year ago

cloudpickle generates random uuids to track dynamic classes, and those random uuids are added to outputs. For example, if there is a class serialized by value, the following lines are found with pickletools.dis

  571: s                    SETITEM
  572: \x8c                 SHORT_BINUNICODE 'fa5abda803d644e0bdcfdffec5c8f8d6'
  606: \x94                 MEMOIZE    (as 56)

This string comes from class_tracker_id in cloudpickle and makes binary outputs different even though there is no code change.

Can random ids be replaced with deterministic ids, say a sequential number, for class_tracker_id?

This could be part of existing https://github.com/cloudpipe/cloudpickle/issues/453

ogrisel commented 11 months ago

@gmcatsf feel free to open a PR for that. I am not sure how the proposal of using a sequential id would pan-out in practice. We need to try and see if the existing tests pass unchanged. We also need new tests to specify what we mean by deterministic pickle files.

We might need a thread-safe counter increment, probably with a lock. We cannot rely on the GIL because we would like this code to work with the nogil fork of CPython (and also PyPy).

An alternative to sequential ids would be to hash the contents of the class def, but that might be too complex / expensive to do because if it might imply scanning the reference graph of the class object twice instead of once.