Open feizerl opened 2 years ago
The pickle instructions you're displaying seem to be reconstruction instructions for `set` (more precisely, `FrozenSet`) instances, which are not ordered (unlike dicts in Python 3.7+). I would imagine that pickles of `set` objects created by `pickle` (and not `cloudpickle`) are also non-deterministic. Not sure if we want to override this behavior (which would in any case work only for the `CloudPickler` inheriting from the pure-Python `Pickler`, and not the fast C-backed `Pickler`).
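For illustration, here is a minimal sketch of how the plain standard-library `pickle` can produce different bytes for the same set of strings. It assumes that the variation comes from string hash randomization, so it pickles the same `frozenset` in child interpreters started with different `PYTHONHASHSEED` values. The element names, the seed range, and the helper `pickle_hex_with_seed` are made up for this example.

```python
import os
import pickle
import subprocess
import sys

# Child program: pickle a frozenset of strings and print the bytes as hex.
CHILD = (
    "import pickle, sys;"
    "s = frozenset('item%d' % i for i in range(20));"
    "sys.stdout.write(pickle.dumps(s, protocol=4).hex())"
)


def pickle_hex_with_seed(seed):
    """Run CHILD in a fresh interpreter with a fixed string-hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Same set contents, five different hash seeds.
payloads = [pickle_hex_with_seed(seed) for seed in range(1, 6)]

# Byte strings that differ from one interpreter run to the next.
distinct_payloads = set(payloads)

# Every payload still unpickles to an equal frozenset.
restored = {pickle.loads(bytes.fromhex(p)) for p in payloads}

print(len(distinct_payloads), "distinct payloads,", len(restored), "distinct value")
```

The payloads differ only in the order in which the elements are written, which matches the "same entries, shuffled" pattern reported in this issue.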
We could introduce a constructor parameter to implement a slower, deterministic version of cloudpickle. But indeed, I am not sure how we could do that with the subclass of the C implementation of the CPython `pickle.Pickler` class.
Note that `joblib.hash` can do that already, but it does not analyse the content of dynamically defined functions.
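The slower, deterministic variant discussed above could look roughly like the sketch below. It is not cloudpickle's implementation: it subclasses the pure-Python `pickle._Pickler` from the standard library and writes `set` and `frozenset` elements in sorted order. The names `SortedSetPickler` and `deterministic_dumps` are hypothetical, and the sketch assumes the elements of each set are mutually orderable (for example, all strings).

```python
import io
import pickle


class SortedSetPickler(pickle._Pickler):
    """Pure-Python pickler that writes set elements in sorted order.

    Hypothetical sketch: assumes the elements of every set can be
    compared with each other, otherwise sorted() raises TypeError.
    """

    # Copy the parent's dispatch table so the override stays local.
    dispatch = dict(pickle._Pickler.dispatch)

    def save_sorted_set(self, obj):
        # Rebuild the object as type(obj)(sorted_elements) on load,
        # so the byte stream no longer depends on iteration order.
        self.save_reduce(type(obj), (sorted(obj),), obj=obj)

    dispatch[set] = save_sorted_set
    dispatch[frozenset] = save_sorted_set


def deterministic_dumps(obj, protocol=4):
    """Pickle obj with SortedSetPickler and return the bytes."""
    buffer = io.BytesIO()
    SortedSetPickler(buffer, protocol).dump(obj)
    return buffer.getvalue()


names = ["item%d" % i for i in range(50)]
first = frozenset(names)
second = frozenset(reversed(names))  # same elements, other insertion order

same_bytes = deterministic_dumps(first) == deterministic_dumps(second)
round_trip = pickle.loads(deterministic_dumps(first)) == first
print(same_bytes, round_trip)
```

This only works with the pure-Python pickler, since the C-backed `Pickler` handles exact `set` and `frozenset` instances internally, and it gives up the speed of the C implementation.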
Hello,
@ogrisel mentioned in this comment (https://github.com/cloudpipe/cloudpickle/issues/385#issuecomment-661103789):
However, @ogrisel also pushed a PR (#428), released as part of cloudpickle 2.0.0, that tried to address non-determinism owing to dictionary ordering.
I wanted to confirm what the official status of the project is regarding non-determinism, because I am still seeing non-deterministic pickles in cloudpickle 2.0.0.
Here is the `pickletools.dis` output for the pickle of a function:
And for the pickle of the same function on a second attempt:
As you can see, the entries are all the same, but shuffled around.
This function is part of a large project, so unfortunately I can't produce a short test case right now.
Note that Kubeflow Pipelines implements caching by checking that the pickle of the function hasn't changed (there is an option to not use pickle as well, but it has its own problems). A non-deterministic cloudpickle invalidates the cache every time, which makes that feature useless.
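To make the caching problem concrete, here is a minimal sketch of a pickle-based cache key. It is an assumption about how such a scheme might work, not Kubeflow Pipelines' actual code: the hypothetical `cache_key` helper hashes the pickle bytes with SHA-256, so any reordering of the serialized entries produces a different key.

```python
import hashlib
import pickle


def cache_key(obj, protocol=4):
    """Hypothetical cache key: SHA-256 digest of the object's pickle bytes."""
    return hashlib.sha256(pickle.dumps(obj, protocol=protocol)).hexdigest()


# Identical content in identical order gives the same key, so the cache hits.
key_first_run = cache_key(["load", "train", "evaluate"])
key_second_run = cache_key(["load", "train", "evaluate"])

# The same entries written in a different order give a different key,
# so the cache misses even though nothing meaningful changed.
key_shuffled = cache_key(["train", "load", "evaluate"])

print(key_first_run == key_second_run, key_first_run == key_shuffled)
```

A list is used here only to control the order explicitly. With a set inside the pickled function, the order can change between runs without any change to the source.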
Thanks.