WallarooLabs / wally

Distributed Stream Processing
https://www.wallaroolabs.com
Apache License 2.0

Python: Switch default serialization method to binary cPickle. #2069

Open nemosupremo opened 6 years ago

nemosupremo commented 6 years ago

As far as I understand it, Wallaroo serializes with the standard pickle module and protocol 0 by default. This is the slowest version of pickle, and binary cPickle is much faster.

As shown by http://justinfx.com/2012/07/25/python-2-7-3-serializer-speed-comparisons/, serialization can be up to 50x faster.
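
A rough way to check the gap locally is to time both protocols on a representative payload. The sketch below is only illustrative: the payload shape and the `time_dumps` helper are made up for this example and are not Wallaroo code, and the measured ratio will depend on the data and the Python version.

```python
import pickle
import timeit

try:
    import cPickle as fast_pickle  # Python 2: C implementation
except ImportError:
    fast_pickle = pickle  # Python 3: pickle already uses the C accelerator

# Hypothetical payload standing in for application state.
payload = [{"id": i, "name": "item-%d" % i, "values": list(range(10))}
           for i in range(1000)]


def time_dumps(module, protocol, repeat=20):
    """Return the total seconds taken by `repeat` calls to module.dumps."""
    return timeit.timeit(lambda: module.dumps(payload, protocol), number=repeat)


text_seconds = time_dumps(pickle, 0)          # text protocol
binary_seconds = time_dumps(fast_pickle, -1)  # -1 selects the highest protocol

print("pickle, protocol 0:       %.4fs" % text_seconds)
print("binary, highest protocol: %.4fs" % binary_seconds)
```

On Python 3 both lines use the same C-backed module, so the difference shown there comes from the protocol alone.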

The serialize and deserialize methods can be written like so:

import cPickle

def serialize(o):
    return cPickle.dumps(o, -1)  # -1 selects the highest available protocol (pickle.HIGHEST_PROTOCOL)

def deserialize(bs):
    return cPickle.loads(bs)

slfritchie commented 6 years ago

I'll defer to my Wallaroo Labs pythonistas for an opinion. My $0.02 is that I've witnessed how slow the default pickle is, and it is terribly slow. Thank you for the suggestion.

A complication that I can think of right now is that a couple of Wallaroo developers are working on multithreading Python interpreters within a single Machida process. I've been hearing lots of wailing & gnashing of teeth from them. It would be fantastic if cPickle made their work easier; it might not.

aturley commented 6 years ago

I don't have an objection to using cPickle over pickle, since for our use case the only difference is speed.

The multi-threaded Python project isn't going anywhere right now, so there's nothing to worry about on that front.

cararemixed commented 6 years ago

We've had some issues with old-style classes in our decorator code breaking under the new protocol, but I think we've since replaced those, so it will likely work now. cPickle is also a good default, as it covers the common cases. Those who need custom pickling callbacks can use pickle explicitly.

cararemixed commented 6 years ago

Oh, one more note: I prefer the newer protocols because they allow slotted class definitions, which can make state instances more memory-efficient.
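
To illustrate the point about slotted classes: a class that defines `__slots__` without a `__getstate__` is rejected by protocols 0 and 1, but pickles fine from protocol 2 on. This is a minimal sketch, and the `WordCount` class is a hypothetical state object, not something taken from Wallaroo.

```python
import pickle


class WordCount(object):
    """Hypothetical state object; __slots__ avoids a per-instance __dict__."""
    __slots__ = ("word", "count")

    def __init__(self, word, count):
        self.word = word
        self.count = count


state = WordCount("hello", 3)

# Protocols 0 and 1 are expected to refuse slotted classes that lack
# __getstate__; record the error instead of letting it propagate.
try:
    pickle.dumps(state, 0)
    protocol0_error = None
except TypeError as err:
    protocol0_error = err

# Protocol 2 and newer handle the slots without any extra code.
restored = pickle.loads(pickle.dumps(state, 2))

print("protocol 0 error: %s" % protocol0_error)
print("restored: %s %d" % (restored.word, restored.count))
```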

aturley commented 6 years ago

In Python 3 we need to use pickle, which already uses the C implementation (the former cPickle) under the hood.

Everyone seems to be in favor of this change. We should try switching to cPickle and see if anything breaks. If it works, we should use it.
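
One way to cover both interpreters is a guarded import, sketched below. This is an assumption about how it could be wired up rather than the actual Wallaroo change; the `serialize` and `deserialize` names follow the ones suggested earlier in this issue.

```python
try:
    import cPickle as pickle  # Python 2: explicit C implementation
except ImportError:
    import pickle  # Python 3: C accelerator is used automatically if present


def serialize(o):
    # HIGHEST_PROTOCOL is always a binary protocol on both Python 2 and 3.
    return pickle.dumps(o, pickle.HIGHEST_PROTOCOL)


def deserialize(bs):
    return pickle.loads(bs)


round_tripped = deserialize(serialize({"key": [1, 2, 3]}))
print(round_tripped)
```

Note that the highest protocol differs between Python 2 and Python 3, so if data ever had to be shared between the two, pinning the protocol to 2 would be the safer choice.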

aturley commented 6 years ago

We need to make sure this is appropriately documented when we make the change.