Open klaragerlei opened 3 years ago
Could this be solved by specifying the max pickler-protocol in the shuffled-analysis code, so that it saves dataframes that are backwards compatible with the 3.6 pipeline?
e.g. df.to_pickle('cat.pkl', protocol=4)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_pickle.html
Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. It is the default protocol starting with Python 3.8. Refer to PEP 3154 for information about improvements brought by protocol 4.
Protocol version 5 was added in Python 3.8. It adds support for out-of-band data and speedup for in-band data. Refer to PEP 574 for information about improvements brought by protocol 5.
From: https://docs.python.org/3/library/pickle.html
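To see which protocol an existing file was written with, the first two bytes are enough: pickles at protocol 2 or later begin with the PROTO opcode (0x80) followed by the version number. Below is a minimal sketch using only the standard library; `pickle_protocol` is a hypothetical helper name, not something from the pipeline.

```python
import pickle


def pickle_protocol(data: bytes) -> int:
    """Return the protocol version declared at the start of a pickle.

    Protocols 2 and later start with the PROTO opcode (0x80) followed by
    one byte holding the version. Protocols 0 and 1 have no such header,
    so this sketch reports both of them as 0.
    """
    if len(data) >= 2 and data[0] == 0x80:
        return data[1]
    return 0


# Compare the header written by protocol 4 and by the newest protocol
# available in the running interpreter.
payload = {"cat": [1, 2, 3]}
for protocol in (4, pickle.HIGHEST_PROTOCOL):
    blob = pickle.dumps(payload, protocol=protocol)
    print(protocol, pickle_protocol(blob))
```

For a file on disk, reading the first two bytes in binary mode and passing them to the helper gives the same answer without unpickling anything.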
This would only affect newly saved dataframes, but you could write a quick Python 3.8 script to glob, load, and re-save your existing dataframes using protocol 4.
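A rough sketch of such a re-save script, to be run under Python 3.8. It uses the standard-library pickle module rather than pd.read_pickle and df.to_pickle(path, protocol=4), so it handles any pickled object; pandas still has to be installed for dataframes to load. The function name `resave_pickles`, the `*.pkl` pattern, and the assumption that the files are uncompressed are mine, not from the pipeline.

```python
import pickle
from pathlib import Path


def resave_pickles(root, pattern="*.pkl", protocol=4):
    """Re-save pickles under `root` that use a protocol newer than `protocol`.

    Returns the list of paths that were rewritten. Assumes the files are
    plain, uncompressed pickles.
    """
    rewritten = []
    for path in sorted(Path(root).rglob(pattern)):
        with path.open("rb") as f:
            header = f.read(2)
            # Protocol 2+ pickles start with 0x80 followed by the version.
            # Anything already at or below the target is left untouched.
            if len(header) < 2 or header[0] != 0x80 or header[1] <= protocol:
                continue
            f.seek(0)
            obj = pickle.load(f)

        # Write to a temporary file first so that a failure part-way
        # through cannot leave a truncated pickle in place of the original.
        tmp = path.with_name(path.name + ".tmp")
        with tmp.open("wb") as f:
            pickle.dump(obj, f, protocol=protocol)
        tmp.replace(path)
        rewritten.append(path)
    return rewritten
```

Calling resave_pickles on the folder that holds the shuffled-analysis outputs would rewrite only the protocol 5 files. Note that the protocol is only one part of compatibility: if the two environments have very different pandas versions, a re-saved dataframe may still fail to load in the older one, so it is worth testing on a single file first.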
(Python 3.8 does have the walrus operator, :=, so it would be nice to upgrade someday anyway...)
I like this idea. @HDClark94, is there any reason for using protocol 5, or would it be okay to change this?
Is your feature request related to a problem? Please describe. The pipeline uses Python 3.6 and the shuffled analysis uses Python 3.8, so the dataframe outputs of the two are not compatible: Python 3.6 cannot open pickles saved by 3.8. This problem can be managed by having multiple virtual environments on Eleanor.
Describe the solution you'd like: Update the pipeline to use Python 3.8.
Describe alternatives you've considered: Keep using the workaround. I think this will cause a lot of issues for less experienced users.