cloudpipe / cloudpickle

Extended pickling support for Python objects

cythonized pydantic objects in __main__ cannot be pickled #408

Closed marco-neumann-by closed 3 years ago

marco-neumann-by commented 3 years ago

Abstract

The following code snippet fails with cloudpickle but works with stock pickle when pydantic is cythonized (either via a platform-specific wheel or by having Cython installed when running setup.py):

# bug.py
import cloudpickle
import pydantic
import pickle

class Bar(pydantic.BaseModel):
    a: int

pickle.loads(pickle.dumps(Bar(a=1))) # This works well
cloudpickle.loads(cloudpickle.dumps(Bar(a=1))) # This fails with the error below

When running the file directly, so that it is __main__:

$ python bug.py

The error message is:

_pickle.PicklingError: Can't pickle <cyfunction int_validator at 0x7fc6808f1040>: attribute lookup lambda12 on pydantic.validators failed

Note that the issue does NOT appear when a non-cythonized pydantic version is used.

Also note that the issue does NOT appear when the file is not __main__, for example:

$ python -c "import bug"

Environment

Technical Background

In contrast to pickle, cloudpickle pickles the actual class when it resides in __main__, see the following note in the README:

Among other things, cloudpickle supports pickling for lambda functions along with functions and classes defined interactively in the __main__ module (for instance in a script, a shell or a Jupyter notebook).
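
For context, stock pickle serializes functions and classes by reference (module plus qualified name) and fails on anything that cannot be looked up that way, which is why cloudpickle falls back to pickling by value for objects defined in __main__. A minimal stdlib-only sketch of the by-reference limitation:

```python
import pickle

# A lambda defined at module scope has no importable name
# (__qualname__ is "<lambda>"), so stock pickle cannot store a
# module-plus-name reference to it.
square = lambda x: x * x

try:
    pickle.dumps(square)
    failed = False
except pickle.PicklingError:
    failed = True

print(failed)  # True: stock pickle cannot serialize the lambda
```

cloudpickle succeeds on the same lambda by embedding the code object itself in the stream, and the same by-value path is taken for classes defined in __main__.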

I THINK that might be the reason why this happens. What's somewhat weird is that the object in question is pydantic.validators.int_validator which CAN actually be pickled:

from pydantic.validators import int_validator
import cloudpickle
import pickle

# both work:
pickle.dumps(int_validator)
cloudpickle.dumps(int_validator)
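
The failure mode itself can be reproduced with stock pickle and no Cython at all: pickling a function by reference requires that looking up the function's __qualname__ on its __module__ find the very same object. A minimal sketch simulating the mismatched qualname a cyfunction reports (the names here are illustrative, not pydantic's internals):

```python
import pickle

def int_validator(v):
    return int(v)

# Simulate what a Cython-compiled module reports: the function's
# __qualname__ ("lambda12") does not match the name it is actually
# bound to in its module, so pickle's by-reference lookup fails.
int_validator.__qualname__ = "lambda12"

try:
    pickle.dumps(int_validator)
    err = None
except pickle.PicklingError as exc:
    err = exc

print(err)  # attribute lookup lambda12 on __main__ failed
```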

References

This was first reported in #403.

ogrisel commented 3 years ago

Could you please edit the bug report to include the full traceback?

ogrisel commented 3 years ago

Also is this problem happening with the current master branch of cloudpickle?

ogrisel commented 3 years ago

I believe this was fixed by #409 as I cannot reproduce anymore. We still need to release though.

lukasmasuch commented 3 years ago

I still get the same error using the cloudpickle version from master in Python 3.8.5:

[screenshot: the same PicklingError traceback]

The fix from #409 only seems to target Python version < 3.7.

kylebarron commented 3 years ago

Edited to use cloudpickle from master

This issue should be reopened.

The difference between environments, and likely the reason @ogrisel was unable to reproduce this, is that pydantic can be installed with or without Cython support. The Cython build of pydantic is unsurprisingly significantly faster than the pure-Python build and is also the default install (at least on platforms for which wheels exist).

Here are two examples using virtualenv that should be reproducible, using the same script as @marco-neumann-by defined initially:

# example.py
import cloudpickle
import pydantic
import pickle

class Bar(pydantic.BaseModel):
    a: int

pickle.loads(pickle.dumps(Bar(a=1))) # This works well
cloudpickle.loads(cloudpickle.dumps(Bar(a=1))) # This fails with the error below

Non-cython Pydantic

Note that --no-binary pydantic tells pip to build from source, without any compiled Cython extensions.

virtualenv .venv
source ./.venv/bin/activate
pip install git+https://github.com/cloudpipe/cloudpickle pydantic --no-binary pydantic

Here you can see that there are no compiled Cython extensions:

> ls ./.venv/lib/python3.8/site-packages/pydantic/
__init__.py           datetime_parse.py     json.py               tools.py
__pycache__           decorator.py          main.py               types.py
_hypothesis_plugin.py env_settings.py       mypy.py               typing.py
annotated_types.py    error_wrappers.py     networks.py           utils.py
class_validators.py   errors.py             parse.py              validators.py
color.py              fields.py             py.typed              version.py
dataclasses.py        generics.py           schema.py

And the example passes without issue:

> python example.py
> echo $?
0

Cython-based Pydantic

Now we install pydantic without --no-binary pydantic.

deactivate
rm -rf .venv
virtualenv .venv
source ./.venv/bin/activate
pip install git+https://github.com/cloudpipe/cloudpickle pydantic

Now you can see that compiled C extensions are included with pydantic:

> ls ./.venv/lib/python3.8/site-packages/pydantic/
__init__.cpython-38-darwin.so           json.cpython-38-darwin.so
__init__.py                             json.py
__pycache__                             main.cpython-38-darwin.so
_hypothesis_plugin.cpython-38-darwin.so main.py
_hypothesis_plugin.py                   mypy.cpython-38-darwin.so
annotated_types.cpython-38-darwin.so    mypy.py
annotated_types.py                      networks.cpython-38-darwin.so
class_validators.cpython-38-darwin.so   networks.py
class_validators.py                     parse.cpython-38-darwin.so
color.cpython-38-darwin.so              parse.py
color.py                                py.typed
dataclasses.cpython-38-darwin.so        schema.cpython-38-darwin.so
dataclasses.py                          schema.py
datetime_parse.cpython-38-darwin.so     tools.cpython-38-darwin.so
datetime_parse.py                       tools.py
decorator.cpython-38-darwin.so          types.cpython-38-darwin.so
decorator.py                            types.py
env_settings.cpython-38-darwin.so       typing.cpython-38-darwin.so
env_settings.py                         typing.py
error_wrappers.cpython-38-darwin.so     utils.cpython-38-darwin.so
error_wrappers.py                       utils.py
errors.cpython-38-darwin.so             validators.cpython-38-darwin.so
errors.py                               validators.py
fields.cpython-38-darwin.so             version.cpython-38-darwin.so
fields.py                               version.py
generics.py

And running our example again, we can see that it fails:

> python example.py
Traceback (most recent call last):
  File "example.py", line 9, in <module>
    cloudpickle.loads(cloudpickle.dumps(Bar(a=1))) # This fails with the error below
  File "/Users/kbarron/tmp/.venv/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/Users/kbarron/tmp/.venv/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 602, in dump
    return Pickler.dump(self, obj)
_pickle.PicklingError: Can't pickle <cyfunction int_validator at 0x101cf62b0>: attribute lookup lambda12 on pydantic.validators failed

kylebarron commented 3 years ago

Also note that the issue does NOT appear when the file is not __main__, for example:

I can reproduce this as well:

# example.py
import cloudpickle
import pickle
from models import Bar

pickle.loads(pickle.dumps(Bar(a=1))) # This works
cloudpickle.loads(cloudpickle.dumps(Bar(a=1))) # This also works, since Bar is defined outside __main__

# models.py
import pydantic

class Bar(pydantic.BaseModel):
    a: int

This works fine, so a quick workaround is to always define Pydantic models in a separate file.

ericman93 commented 3 years ago

I'm still having this issue in cloudpickle 2.0.0. It only works with non-Cython pydantic and with my pydantic models declared in a separate file.

crclark commented 2 years ago

@ogrisel I am also still seeing this issue in 2.0.0. The workaround in https://github.com/cloudpipe/cloudpickle/issues/408#issuecomment-933760919 works for me, but I believe this issue should be reopened.

rjurney commented 2 years ago

I have this issue with pydantic and pyspark.

../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/pandas/map_ops.py:91: in mapInPandas
    udf_column = udf(*[self[col] for col in self.columns])
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:276: in wrapper
    return self(*args)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:249: in __call__
    judf = self._judf
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:215: in _judf
    self._judf_placeholder = self._create_judf(self.func)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:224: in _create_judf
    wrapped_func = _wrap_function(sc, func, self.returnType)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/sql/udf.py:50: in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/rdd.py:3345: in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/serializers.py:458: in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/cloudpickle/cloudpickle_fast.py:73: in dumps
    cp.dump(obj)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pyspark.cloudpickle.cloudpickle_fast.CloudPickler object at 0x7ff5f0410700>
obj = (<function test_graphlet_etl.<locals>.horror_to_movie at 0x7ff5d0e81480>, StructType([StructField('entity_id', StringT...ld('length', LongType(), False), StructField('gross', LongType(), False), StructField('rating', StringType(), False)]))

    def dump(self, obj):
        try:
>           return Pickler.dump(self, obj)
E           _pickle.PicklingError: Can't pickle <cyfunction str_validator at 0x7ff5b0461220>: it's not the same object as pydantic.validators.str_validator

../../opt/anaconda3/envs/graphlet/lib/python3.10/site-packages/pyspark/cloudpickle/cloudpickle_fast.py:602: PicklingError

brettc commented 2 years ago

I've just been bitten by this. @ogrisel, can we reopen this issue? The workaround is not an option if you are defining your objects inside a jupyter notebook.

[screenshot: the same PicklingError raised from a Jupyter notebook]

simon-mo commented 2 years ago

@brettc as a workaround, you can define custom serializers to pack and unpack pydantic objects. This might help your use case.

https://github.com/ray-project/ray/blob/eed90495cedad0dc2fb6ea6d430df61e4eac24f4/python/ray/util/serialization_addons.py#L10-L35
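
A stdlib-only sketch of that pack/unpack pattern, using copyreg and a plain stand-in class (all names here are hypothetical, not pydantic's or Ray's API):

```python
import copyreg
import pickle

# Stand-in for a pydantic model class; the real workaround would
# register reducers for your BaseModel subclasses instead.
class Model:
    def __init__(self, **fields):
        self.__dict__.update(fields)

def _rebuild(cls, fields):
    # Reconstruct by calling the constructor with the saved field dict.
    return cls(**fields)

def _reduce_model(obj):
    # Pack the instance as (class, field dict) so the pickle stream
    # carries only plain data, never the validator cyfunctions.
    return _rebuild, (type(obj), dict(obj.__dict__))

# Register the custom reducer for instances of Model.
copyreg.pickle(Model, _reduce_model)

m = pickle.loads(pickle.dumps(Model(a=1)))
print(m.a)  # 1
```

The Ray helper linked above does the same thing through Ray's own serializer registry rather than copyreg, but the idea is identical: round-trip the model through its field dict instead of its internals.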

brettc commented 2 years ago

@simon-mo thanks for the tip -- this looks very promising! The error occurs for me when I'm using dask, so I guess you had the same issues in ray. (BTW, ray is amazing. I chose dask for this job because ray seemed like overkill).

zero1zero commented 1 year ago

I'm still struggling to find a workaround for this issue. My code is not directly defining any pydantic types (although it is used by dependent libraries).

Is there a version upgrade/downgrade that might be the cause? It's unclear where the actual issue is occurring. In my case it looks to be in the chain of uvicorn and kserve:

Traceback (most recent call last):
  File "/.asdf/installs/python/3.9.11/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/.asdf/installs/python/3.9.11/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/Library/Caches/pypoetry/virtualenvs/truss-FUoNelHr-py3.9/lib/python3.9/site-packages/kserve/model_server.py", line 275, in servers_task
    await asyncio.gather(*servers)
  File "/Library/Caches/pypoetry/virtualenvs/truss-FUoNelHr-py3.9/lib/python3.9/site-packages/kserve/model_server.py", line 269, in serve
    server.start()
  File "/.asdf/installs/python/3.9.11/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/.asdf/installs/python/3.9.11/lib/python3.9/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/.asdf/installs/python/3.9.11/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/.asdf/installs/python/3.9.11/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/.asdf/installs/python/3.9.11/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/.asdf/installs/python/3.9.11/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/.asdf/installs/python/3.9.11/lib/python3.9/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <cyfunction str_validator at 0x16b57c790>: it's not the same object as pydantic.validators.str_validator

dumitrescustefan commented 1 year ago

This still happens. I have to define pydantic models in another file, otherwise I get this error. Even in a simple file where I define a pydantic param class and a Ray actor with a single method, this happens. Using the latest ray, pydantic, etc.

lesteve commented 11 months ago

I agree this issue still exists, and I believe it is actually fixed in pydantic 2.5 (see issue and PR) when you run your script directly with Python. An issue still exists inside Jupyter/IPython: https://github.com/pydantic/pydantic/issues/8232.

If you get an error like the one below, it likely means you are using pydantic<2, and I would say this is not very likely to get fixed in pydantic (see https://docs.pydantic.dev/latest/version-policy/#pydantic-v1):

_pickle.PicklingError: Can't pickle <cyfunction int_validator at 0x7f5cb91e01e0>: it's not the same object as pydantic.validators.int_validator

In this case, the simplest workaround seems to be to define your pydantic model in a separate file, as noted in https://github.com/cloudpipe/cloudpickle/issues/408#issuecomment-933760919.
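
The "not the same object" variant of the error can also be reproduced with stock pickle alone: the by-reference lookup succeeds, but returns a different object than the one being pickled. A minimal sketch (the math.sqrt naming is purely illustrative):

```python
import pickle

def sqrt(x):
    return x ** 0.5

# Pretend this function lives at math.sqrt, the way a cyfunction
# claims to live at pydantic.validators.str_validator. The lookup
# finds the real math.sqrt, which is a different object, so the
# identity check in pickle's save_global fails.
sqrt.__module__ = "math"

try:
    pickle.dumps(sqrt)
    err = None
except pickle.PicklingError as exc:
    err = exc

print(err)  # ... it's not the same object as math.sqrt
```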

rjurney commented 11 months ago

Can someone remind me of what it means if this is fixed? I think it means Spark can serialize numpy arrays?