apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: Python schema generated types cannot be pickled #22714

Open TheNeuralBit opened 2 years ago

TheNeuralBit commented 2 years ago

What happened?

The NamedTuple types we generate in apache_beam.typehints.schemas confound pickle libraries. We work around this in many places (e.g. GeneratedClassRowTypeConstraint #22679). We should see if we can find a way to make these types picklable, and clean up the workarounds.

Making the types work with cloudpickle should be the priority.
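For readers unfamiliar with the failure mode, here is a minimal sketch using only the standard library `pickle`. It is not Beam's implementation: `make_row_type` is a hypothetical stand-in for the schema-to-NamedTuple generation, and the field spec is an assumption chosen to mirror the test case in this thread. It shows that a dynamically generated NamedTuple class cannot be pickled by reference, while a plain description of the class can be pickled and used to rebuild an equivalent type.

```python
import pickle
from typing import NamedTuple


def make_row_type(type_name, fields):
    """Hypothetical stand-in for generating a NamedTuple from a schema."""
    return NamedTuple(type_name, fields)


# A picklable description of the type: its name and (field, type) pairs.
spec = ("BeamSchema_some_uuid", [("name", str)])
user_type = make_row_type(*spec)

# The stdlib pickler serializes classes by reference (module + name).
# No module attribute is named "BeamSchema_some_uuid", so the lookup fails.
try:
    pickle.dumps(user_type)
    pickled_ok = True
except (pickle.PicklingError, AttributeError):
    pickled_ok = False

# Workaround sketch: pickle the description instead and regenerate the class.
restored_spec = pickle.loads(pickle.dumps(spec))
restored_type = make_row_type(*restored_spec)

print(pickled_ok)
print(restored_type._fields == user_type._fields)
```

Note that `restored_type` is a new class object, not the original, so an identity or equality check between the two classes would need some form of caching keyed on the type's identifier. Whether that is the right approach for Beam is left open here.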

Issue Priority

Priority: 2

Issue Component

Component: sdk-py-core

tvalentyn commented 2 years ago

Have we tried pickling these types with CloudPickle?

TheNeuralBit commented 2 years ago

Yes, in #22679 I added a parameterized test that tries pickling with each library: https://github.com/apache/beam/blob/c7f64264451af12ff6c7c0ef4bc95fd7ce0f5418/sdks/python/apache_beam/typehints/schemas_test.py#L592-L605

With cloudpickle we get:

_______________________________________________________________________________________________ PickleTest_2.test_generated_class_pickle _______________________________________________________________________________________________

self = <apache_beam.typehints.schemas_test.PickleTest_2 testMethod=test_generated_class_pickle>

    def test_generated_class_pickle(self):
      schema = schema_pb2.Schema(
          id="some-uuid",
          fields=[
              schema_pb2.Field(
                  name='name',
                  type=schema_pb2.FieldType(atomic_type=schema_pb2.STRING),
              )
          ])
      user_type = named_tuple_from_schema(schema)

      self.assertEqual(
>         user_type, self.pickler.loads(self.pickler.dumps(user_type)))

apache_beam/typehints/schemas_test.py:605: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../../.pyenv/versions/beam/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py:73: in dumps
    cp.dump(obj)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <cloudpickle.cloudpickle_fast.CloudPickler object at 0x7fc1c273c880>, obj = <class 'apache_beam.typehints.schemas.BeamSchema_some_uuid'>

    def dump(self, obj):
        try:
>           return Pickler.dump(self, obj)
E           TypeError: cannot pickle 'google.protobuf.pyext._message.MessageDescriptor' object

../../../../.pyenv/versions/beam/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py:633: TypeError

chamikaramj commented 2 years ago

Can we close this since https://github.com/apache/beam/pull/23739 was merged?

TheNeuralBit commented 2 years ago

This is technically still an issue since dill can't pickle the types.