apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: Python 3.12 incompatibility of Apache Beam #32617

Open potiuk opened 1 day ago

potiuk commented 1 day ago

What happened?

I would like to report that Python 3.12 support for Apache Beam is a bit broken because the Python SDK depends on an old version of dill (and cloudpickle as well, but that is not likely a blocker).

Currently in Apache Airflow, the Beam provider is disabled for Python 3.12, because adding Apache Beam with its dependencies made it impossible to have non-conflicting dependencies. After the last release of Apache Beam (2.59.0) I was hoping all the problems with Python 3.12 were solved, and attempted to rebase the PR bringing the Beam provider back to Python 3.12, but unfortunately our tests showed that there is one more conflict left.

You can see a failing build here: https://github.com/apache/airflow/actions/runs/11121136124/job/30899938977?pr=41990, and the PR to bring Beam back is https://github.com/apache/airflow/pull/42505.

The failing tests are not Beam tests; they are tests that exercise "dill" serialization for the Airflow PythonVirtualenvOperator, and the error is this:

INFO     airflow.utils.process_utils:process_utils.py:190 Output:
INFO     airflow.utils.process_utils:process_utils.py:194 Traceback (most recent call last):
INFO     airflow.utils.process_utils:process_utils.py:194   File "/tmp/venv-callsdqfisel/script.py", line 72, in <module>
INFO     airflow.utils.process_utils:process_utils.py:194     arg_dict = dill.load(file)
INFO     airflow.utils.process_utils:process_utils.py:194                ^^^^^^^^^^^^^^^
INFO     airflow.utils.process_utils:process_utils.py:194   File "/usr/local/lib/python3.12/site-packages/dill/_dill.py", line 270, in load
INFO     airflow.utils.process_utils:process_utils.py:194     return Unpickler(file, ignore=ignore, **kwds).load()
INFO     airflow.utils.process_utils:process_utils.py:194            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO     airflow.utils.process_utils:process_utils.py:194   File "/usr/local/lib/python3.12/site-packages/dill/_dill.py", line 472, in load
INFO     airflow.utils.process_utils:process_utils.py:194     obj = StockUnpickler.load(self)
INFO     airflow.utils.process_utils:process_utils.py:194           ^^^^^^^^^^^^^^^^^^^^^^^^^
INFO     airflow.utils.process_utils:process_utils.py:194 TypeError: code() argument 13 must be str, not int

Analysis of the issue showed that the dill version Apache Beam requires is not compatible with Python 3.12 and produces this error. Before re-enabling Beam for Python 3.12, the tests were passing on Python 3.12 with dill 0.3.9, but apache-beam has a very strict requirement on the dill version.
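For illustration, here is a minimal sketch of the kind of round-trip that breaks, assuming dill 0.3.1.1 (the version Beam pins) installed on Python 3.12; the exact failure point may differ slightly from the Airflow traceback above.

# Minimal repro sketch, assuming dill==0.3.1.1 on Python 3.12. Old dill
# serializes code objects using the pre-3.8 CodeType argument order, so
# rebuilding them on 3.12 can fail with the same
# "TypeError: code() argument 13 must be str, not int" seen above.
import dill

def double(x):
    return x * 2

payload = dill.dumps(double)    # serializes the function, including its code object
restored = dill.loads(payload)  # old dill reconstructs the code object positionally;
                                # on Python 3.12 this can raise the TypeError above
print(restored(21))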

This is what happens when we add Apache Beam to a Python 3.12 environment:

> apache-beam==2.59.0
145c146
< cloudpickle==3.0.0
---
> cloudpickle==2.2.1
167c168
< dill==0.3.9
---
> dill==0.3.1.1
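
(For a quick check of what the resolver actually installed alongside apache-beam, a small stdlib-only snippet like the following can be used; it assumes the packages above are present in the current environment.)

# Hedged check of the resolved versions, using only the standard library.
from importlib.metadata import version

for pkg in ("apache-beam", "dill", "cloudpickle"):
    print(pkg, version(pkg))
# With apache-beam==2.59.0 present, this is expected to print
# dill 0.3.1.1 and cloudpickle 2.2.1 rather than the newer versions.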

And the dill downgrade is caused by this pin in Beam's setup.py:

          # Dill doesn't have forwards-compatibility guarantees within minor
          # version. Pickles created with a new version of dill may not unpickle
          # using older version of dill. It is best to use the same version of
          # dill on client and server, therefore list of allowed versions is
          # very narrow. See: https://github.com/uqfoundation/dill/issues/341.
          'dill>=0.3.1.1,<0.3.2',
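
To make the effect of that pin concrete, a hedged check using the third-party packaging library (an assumption here, not something Beam itself uses) shows that the specifier rejects the dill 0.3.9 that Airflow otherwise installs:

# Hedged illustration with the third-party "packaging" library (an assumption,
# not part of Beam): the pin accepts only the 0.3.1.x line of dill.
from packaging.specifiers import SpecifierSet

spec = SpecifierSet(">=0.3.1.1,<0.3.2")
print("0.3.1.1" in spec)  # True  - the only line Beam accepts
print("0.3.9" in spec)    # False - the version Airflow's tests pass with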

Also, cloudpickle is downgraded to 2.2.1 due to this pin:

          # It is prudent to use the same version of pickler at job submission
          # and at runtime, therefore bounds need to be tight.
          # To avoid depending on an old dependency, update the minor version on
          # every Beam release, see: https://github.com/apache/beam/issues/23119
          'cloudpickle~=2.2.1',

But cloudpickle is not as problematic as dill is in this case - simply because the old version of dill does not properly support Python 3.12.

It would be great if the next release of Apache Beam bumped at least dill to the latest version (and possibly cloudpickle), as this would finally allow the Apache Beam provider in Airflow to support Python 3.12.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

liferoad commented 1 day ago

@tvalentyn @claudevdm

Abacn commented 1 day ago

From the history (e.g. #21898), it looks like upgrading dill is not trivial.

potiuk commented 20 hours ago

From the history (e.g. #21898), it looks like upgrading dill is not trivial.

Can we help somehow with that?

liferoad commented 18 hours ago

Feel free to take the issue. We are also working on improving cloudpickle (e.g., https://github.com/apache/beam/issues/26209) and we hope we can make cloudpickle the default in the future.

potiuk commented 18 hours ago

Feel free to take the issue. We are also working on improving cloudpickle (e.g., https://github.com/apache/beam/issues/26209) and we hope we can make cloudpickle the default in the future.

Unfortunately I know very little of Beam's dill usage. I might attempt to move it forward, but I am not sure if a PR led by me would be more help than a burden :). Still, I might try if you think it is a good idea.

tvalentyn commented 15 hours ago

But cloudpickle is not as problematic as dill is in this case - simply because the old version of dill does not properly support Python 3.12.

It would be great if the next release of Apache Beam bumped at least dill to the latest version

It is a non-trivial change and I don't recommend that route; we won't be able to merge such a change. We can try to monkey-patch dill 0.3.1.1 on Python 3.12. Beam already has this change: https://github.com/apache/beam/blob/2fb9efc6e52697ad9a0ae06f81e2672365179b3c/sdks/python/apache_beam/internal/dill_pickler.py#L67 - is this patch applied before the Airflow test is run or not?

(and possibly cloudpickle), as this would finally allow the Apache Beam provider in Airflow to support Python 3.12.

Upgrading cloudpickle to the next major version should be doable; we might get to it soon, but not before the next release. We are finally making some progress on switching to cloudpickle.
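
(For readers who want to experiment with cloudpickle today, a hedged sketch of opting in per pipeline via the pickle_library option follows; the option name reflects my understanding of the current SDK and should be verified against the docs for the Beam release in use.)

# Hedged sketch: selecting cloudpickle per pipeline via the "pickle_library"
# option (verify the option against the docs for the Beam release in use).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(pickle_library="cloudpickle")

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | beam.Create([1, 2, 3])
     | beam.Map(lambda x: x * 2)
     | beam.Map(print))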

potiuk commented 13 hours ago

is this patch applied before the Airflow test is run or not?

We are using apache-beam 2.59.0 (the latest release) for those tests, and from what I see "@dill.register(CodeType)" is part of it. But I think this explains exactly what happens.

The problem in this case is that apache-beam has dill as a required dependency and limits its version to 0.3.1.1 - and even if Beam itself monkey-patches it in its own code, that does not mean that any other user of dill in the same virtualenv will benefit from that patching. In our case we have a single venv where our users potentially install multiple providers, Beam being one of them. That means any other provider (or Airflow core) will have dill 0.3.1.1 installed, as forced by Beam. But if the task that you run does not use Beam, it will never import the apache.beam provider code or the apache-beam package, so dill will not be monkey-patched.
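
To make that concrete, a hedged sketch of the failing code path (the pickle path is hypothetical; the Beam module path is the one linked above):

# Hedged sketch: the CodeType patch lives in apache_beam.internal.dill_pickler
# and is applied only when that module is imported. A generated venv script
# that uses dill directly, like the one in the traceback above, never imports
# Beam, so it gets stock dill 0.3.1.1 behaviour.
import dill  # resolved to 0.3.1.1 in the venv, because apache-beam pins it

with open("/tmp/venv-xyz/script_args.pkl", "rb") as file:  # hypothetical path
    arg_dict = dill.load(file)  # fails on Python 3.12 with the CodeType TypeError

# Only a process that first runs something like
#     from apache_beam.internal import dill_pickler  # applies @dill.register(CodeType)
# would go through the patched code path.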

In the case of the failing build here: https://github.com/apache/airflow/actions/runs/11121136124/job/30899938977?pr=41990 - you can see that it's not "beam" tests that fail; those are "PythonVirtualenv" tests, and this happens only for Python 3.12 and only when apache-beam is installed, which forces the downgrade of dill from 0.3.9 to 0.3.1.1. Previously those tests passed successfully when dill 0.3.9 was installed (and no apache-beam was installed).

So monkey-patching possibly solves Beam's own usage of dill, but dragging dill down to 0.3.1.1 makes other packages in the same environment that do not do similar monkey-patching fail.