JDarDagran closed this issue 5 months ago
@mobuchowski can you take a look?
Hey @RNHTTR - this is the result of our internal investigation. We think someone knowledgeable about Celery might be able to help - if not, then we'd love to go with #39235, as it solves the issue in our tests.
Apache Airflow Provider(s)
celery, openlineage
Versions of Apache Airflow Providers
No response
Apache Airflow version
2.9.0 but happens on 2.7+ too
Operating System
Darwin MacBook-Pro.local 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:10:42 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6000 arm64
Deployment
Other
Deployment details
The issue can be reproduced in all environments, both locally with breeze and in cloud deployments, e.g. Astro Cloud.
What happened

The OpenLineage listener hooks into DagRun state changes via `on_dag_run_running`/`on_dag_run_failed`/`on_dag_run_success`. When OL events are emitted over HTTP at large scale, the scheduler hangs and needs a restart. The issue appears to happen only with `CeleryExecutor`.

It could not be reproduced with OpenLineage disabled (with `[openlineage] disabled = True`) or with any other OpenLineage transport that does not use HTTP. I also experimented with using raw `urllib3` or `httpx` as alternatives to `requests`; all of the experiments produced the same bug, with the scheduler hanging.

What you think should happen instead
When reproducing with a local breeze setup with CeleryExecutor, there is this strange behaviour (outputs were attached as screenshots):

- `htop` output
- `lsof | grep CLOSE_WAIT` output
- stack trace from the scheduler's main loop
- stack trace from one of the spawned child scheduler processes

This suggests the bug has something to do with file descriptors not being closed properly. I did not follow the whole logic of how the scheduler spawns child processes, but before the scheduler gets stuck it spawns some child processes that close properly and do not cause the issue.
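The stuck child processes are consistent with a known fork/thread hazard in CPython. The following is a minimal sketch of that mechanism as an assumption about the root cause, not code taken from Airflow or the OpenLineage provider: after `os.fork()`, only the forking thread survives in the child, so a `ThreadPoolExecutor` inherited from the parent still looks usable but has no live workers, and anything submitted to it in the child waits forever.

```python
import concurrent.futures
import os

# Sketch of the suspected mechanism (an assumption, not the actual
# scheduler code): a ThreadPoolExecutor created before os.fork() keeps
# its bookkeeping in the child, but its worker threads no longer exist,
# so work submitted in the child never runs.

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
pool.submit(lambda: None).result()  # make sure the worker thread has started

pid = os.fork()
if pid == 0:
    # Child: the inherited pool *looks* alive but has no worker threads.
    fut = pool.submit(lambda: "done")
    try:
        fut.result(timeout=2)   # would block forever without the timeout
        os._exit(1)             # unexpected: the task actually ran
    except concurrent.futures.TimeoutError:
        os._exit(0)             # expected: the task is stuck in the queue
else:
    _, wait_status = os.waitpid(pid, 0)
    child_hung = (os.WEXITSTATUS(wait_status) == 0)
    print("child's pool task hung:", child_hung)
```

If something like this happens between a forked scheduler process and a thread used for emitting OL events, it would match the hang described above, regardless of which HTTP client library is used.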
How to reproduce

1. Create a `dynamic_dag.jinja2` template:
2. Generate DAG files into the `files/dags` Airflow directory, e.g. with the following script:
3. Add `trigger_dag` in `files/dags`:
4. Host a local server (skip this if you want to use some other one, e.g. `httpbin.org`):
   - Flask server in `app.py`
   - `requirements.txt`
   - `Dockerfile`
   - Build the image with `docker build . -t test-server`
   - Run it with `docker run -e FLASK_RUN_PORT=5000 -e FLASK_APP=app.py --rm -p 5000:8080 test-server`
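The `app.py` attached to the original issue is not reproduced here; any endpoint that accepts OpenLineage POSTs works for the reproduction. A dependency-free stand-in using the standard library instead of Flask (the `/api/v1/lineage` path is only an illustrative choice) could look like:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the attached Flask app: accept any POST, parse the JSON
# body the way a real OpenLineage endpoint would, and answer 200.

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)   # the OpenLineage event payload
        json.loads(body or b"{}")        # parse to mimic a real endpoint
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):        # keep the console quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)   # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Smoke test, POSTing the way an HTTP transport would.
url = f"http://127.0.0.1:{server.server_port}/api/v1/lineage"
req = urllib.request.Request(url, data=b'{"eventType": "START"}',
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    status = resp.status
print(status)  # → 200
```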
5. In `files/airflow-breeze-config/variables.env` put the following entries (for `httpbin.org`, set the entries below instead of `OPENLINEAGE_URL`):
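The exact `variables.env` entries were attached to the original issue; hypothetical values pointing the OpenLineage HTTP transport at the local test server might look like:

```shell
# Hypothetical values, not the ones from the original attachment:
OPENLINEAGE_URL=http://localhost:5000
OPENLINEAGE_NAMESPACE=breeze-repro
```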
6. Run breeze with `breeze start-airflow --executor CeleryExecutor` and trigger `trigger_dag`. Wait until the scheduler hangs and DAGs get stuck in the queued/running state.

Anything else
The issue is reproducible with a considerable number of concurrent DagRuns (100+ in my case; it may depend on the setup). I also did some tests with raw `threading` only and/or with the OpenLineage dependencies skipped entirely (sending requests directly from the `on_dag_run_...` hooks).

I am willing to submit a PR that potentially fixes the bug: the change is to use `ProcessPoolExecutor` instead of `ThreadPoolExecutor`. This seems to fix the issue completely without breaking anything else. I could not find the root cause in the end; my gut says it has something to do with threading in Python plus some Celery dependency on file descriptors that produces the bug.
Are you willing to submit PR?
Code of Conduct