Webserver becomes unusable after 100,000 tasks were completed

mshalak-nix commented 4 years ago

Apache Airflow version: 1.10.9

Kubernetes version: 1.15.11

Environment:

Cloud provider: Google Cloud

What happened:

I started to notice that Airflow webserver performance degrades over time. At one point all my webserver pods were at 100% CPU, so the UI and API became too slow, even unusable. I started to investigate, and it looks like the problem depends on the number of task instances stored in the DB. I had 100,000+ of them in the task_instance table. When I checked which process was using all the CPU on the webserver pods, it was gunicorn. After I cleaned up the task_instance table, CPU usage dropped to nothing. Right now I have 30,000 completed tasks and I see CPU spikes again.

[Screenshot from 2020-05-07 10-43-27]

This issue also seems to go away if I disable all my DAGs. It looks like the webserver has some query that runs from time to time and consumes all the CPU.

What you expected to happen:

The number of completed tasks should not affect webserver performance.

How to reproduce it:

Generate 100,000 tasks and have 10 enabled DAGs in Airflow. Webserver CPU usage will be high and the web UI becomes unusable.

boring-cyborg[bot] commented 4 years ago

Thanks for opening your first issue here! Be sure to follow the issue template!

kaxil commented 4 years ago

Is there a particular endpoint that is slow in the webserver?

mshalak-nix commented 4 years ago

It looks like I figured it out. The issue was with the Prometheus exporter I was using; here is the fix: https://github.com/robinhood/airflow-prometheus-exporter/pull/25 The exporter generated wrong joins, which caused a single metric to be repeated many times. Since the Prometheus endpoint was served by the webserver, the webserver constantly had to execute a DB query that produced thousands of duplicate metrics (depending on the number of tasks in the DB), which is why it was so slow. Closing this issue.
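For anyone hitting something similar, here is a minimal sketch of how a join can make one metric repeat in proportion to the number of task instances. It is not the exporter's actual query: the simplified `dag` and `task_instance` tables, the `is_paused` metric, and the function names are all assumptions made for illustration, and an in-memory SQLite database stands in for the Airflow metadata DB.

```python
import sqlite3


def build_db(tasks_per_dag):
    """Create a toy in-memory DB (hypothetical, simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dag (dag_id TEXT PRIMARY KEY, is_paused INTEGER);
        CREATE TABLE task_instance (dag_id TEXT, task_id TEXT, state TEXT);
        """
    )
    dags = [("dag_a", 0), ("dag_b", 1)]
    conn.executemany("INSERT INTO dag VALUES (?, ?)", dags)
    task_rows = [
        (dag_id, f"task_{i}", "success")
        for dag_id, _ in dags
        for i in range(tasks_per_dag)
    ]
    conn.executemany("INSERT INTO task_instance VALUES (?, ?, ?)", task_rows)
    return conn


def dag_metric_rows_with_fanout(conn):
    """Per-DAG metric joined to task_instance: one row per task instance."""
    return conn.execute(
        """
        SELECT d.dag_id, d.is_paused
        FROM dag AS d
        JOIN task_instance AS ti ON ti.dag_id = d.dag_id
        """
    ).fetchall()


def dag_metric_rows(conn):
    """Same metric without the unnecessary join: one row per DAG."""
    return conn.execute("SELECT dag_id, is_paused FROM dag").fetchall()


conn = build_db(tasks_per_dag=3)
fanout = dag_metric_rows_with_fanout(conn)
clean = dag_metric_rows(conn)
print(len(fanout), "rows with the join,", len(set(fanout)), "distinct")
print(len(clean), "rows without it")
```

In this toy setup the joined query returns one row per task instance even though only one row per DAG carries information, so the output of a metrics endpoint built on it would grow with the size of the task_instance table on every scrape.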

apache / airflow

Webserver becomes unusable after 100,000 tasks were completed #8760