Closed: mshalak-nix closed this issue 4 years ago
Thanks for opening your first issue here! Be sure to follow the issue template!
Is there a particular endpoint that is slow in the webserver?
Looks like I figured it out. The issue was with the Prometheus exporter I was using; here is the fix: https://github.com/robinhood/airflow-prometheus-exporter/pull/25 The exporter generated incorrect joins, which caused the same metric to be repeated many times. Because the Prometheus endpoint was served by the webserver, the webserver constantly had to execute a DB query that produced thousands of duplicate metrics (depending on the number of tasks in the DB), which is why it was so slow. Closing this issue.
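For anyone hitting something similar, here is a minimal sketch of how a join on an incomplete key multiplies rows. It is not the exporter's actual query: the two tables are simplified, hypothetical stand-ins for `dag_run` and `task_instance`, and the function name `join_row_counts` is made up for illustration. It runs against an in-memory SQLite database.

```python
import sqlite3


def join_row_counts(num_runs):
    """Return (rows from a dag_id-only join, rows from a full-key join)."""
    conn = sqlite3.connect(":memory:")
    # Simplified, hypothetical schema: one DAG, one task, num_runs runs.
    conn.executescript(
        """
        CREATE TABLE dag_run (dag_id TEXT, execution_date TEXT);
        CREATE TABLE task_instance (dag_id TEXT, task_id TEXT, execution_date TEXT);
        """
    )
    runs = [("example_dag", f"run_{i}") for i in range(num_runs)]
    conn.executemany("INSERT INTO dag_run VALUES (?, ?)", runs)
    conn.executemany("INSERT INTO task_instance VALUES (?, 'task_a', ?)", runs)

    # Joining on dag_id alone pairs every task instance with every run
    # of the same DAG, so the result grows quadratically.
    loose = conn.execute(
        "SELECT COUNT(*) FROM task_instance ti "
        "JOIN dag_run dr ON ti.dag_id = dr.dag_id"
    ).fetchone()[0]

    # Joining on the full key matches each task instance to its own run only.
    strict = conn.execute(
        "SELECT COUNT(*) FROM task_instance ti "
        "JOIN dag_run dr ON ti.dag_id = dr.dag_id "
        "AND ti.execution_date = dr.execution_date"
    ).fetchone()[0]

    conn.close()
    return loose, strict


if __name__ == "__main__":
    print(join_row_counts(10))
```

With 10 runs the loose join returns 100 rows and the full-key join returns 10, which is the kind of blow-up that would make a metrics endpoint more expensive as the tables grow.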
Apache Airflow version: 1.10.9
Kubernetes version: 1.15.11
Environment:
What happened:
I noticed that Airflow webserver performance degrades over time. At one point all my webserver pods were at 100% CPU, so the UI and API became too slow to use. I investigated, and the problem appears to depend on the number of task instances stored in the DB: I had 100,000+ rows in the task_instance table. When I checked which process was using all the CPU on the webserver pods, it was gunicorn. After I cleaned up the task_instance table, CPU usage dropped to almost nothing. Right now I have 30,000 completed tasks and I am seeing CPU spikes again. The issue also seems to go away if I disable all my DAGs. It looks like the webserver periodically runs some query that consumes all the CPU.
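For reference, a rough sketch of the kind of cleanup described above. It assumes a simplified `task_instance` table with `execution_date` stored as ISO-8601 text and a `state` column; the helper name `prune_task_instances` and the choice of finished states are illustrative assumptions, not what was actually run. It uses SQLite so it can be tried safely; a real metadata DB should be backed up first.

```python
import sqlite3

# Assumption: only rows in these finished states are safe to remove.
FINISHED_STATES = ("success", "failed")


def prune_task_instances(conn, cutoff):
    """Delete finished task instances older than cutoff; return rows deleted."""
    placeholders = ", ".join("?" for _ in FINISHED_STATES)
    cur = conn.execute(
        "DELETE FROM task_instance "
        f"WHERE execution_date < ? AND state IN ({placeholders})",
        (cutoff, *FINISHED_STATES),
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance "
        "(task_id TEXT, dag_id TEXT, execution_date TEXT, state TEXT)"
    )
    conn.executemany(
        "INSERT INTO task_instance VALUES (?, ?, ?, ?)",
        [
            ("task_a", "example_dag", "2020-01-01", "success"),
            ("task_a", "example_dag", "2020-01-02", "running"),
            ("task_a", "example_dag", "2020-03-01", "success"),
        ],
    )
    print(prune_task_instances(conn, "2020-02-01"))
```

Running or queued rows are left alone on purpose, so an in-flight DAG run is not affected by the cleanup.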
What you expected to happen:
The number of completed tasks should not affect webserver performance.
How to reproduce it:
Generate 100,000 task instances and have 10 enabled DAGs in Airflow. Webserver CPU usage will be high and the web UI becomes unusable.