sgrzemski opened this issue 2 years ago
It looks like this counts the number of DAG runs grouped by status since the beginning of time, meaning that CPU usage will climb endlessly over time as more history accumulates.
https://github.com/epoch8/airflow-exporter/blob/master/airflow_exporter/prometheus_exporter.py#L110
SELECT anon_1.dag_id AS anon_1_dag_id,
       anon_1.task_id AS anon_1_task_id,
       anon_1.state AS anon_1_state,
       anon_1.cnt AS anon_1_cnt,
       dag.owners AS dag_owners
FROM (SELECT task_instance.dag_id AS dag_id,
             task_instance.task_id AS task_id,
             task_instance.state AS state,
             count(task_instance.dag_id) AS cnt
      FROM task_instance
      GROUP BY task_instance.dag_id, task_instance.task_id, task_instance.state) AS anon_1
JOIN dag ON dag.dag_id = anon_1.dag_id
JOIN serialized_dag ON serialized_dag.dag_id = anon_1.dag_id
ORDER BY anon_1.dag_id
Merge Join (cost=69565.58..92233.43 rows=47670 width=94)
Merge Cond: ((dag.dag_id)::text = (task_instance.dag_id)::text)
-> Sort (cost=133.50..133.88 rows=151 width=81)
Sort Key: dag.dag_id
-> Hash Join (cost=13.40..128.03 rows=151 width=81)
Hash Cond: ((dag.dag_id)::text = (serialized_dag.dag_id)::text)
-> Seq Scan on dag (cost=0.00..114.08 rows=208 width=48)
-> Hash (cost=11.51..11.51 rows=151 width=33)
-> Seq Scan on serialized_dag (cost=0.00..11.51 rows=151 width=33)
-> Materialize (cost=69432.08..91458.31 rows=65665 width=81)
-> Finalize GroupAggregate (cost=69432.08..90637.50 rows=65665 width=81)
Group Key: task_instance.dag_id, task_instance.task_id, task_instance.state
-> Gather Merge (cost=69432.08..88667.55 rows=131330 width=81)
Workers Planned: 2
-> Partial GroupAggregate (cost=68432.06..72508.79 rows=65665 width=81)
Group Key: task_instance.dag_id, task_instance.task_id, task_instance.state
-> Sort (cost=68432.06..69116.08 rows=273606 width=73)
Sort Key: task_instance.dag_id, task_instance.task_id, task_instance.state
-> Parallel Seq Scan on task_instance (cost=0.00..31564.06 rows=273606 width=73)
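The plan above shows that every scrape has to read and sort the entire task_instance table (Parallel Seq Scan followed by a Sort and GroupAggregate). For comparison, here is a minimal sketch of a time-bounded variant of that inner aggregate; the 14-day window and the use of the standard task_instance.start_date column are assumptions for illustration, not something the exporter does today:

```sql
-- Sketch only: the same grouped count, restricted to recent history so the
-- scan no longer grows with the full lifetime of the task_instance table.
-- The 14-day window and filtering on start_date are illustrative assumptions.
SELECT ti.dag_id,
       ti.task_id,
       ti.state,
       count(*) AS cnt
FROM task_instance AS ti
WHERE ti.start_date >= now() - interval '14 days'
GROUP BY ti.dag_id, ti.task_id, ti.state;
```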
Hello to all developers,

I am using this exporter with an Airflow instance that has around 80 DAGs. I recently noticed that querying the /admin/metrics endpoint every 30s kills my PostgreSQL database: each request for the metrics takes around 18 to 22s and causes 100% CPU usage on the instance. With the exporter disabled, PostgreSQL CPU usage sits at only 10-20%. Is there anything we can do about this with the exporter?

Kind regards,
Szymon