epoch8 / airflow-exporter

Airflow plugin to export dag and task based metrics to Prometheus.

[#9] Add Duration of DAG Runs #11

hydrosquall closed 6 years ago

hydrosquall commented 6 years ago

This targets feature request #9. It reports how long each currently running DagRun has been running.

This is useful for alerting on DagRuns that have been running longer than expected.
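
As a rough illustration of the idea (not the code in this PR), the duration of each running DagRun can be derived from its start time. `DagRunRow` and `running_durations` are hypothetical names standing in for a row of Airflow's dag_run table and for the exporter logic; only the standard library is used.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


@dataclass
class DagRunRow:
    """Hypothetical stand-in for a row of Airflow's dag_run table."""
    dag_id: str
    run_id: str
    state: str
    start_date: datetime


def running_durations(rows: List[DagRunRow], now: datetime) -> List[Tuple[str, str, float]]:
    """Return (dag_id, run_id, seconds_running) for every run still in the running state."""
    return [
        (r.dag_id, r.run_id, (now - r.start_date).total_seconds())
        for r in rows
        if r.state == "running"
    ]


# Example data (assumed): one running run started 90 minutes ago, one finished run.
now = datetime(2018, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    DagRunRow("etl", "scheduled__1", "running", now - timedelta(minutes=90)),
    DagRunRow("etl", "scheduled__0", "success", now - timedelta(hours=5)),
]
durations = running_durations(rows, now)
```

Only the running run yields a sample; finished runs are skipped because the duration is measured against the current time.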

elephantum commented 6 years ago

@hydrosquall do you think it's possible to have a generalized metric, "duration of oldest dagrun for a specific status"?

For example, we sometimes have a problem where a dagrun is stuck in the "queued" state. We'd like to alert when a dagrun has been queued for more than one hour.
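
A minimal sketch of what such a generalized metric could compute, assuming each row carries an id, a state, and the time it entered that state. The input shape and the name `oldest_age_by_state` are illustrative assumptions, not part of the exporter.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, Tuple

# Hypothetical input shape: (dag_id, state, time the row entered that state).
Row = Tuple[str, str, datetime]


def oldest_age_by_state(rows: Iterable[Row], now: datetime) -> Dict[Tuple[str, str], float]:
    """For each (dag_id, state) pair, return the age in seconds of the oldest row."""
    oldest: Dict[Tuple[str, str], float] = {}
    for dag_id, state, since in rows:
        age = (now - since).total_seconds()
        key = (dag_id, state)
        oldest[key] = max(oldest.get(key, 0.0), age)
    return oldest


now = datetime(2018, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    ("etl", "queued", now - timedelta(hours=2)),
    ("etl", "queued", now - timedelta(minutes=10)),
    ("etl", "running", now - timedelta(minutes=30)),
]
ages = oldest_age_by_state(rows, now)

# The "queued for more than one hour" alert condition from the comment above.
stuck = [key for key, age in ages.items() if key[1] == "queued" and age > 3600]
```

Keeping one value per (dag_id, state) pair bounds the number of exported series, at the cost of not identifying which individual row is the oldest.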

hydrosquall commented 6 years ago

Hi @elephantum -

I checked the DagRun model, and it looks like there are only 3 possible DagRun states (running, success, failed). I wanted to capture a generalized duration for all 3 states, but the problem is that end_date was not always stored on the DagRun.

https://github.com/apache/incubator-airflow/blob/1f038a7919207338471d31890f76e71e5cb4571c/airflow/utils/state.py#L60

Queued status alerts are possible, but that felt to me like it would belong to a Duration of TaskInstance metric instead.

elephantum commented 6 years ago

You're right. I was thinking about TaskInstance while your PR is about DagRun. I'm checking it locally and merging.

elephantum commented 6 years ago

@hydrosquall One more question: as I see it, when we have three simultaneous DagRuns for the same dag_id, we'll have three metrics for that dag_id. Will this actually be useful?

What is your target scenario for monitoring?

hydrosquall commented 6 years ago

Good question! I believe each one of the DagRuns will create a unique row with its own run_id, and it's completely fine for multiple run_ids to share the same dag_id.

In the scenario where I'm using this, 1 dag_id will only have 1 active run at a time. However, I don't believe multiple concurrent DagRuns would break the sort of alerting that I want, since I'm interested in being notified if any DagRun runs beyond a certain period of time.
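
To make the concurrent case concrete, here is a small sketch assuming one sample per (dag_id, run_id) pair. The names and values are made up for illustration, and the threshold check stands in for whatever alert rule is actually used.

```python
from typing import Dict, List, Tuple

# Hypothetical label pair identifying one active DagRun: (dag_id, run_id).
Sample = Tuple[str, str]


def runs_over_threshold(durations: Dict[Sample, float], threshold_seconds: float) -> List[Sample]:
    """Return every (dag_id, run_id) whose duration exceeds the threshold, sorted."""
    return sorted(key for key, seconds in durations.items() if seconds > threshold_seconds)


# Three simultaneous runs of the same dag_id produce three separate samples.
durations = {
    ("etl", "run_a"): 600.0,
    ("etl", "run_b"): 4500.0,
    ("etl", "run_c"): 7200.0,
}
late = runs_over_threshold(durations, 3600.0)
```

Because each run is checked independently, the alert fires as soon as any single run crosses the threshold, regardless of how many runs share the dag_id.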

elephantum commented 6 years ago

Ok, I can't see any issues with this approach.