PBWebMedia / airflow-prometheus-exporter

Export Airflow metrics (from MySQL) in Prometheus format
MIT License

Add dagRunTime metrics #8

Open atrbgithub opened 6 years ago

atrbgithub commented 6 years ago

This adds a DAG current runtime metric.

For each DAG, if it is not running, -1 is returned. If it is running, it shows the current runtime. This is useful for alerting on DAGs which may be taking much longer than usual.
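For illustration, the exposed gauge might look roughly like this in Prometheus exposition format (the metric and DAG names here are hypothetical, not taken from the PR itself):

```
# HELP airflow_dag_run_time Run time in seconds of the longest currently-running instance of each DAG, or -1 if none is running
# TYPE airflow_dag_run_time gauge
airflow_dag_run_time{dag_id="etl_daily"} 3724
airflow_dag_run_time{dag_id="report_hourly"} -1
```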

DemonTPx commented 6 years ago

A DAG can have multiple instances running at the same time. This would conflict with the metric which has one run time per DAG.

Also, returning the current runtime makes interpolating between scrapes quite difficult. What if you want to query the runtime, but the DAG ran between two scrapes?

I also think returning -1 is not a great idea. It makes calculations with the metric impossible.

I think a better solution would be to sum all of a DAG's run times in seconds. We could then use the `increase` function to see how much time is spent on a DAG between certain intervals. You would also be able to sum the run times of multiple DAGs.
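The counter approach described above could be queried along these lines in PromQL (the metric name is hypothetical, assuming a counter of accumulated run time per DAG):

```promql
# Seconds spent running DAG "etl_daily" over the last hour
increase(airflow_dag_run_time_seconds_total{dag_id="etl_daily"}[1h])

# Total time spent across all DAGs over the last hour
sum(increase(airflow_dag_run_time_seconds_total[1h]))
```

A counter survives interpolation between scrapes, since `increase` works on the accumulated total rather than an instantaneous value.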

atrbgithub commented 6 years ago

> A DAG can have multiple instances running at the same time. This would conflict with the metric which has one run time per DAG.

I guess the metric could be named a little better. Rather than dag_current_run_time it should be dag_instance_current_max_run_time (though I'm not sure that is much better 🙂). The SQL query is written so that it looks for the longest-running instance of a DAG which is currently executing. The reason behind it was that we wanted to be able to alert on a DAG run which had been running for a long period of time. We were using the SLA feature in Airflow, but this only triggers once a DAG has finished executing.
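A sketch of the kind of query being described, against Airflow's `dag_run` metadata table (column names and the MySQL-specific functions are assumptions here, not the PR's actual SQL):

```sql
-- Longest currently-executing run per DAG, in seconds.
-- TIMESTAMPDIFF and NOW() are MySQL; other backends differ.
SELECT
    dag_id,
    MAX(TIMESTAMPDIFF(SECOND, start_date, NOW())) AS current_run_time
FROM dag_run
WHERE state = 'running'
GROUP BY dag_id;
```

Taking `MAX` per `dag_id` is what collapses multiple concurrent runs of the same DAG into the single value the metric exposes.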

> Also, returning the current runtime makes interpolating between scrapes quite difficult. What if you want to query the runtime, but the DAG ran between two scrapes?

This is fine for what we need it for, such as a simple alert that a DAG instance has been running for > 24 hours.
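As a hedged sketch, that kind of alert could be expressed as a Prometheus alerting rule (the metric name and threshold are illustrative, assuming the gauge described above):

```yaml
groups:
  - name: airflow
    rules:
      - alert: DagRunningTooLong
        # 86400 seconds = 24 hours; -1 (not running) never matches
        expr: airflow_dag_run_time > 86400
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DAG {{ $labels.dag_id }} has been running for over 24 hours"
```

Note that with this expression a negative or zero "not running" sentinel is simply filtered out by the `>` comparison.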

> I also think returning -1 is not a great idea. It makes calculations with the metric impossible.

It should work fine for a simple > comparison. If the DAG is not running, what would a better value be: zero? Or perhaps omitting it from the metrics returned?

> I think a better solution would be to sum all of a DAG's run times in seconds. We could then use the `increase` function to see how much time is spent on a DAG between certain intervals. You would also be able to sum the run times of multiple DAGs.

Do you mean summing all the currently executing instances of a DAG, or summing across all DAGs?

samwedge commented 5 years ago

Hi @atrbgithub @DemonTPx

Just looking for some closure on this, so that the issues raised above can be addressed.

To give some context, as it might help explain why the SQL is written the way it is: we are using this metric to issue alerts if a DAG has a run that takes longer than usual. We could rename the metric to dag_run_current_max (or similar) if that offers clarification?

> A DAG can have multiple instances running at the same time. This would conflict with the metric which has one run time per DAG.

Taking the longest running of all runs for a given DAG is acceptable for our use case. The SQL addresses this. It allows us to get one alert per DAG so that investigation can take place.

> Also, returning the current runtime makes interpolating between scrapes quite difficult. What if you want to query the runtime, but the DAG ran between two scrapes?

If the DAG run has completed between two scrapes, then no alert is required, as the run has finished and any potential issue has resolved itself. If the DAG is in a running state, then logic can be applied to the metric to determine whether the run time is anomalous, and an alert generated if required.

> I also think returning -1 is not a great idea. It makes calculations with the metric impossible.

The reason for -1 is so that we can still update the metric when no DAG runs are in a running state. This allows us to resolve any alerts generated from this metric. We can return a value of 0 instead, if that is preferable?

> I think a better solution would be to sum all of a DAG's run times in seconds. We could then use the `increase` function to see how much time is spent on a DAG between certain intervals. You would also be able to sum the run times of multiple DAGs.

Summing all runs together in this way would prevent us from doing what we want with the metric, which is to alert when a particular DAG has a run with an anomalous run time.