epoch8 / airflow-exporter

Airflow plugin to export dag and task based metrics to Prometheus.

Documentation on the current use case and some thoughts #1

Closed smith-m closed 5 years ago

smith-m commented 5 years ago

This is really interesting - I saw your blog post linked on reddit.

What is your current Airflow setup? 1.9.0? What executor are you running?

If you aren't using the LocalExecutor, does this mean you would scrape metrics from the "web" service? Have you given any thought as to how you would scrape metrics from all components of a CeleryExecutor setup? (web, scheduler, worker pools) or the WIP KubernetesExecutor?

What are you using to visualize the metrics you report for prometheus? Grafana, WeaveScope, StackDriver, something else?

If you have a shareable dashboard or PromQL for dashboards, it could be a cool resource to evolve alongside this plugin.

We use Prometheus + Grafana for monitoring most things, but at the moment have only tapped the MySQL plugin for Grafana and a custom metrics plugin for data-based metrics. Part of the motivation for that was probably that reporting on metrics in the Airflow DB tied to dags immediately exposed a lot more dag and task information (like execution date, SLAs, plugin-specific data) and there were fewer unknowns (for example, I didn't know how simple it was to create new Flask endpoints with Airflow views - cool to see. I still don't know how to create endpoints on other Airflow service components that don't host a Flask endpoint). But seeing this does remind me there may be a route to simple and effective Prometheus monitoring with Airflow.
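For reference, here is a minimal sketch of the mechanism being discussed - an Airflow 1.x plugin registering an extra Flask blueprint - with hypothetical endpoint and class names, not the actual code of this exporter:

```python
from flask import Blueprint, Response
from airflow.plugins_manager import AirflowPlugin

metrics_bp = Blueprint("metrics_example", __name__)

@metrics_bp.route("/custom-metrics")
def custom_metrics():
    # A real exporter would render Prometheus text format built from
    # Airflow's metadata database; this just returns a static sample.
    return Response("airflow_example_metric 1\n", mimetype="text/plain")

class MetricsExamplePlugin(AirflowPlugin):
    # Declaring flask_blueprints is how an Airflow 1.x plugin adds
    # endpoints to the webserver.
    name = "metrics_example_plugin"
    flask_blueprints = [metrics_bp]
```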

A couple thoughts around prometheus for airflow and this plugin.

Imo a complete Prometheus monitoring solution for Airflow probably has at least 2 aspects: "service" monitoring (is x Airflow component running, what is its timing, has it restarted, etc. - this may differ depending on the component and your executor setup) and dag/task monitoring (are dags/tasks successful, do the tasks meet SLAs, how long do they take? Task retries? What is the timing of actual task execution relative to execution_date? When are tasks running?) - more or less what Airflow has tried to do internally, but you know - in Prometheus.

Given what I have seen with Prometheus, I think the former - "service" metrics - is the most typical and strongest Prometheus use case, more or less what Prometheus was originally intended for. It could also be difficult to implement in Airflow (there are many executors; how would you hook a metrics endpoint into each component?).

The latter - dag/task metrics, or perhaps more generally systemic metrics about the completion and timing of scheduled processes - is, I think, a less well documented use case, yet more intriguing if there is an elegant solution that integrates well with Prometheus. Imo this is probably best (not most simply) done with the Prometheus Pushgateway, where metrics from the task (which can be ephemeral) or the dag scheduler get pushed to the gateway instead of scraped - similar to recommendations for monitoring cron jobs with Prometheus (e.g. https://zpjiang.me/2017/07/10/Push-based-monitoring-for-Prometheus/). I also think it would be really valuable to share and understand visualizations based on this type of Prometheus metric - and, in the specific case of Airflow + Prometheus, how you would visually integrate workflow + task information.
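A minimal sketch of that push-based approach, assuming the prometheus_client library and a Pushgateway reachable at localhost:9091 (both assumptions, not part of this plugin): a task pushes a timestamp when it finishes, so even ephemeral workers leave a sample behind for Prometheus to scrape from the gateway.

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    "airflow_task_last_success_unixtime",
    "Unix timestamp of the task's last successful run",
    ["dag_id", "task_id"],
    registry=registry,
)

def report_success(dag_id, task_id, ts):
    # Called at the end of a task; the Pushgateway holds the sample
    # until Prometheus scrapes it, even after the worker exits.
    last_success.labels(dag_id=dag_id, task_id=task_id).set(ts)
    push_to_gateway("localhost:9091", job="airflow_tasks", registry=registry)
```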

A nitpick - I don't think your task and dag status metrics are 100% correct: multiple task instances can exist for the same task, and multiple dagruns can exist for the same dag. Depending on your Airflow configuration, you may have multiple copies of the same dag/task running simultaneously, so this doesn't necessarily generalize very well. Since, to my knowledge, Airflow doesn't natively define task and dag status - only task instance and dag run status - it might make sense to define what it means for a task or dag to be healthy and report on that.
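One possible way to pin down such a health definition - treating a dag as healthy if its most recent dag run succeeded - sketched against the Airflow 1.x ORM models. The session handling and the health definition itself are assumptions for illustration, not what the plugin currently does:

```python
from sqlalchemy import func
from airflow.models import DagRun
from airflow.utils.db import provide_session
from airflow.utils.state import State

@provide_session
def dag_health(session=None):
    # Latest execution_date per dag_id.
    latest = (
        session.query(DagRun.dag_id,
                      func.max(DagRun.execution_date).label("max_date"))
        .group_by(DagRun.dag_id)
        .subquery()
    )
    # Join back to read the state of that latest dag run.
    rows = (
        session.query(DagRun.dag_id, DagRun.state)
        .join(latest, (DagRun.dag_id == latest.c.dag_id)
                      & (DagRun.execution_date == latest.c.max_date))
        .all()
    )
    return {dag_id: state == State.SUCCESS for dag_id, state in rows}
```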

elephantum commented 5 years ago

Hi, @smith-m

I'll try to break down your text and answer each part.

What is your current Airflow setup? 1.9.0? What executor are you running?

We're using a modified version of puckel's Docker image with LocalExecutor for simplicity.

Have you given any thought as to how you would scrape metrics from all components of a CeleryExecutor setup? (web, scheduler, worker pools) or the WIP KubernetesExecutor?

It's a good question. Currently we care mostly about high-level metrics like dagrun/taskinstance status, but it would be nice to monitor both the scheduler and the executor in a similar fashion.

What are you using to visualize the metrics you report for prometheus? Grafana, WeaveScope, StackDriver, something else?

We use Grafana for visualization.

Part of the motivation for that was probably that reporting on metrics in the Airflow DB tied to dags immediately exposed a lot more dag and task information (like execution date, SLAs, plugin-specific data) and there were fewer unknowns (for example, I didn't know how simple it was to create new Flask endpoints with Airflow views - cool to see.

I see the development of this plugin going in the same direction. Our motivation to use a plugin instead of raw SQL queries is that it's a bit more future-proof, because it uses the same models and logic as the main Airflow app.
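To illustrate the "same models, not raw SQL" point, a hedged sketch of what a model-based query might look like; the function name is hypothetical and this is not the plugin's actual code:

```python
from sqlalchemy import func
from airflow.models import TaskInstance
from airflow.utils.db import provide_session

@provide_session
def task_instance_state_counts(session=None):
    # Grouping on ORM columns keeps working even if table details shift
    # between Airflow versions - the "future-proof" argument above.
    return (
        session.query(TaskInstance.dag_id, TaskInstance.state, func.count())
        .group_by(TaskInstance.dag_id, TaskInstance.state)
        .all()
    )
```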

A couple thoughts around prometheus for airflow and this plugin.

Totally agree; monitoring "service" metrics of executors is a complex issue (Kubernetes might be the easiest one, because of Prometheus's Kubernetes auto-discovery mechanisms).

A nitpick

Our current use case is to keep everything "always green", i.e. if some task in a dagrun fails, we fix what's necessary and "Clear" the task, allowing it to re-run. We repeat this cycle until the task completes successfully.

We tested our plugin in this setting, and we don't get any status="failed" task instances (we know it for sure because we alert on "failed" > 0) :)