apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

POC on configuring and running Airflow with basic OT integration (selected metrics) #19972

Closed. potiuk closed this issue 2 years ago.

potiuk commented 2 years ago

Includes:

subkanthi commented 2 years ago

I was thinking of also supporting Prometheus: we could push to the Prometheus pushgateway, even though it's not recommended, and it would require minimal changes. At a high level, looking at the Stats class, we could probably do it like the Datadog statsd implementation.

https://prometheus.io/docs/practices/pushing/

This might be quite useful in the short term, as you don't need an additional statsd -> Prometheus exporter process running, and we can connect to an existing Prometheus instance (including the managed AWS and Google offerings).
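
The "minimal changes to the Stats class" idea above could be sketched roughly like this. This is a toy illustration only: `PrometheusPushStats`, `push_fn`, and the metric names are all hypothetical, and neither Airflow's real Stats class nor `prometheus_client` is used here.

```python
# Hypothetical sketch: a Stats-like adapter whose counter/gauge methods
# mirror the statsd interface, but which hands metrics in one batch to a
# pushgateway-style callable instead of emitting statsd packets.

class PrometheusPushStats:
    """Collects statsd-style counters/gauges and pushes them as a batch."""

    def __init__(self, push_fn):
        # push_fn would wrap an HTTP PUT to a pushgateway in a real setup
        self._push_fn = push_fn
        self._counters = {}
        self._gauges = {}

    def incr(self, name, count=1):
        # statsd-style increment; Prometheus counters are conceptually the same
        self._counters[name] = self._counters.get(name, 0) + count

    def gauge(self, name, value):
        self._gauges[name] = value

    def flush(self):
        # Push the whole batch, as the pushgateway model expects
        self._push_fn({"counters": dict(self._counters),
                       "gauges": dict(self._gauges)})


# Usage with a fake push function that just records what would be sent:
sent = []
stats = PrometheusPushStats(sent.append)
stats.incr("ti_successes")
stats.incr("ti_successes", 2)
stats.gauge("dag_processing.total_parse_time", 4.2)
stats.flush()
```

The point of the sketch is that only the transport changes; callers keep using the familiar `incr`/`gauge` surface.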

potiuk commented 2 years ago

In the long term, this whole project is about seeing whether we can get rid of statsd/datadog entirely and use only OpenTelemetry. Once we have OpenTelemetry support, we should not need a Prometheus-specific integration.

We can use https://open-telemetry.github.io/opentelemetry-python/exporter/prometheus/prometheus.html to export all telemetry to Prometheus.

Why would we want a separate Prometheus-specific implementation if our goal is to be monitoring-system agnostic (which is basically what OpenTelemetry promises)?
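
To illustrate the "monitoring-system agnostic" point: one instrumentation surface, with exporters swapped in at configuration time. This is not the real OpenTelemetry API; all class and function names below are invented for the sketch.

```python
# Toy model of exporter-agnostic instrumentation: code records metrics once,
# and any number of backend exporters receive them without backend-specific
# instrumentation code.

class Counter:
    def __init__(self, name):
        self.name = name
        self.value = 0

    def add(self, amount=1):
        self.value += amount


class Meter:
    """One instrumentation surface; exporters are pluggable at config time."""

    def __init__(self, exporters):
        self._exporters = exporters
        self._counters = {}

    def create_counter(self, name):
        return self._counters.setdefault(name, Counter(name))

    def export_all(self):
        for exporter in self._exporters:
            for counter in self._counters.values():
                exporter(counter.name, counter.value)


# Two "backends" receive the same metric without any backend-specific code
# at the instrumentation site:
prometheus_lines, statsd_lines = [], []
meter = Meter([
    lambda n, v: prometheus_lines.append(f'{n.replace(".", "_")} {v}'),  # Prometheus text format
    lambda n, v: statsd_lines.append(f"{n}:{v}|c"),                      # statsd wire format
])
meter.create_counter("airflow.ti_successes").add(3)
meter.export_all()
```

With this shape, adding a new monitoring backend means adding an exporter, not touching the instrumentation code, which is why a dedicated Prometheus path becomes redundant.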

subkanthi commented 2 years ago

I'm definitely not disagreeing with the long-term goal of supporting OpenTelemetry; I was just suggesting a short-term initiative of supporting Prometheus with our current setup: minimal changes to the Stats class, keeping the push-based metrics architecture. The counter and gauge concepts are the same in Prometheus and statsd.

potiuk commented 2 years ago

Whatever we come up with now, we will have to support under our backwards-compatibility promise in the future. I am pretty sure we do not want short-term Prometheus support now if we have a long-term plan to support it differently.

Currently there are statsd exporters people can use (and do use) to export statsd metrics to Prometheus. In the future (if the POC works and the AIP is accepted) we will have an opentelemetry exporter that people can use, and we will support both OpenTelemetry and statsd for backwards compatibility.

What purpose would it serve to add yet another option to Airflow, one that again would have to be deprecated and dropped, especially when there is already a working method to get the metrics into Prometheus via an exporter?

What problem would it solve?
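
For context, the statsd -> Prometheus exporter path mentioned above is typically configured with a mapping file that translates statsd metric names into Prometheus metrics with labels. A sketch of such a mapping (the metric names are illustrative examples, and the exact syntax should be checked against the statsd_exporter documentation):

```yaml
# statsd_exporter mapping file (illustrative)
mappings:
  - match: "airflow.operator_successes_*"
    name: "airflow_operator_successes"
    labels:
      operator: "$1"
  - match: "airflow.dag.*.*.duration"
    name: "airflow_task_duration"
    labels:
      dag_id: "$1"
      task_id: "$2"
```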

Melodie97 commented 2 years ago

Currently looking at how to go about this, will keep everyone updated

potiuk commented 2 years ago

Also @Melodie97 - see #20053 - it turned out that the idea of @subkanthi was about adding a provider that exposes the PushGateway API, so it is indeed different from our proposal.

In short, what @subkanthi proposed is a separate Prometheus provider with a Hook/Operator that Airflow users could use in a DAG, conceptually like this:

@task
def run_bigquery_job():
    hook_bq = BigQueryHook()
    bq_result = hook_bq.run_the_job()
    hook_prometheus = PrometheusHook()
    bq_job_metrics = get_metrics_for_bq_job(bq_result)
    hook_prometheus.push_metrics_to_gateway(bq_job_metrics)

Is that the right assessment, @subkanthi?

subkanthi commented 2 years ago

Yes, that's correct, @potiuk.

cherusk commented 2 years ago

Pardon, but from the formulation of the code examples and the overall issue, I'm not certain whether the eventual "package" will contain a Prometheus operator that can reach the Prometheus API directly. I'd expect something along the lines of a wrapper around SimpleHttpOperator as a minimum viable technical approach.

I don't know why this needs to be confined to the Prometheus push gateway, which is a very specific way of interacting with Prometheus.

Thanks for clarifying!

Relevant for: https://github.com/apache/airflow/issues/11549

potiuk commented 2 years ago

@cherusk We should completely separate #11549 from this discussion.

It has nothing to do with the instrumentation/OT integration that we are working on with @Melodie97. #11549 is a completely different issue (and quite misleading in the context of the OpenTelemetry integration).

What is described in #11549 is a dedicated Prometheus operator that would use an existing push gateway set up outside of Airflow, allowing "DAG writers" to push any metrics they want.

What we are describing here is an open-telemetry integration with Airflow as a platform. The OT integration we want to add will (eventually) give the Airflow "admin" a way to set it up so that it provides various useful metrics about the Airflow platform as a whole, statistics of DAG execution, etc. All this independently of the metrics collector and visualisation: just gathering the metrics with OpenTelemetry and letting the admin user configure any exporter they want.

So both the scope and the audience of those two are different.