apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Metrics Improvement Project #42881

Open dannyl1u opened 1 month ago

dannyl1u commented 1 month ago

Description

From @ferruzzi:

Currently when you add a new metric to the codebase, you must also manually update the docs page. The docs page inevitably gets out of date and misses some details. We want an automated system to generate the docs page based on the actual metrics. There are also known instances where the same metric is being created and emitted in more than one place, causing duplicate data. These will have to be fixed manually and an automated check might possibly (stretch goal?) include checking for same or "too similar" names while collecting the names for the docs page.

Phase 1

Situation: We support multiple different metrics backends [0]. The two main ones are StatsD and OpenTelemetry. This is managed through an interface class [1] which is implemented for each backend (examples: StatsD [2] and OTel [3]). StatsD was the only supported backend well into Airflow 2.x, and the entire codebase was designed with StatsD in mind, so it was a good chunk of work to abstract it out and there are a few remaining tasks to perfect the new implementation.

Task 1: StatsD has a name length limit of around 300 characters. OTel limits names to 34 characters, but allows tagging. Our temporary solution was to emit almost everything twice: once in the long format for StatsD and again in the short format with tags for OTel. We also had to add code [4] to make sure the name is safe for OTel, and other hacks to make it work. The first task in this project is to understand the difference in how the two implementations handle their names and then add a "get_name" method to the interface: def get_name(metric_name: str, tags: dict[str, str]). The statsd_logger implementation [2] will concatenate the tags onto the name, and the OTel implementation will just return the name. Once that is implemented, it can be used in the various emit methods (incr, decr, etc.) instead of all the name validation code; then search the code for places where we are emitting things more than once and clean them up.

Example: You can see an example in local_task_job_runner [5]. We emit local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code> for StatsD, but that results in a name too long for OTel, so we also emit local_task_job.task_exit, and the name validation method [4] in the OTel implementation catches the one that is too long and just swallows it. What we should do instead is pass incr() the name and the tags and let StatsD and OTel handle them accordingly.

[0] https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#metric-descriptions
[1] https://github.com/apache/airflow/blob/main/airflow/metrics/base_stats_logger.py
[2] https://github.com/apache/airflow/blob/main/airflow/metrics/statsd_logger.py
[3] https://github.com/apache/airflow/blob/main/airflow/metrics/otel_logger.py
[4] https://github.com/apache/airflow/blob/main/airflow/metrics/otel_logger.py#L128
[5] https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job_runner.py#L352
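As a rough sketch of what the proposed interface change could look like (class names and signatures here are illustrative, not the actual Airflow classes; only get_name and its described behavior come from the text above):

```python
# Hypothetical sketch, not the real base_stats_logger/statsd_logger/otel_logger classes.
from __future__ import annotations


class BaseStatsLoggerSketch:
    """Shared interface: each backend decides how a metric name and its tags combine."""

    def get_name(self, metric_name: str, tags: dict[str, str] | None = None) -> str:
        raise NotImplementedError


class StatsdNameSketch(BaseStatsLoggerSketch):
    def get_name(self, metric_name: str, tags: dict[str, str] | None = None) -> str:
        # StatsD tolerates long names and has no native tags, so fold the tag values
        # into the dotted name, e.g. "local_task_job.task_exit.123.my_dag.my_task.0".
        if not tags:
            return metric_name
        return ".".join([metric_name, *(str(v) for v in tags.values())])


class OtelNameSketch(BaseStatsLoggerSketch):
    def get_name(self, metric_name: str, tags: dict[str, str] | None = None) -> str:
        # OTel keeps the short name; the tags are emitted separately as attributes,
        # so no length validation or truncation is needed here.
        return metric_name


if __name__ == "__main__":
    tags = {"job_id": "123", "dag_id": "my_dag", "task_id": "my_task", "return_code": "0"}
    print(StatsdNameSketch().get_name("local_task_job.task_exit", tags))
    print(OtelNameSketch().get_name("local_task_job.task_exit", tags))
```

The emit methods (incr, decr, etc.) could then call get_name(metric_name, tags) once per emit, replacing the per-backend name validation and the duplicate emits.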

Use case/motivation

Same as the Description above (from @ferruzzi).

Related issues

No response

Are you willing to submit a PR?

Code of Conduct

boring-cyborg[bot] commented 1 month ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

dannyl1u commented 1 month ago

Hello @potiuk @howardyoo @kaxil, I'm planning to work on this project and just wanted to confirm that this is indeed something we want for Airflow 3.0 before I begin implementing.

ashb commented 1 month ago

This sounds great. Could you sketch out the sort of API you are thinking of, and give an example of its use from somewhere in the scheduler job, please?

dannyl1u commented 1 month ago

Hi @ashb, could you clarify what you mean by the use in the scheduler job?

The first step is refactoring the metrics code so that https://github.com/apache/airflow/blob/main/airflow/metrics/base_stats_logger.py defines a get_name method, which https://github.com/apache/airflow/blob/main/airflow/metrics/datadog_logger.py, https://github.com/apache/airflow/blob/main/airflow/metrics/otel_logger.py, https://github.com/apache/airflow/blob/main/airflow/metrics/statsd_logger.py, etc. will override to apply their own naming conventions. This improves the code by removing the need for per-implementation name validators such as https://github.com/apache/airflow/blob/main/airflow/metrics/otel_logger.py#L128.
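To make that concrete, here is a hedged before/after sketch of a call site under that API, reusing the local_task_job example from the issue description. The variable values are illustrative placeholders; Stats.incr already accepts a tags argument today, and the new part would be get_name deciding how each backend uses it:

```python
from airflow.stats import Stats  # existing Stats facade

# Illustrative placeholders standing in for values the job runner already has in scope.
job_id, dag_id, task_id, return_code = "123", "example_dag", "example_task", 0

# Before: emitted twice so each backend gets a name it can handle
# (long dotted form for StatsD, short OTel-safe form for OTel).
Stats.incr(f"local_task_job.task_exit.{job_id}.{dag_id}.{task_id}.{return_code}")
Stats.incr("local_task_job.task_exit")

# After: a single emit. StatsD's get_name() would fold the tag values into the
# long dotted name, while OTel's get_name() would keep the short name and pass
# the tags through as attributes.
Stats.incr(
    "local_task_job.task_exit",
    tags={
        "job_id": job_id,
        "dag_id": dag_id,
        "task_id": task_id,
        "return_code": str(return_code),
    },
)
```

The same pattern would apply to whichever call sites in scheduler_job_runner.py get picked once the interface change lands.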

cc'ing @ferruzzi in case I'm missing something

Then, once the above is done, we will plan the next step to build an automated system to generate docs based on the actual metrics.
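For that docs step, a very rough sketch of the kind of collection pass that could drive an auto-generated metrics page, with the near-duplicate check from the description as a stretch goal. Everything here (paths, method names, the similarity threshold) is an assumption for illustration, not an agreed design:

```python
# Hypothetical sketch: walk the source tree, collect literal metric names passed to
# Stats.incr/decr/gauge/timing, and flag suspiciously similar pairs. Dynamically built
# names (f-strings such as f"ti.finish.{dag_id}...") would need ast.JoinedStr handling.
import ast
import difflib
import itertools
from pathlib import Path

STATS_METHODS = {"incr", "decr", "gauge", "timing"}  # assumed set of emit methods


def collect_metric_names(root: Path) -> set[str]:
    names: set[str] = set()
    for path in root.rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in STATS_METHODS
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)
            ):
                names.add(node.args[0].value)
    return names


def too_similar(names: set[str], threshold: float = 0.9) -> list[tuple[str, str]]:
    # Stretch goal from the description: warn about near-duplicate metric names.
    return [
        (a, b)
        for a, b in itertools.combinations(sorted(names), 2)
        if difflib.SequenceMatcher(None, a, b).ratio() >= threshold
    ]


if __name__ == "__main__":
    metrics = collect_metric_names(Path("airflow"))  # assumed source root
    print(f"Collected {len(metrics)} metric names")
    for a, b in too_similar(metrics):
        print(f"Possibly duplicated: {a!r} vs {b!r}")
```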

cc @arshiazr

potiuk commented 1 month ago

@ferruzzi @howardyoo -> I think you two should be quite a bit involved in the review and ideas here.

Initially, when we implemented OpenTelemetry metrics, we thought we could emit the legacy StatsD metrics through the OpenTelemetry interface (basically implement or use some kind of OpenTelemetry -> StatsD bridge), but because of some implementation details it turned out to be impossible (or at least difficult).

However, that can still be explored; maybe it is still a possibility? Implementing our own interface that wraps both the OpenTelemetry Metrics and StatsD ones is of course an option, but (at least intuitively, without knowing all the details) it could be that the OpenTelemetry API could be used to emit the legacy StatsD events in a "mostly" compatible way.

And implementing it in Airflow 3 also gives us an opportunity for the "mostly" part. If we can get 90% of the backwards-compatible StatsD metrics in place, with only a "few", "less important" metrics changed incompatibly, maybe that is the way to go?
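For reference, a minimal sketch of the "interface that wraps both" option, under the assumption that each backend already implements the shared metrics interface; every name below is made up for illustration:

```python
# Hypothetical sketch: a thin Stats-like facade with a single emit API that forwards
# to whichever backends are configured (just StatsD, just OTel, or both during a
# migration), leaving each backend to apply its own naming/tagging rules.
from __future__ import annotations

from typing import Protocol


class MetricsBackend(Protocol):
    def incr(self, metric_name: str, count: int = 1, rate: float = 1.0,
             *, tags: dict[str, str] | None = None) -> None: ...


class WrappingStatsFacade:
    def __init__(self, backends: list[MetricsBackend]) -> None:
        self.backends = backends

    def incr(self, metric_name: str, count: int = 1, rate: float = 1.0,
             *, tags: dict[str, str] | None = None) -> None:
        for backend in self.backends:
            backend.incr(metric_name, count, rate, tags=tags)
```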

howardyoo commented 1 month ago

Hi Jarek (and Dennis),

Yes, in the beginning we hoped that the way Airflow used StatsD could also be applied to Airflow OTel (having the classic StatsD metric name carry much of the context, with its long length). Now it looks like we may have to do it the alternate way, which is a better way for the future (because, to be fair, the way StatsD metrics were named inherently had problems). For example, the long-name form:

    for state in State.task_states:
        Stats.incr(
            f"ti.finish.{ti.task.dag_id}.{ti.task.task_id}.{state}",
            count=0,
            tags=ti.stats_tags,
        )

Same metric with tagging:

    Stats.incr(
        "ti.finish",
        count=0,
        tags={**ti.stats_tags, "state": str(state)},
    )

As for the naming problem, I do believe that Dennis has added code to support both versions, so even though we get the warning on the OTel side, we are not losing too much information (see the example code above). But I would say we'll be getting some unwanted trimmed metrics flowing into OTel.

The only problem in that case would be these unwanted and unnecessary metrics flowing into the OTel SDK and eventually getting trimmed. However, I believe we should be able to fix that by applying a processor to filter them out at the collector level (that's why I believe having a collector in the architecture is important: it can act as another processing layer, separated from the source of telemetry).

Those are my two cents for now... Any thoughts, Dennis? Maybe I'm missing something.

ashb commented 1 month ago

Scheduler job = https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job_runner.py

ArshiaZr commented 1 month ago

I have just raised a new PR (https://github.com/apache/airflow/pull/43018). I'd be grateful if you could review it.

cc @ferruzzi