apache / skywalking

APM, Application Performance Monitoring System
https://skywalking.apache.org/
Apache License 2.0

[Feature] Setup dashboard for Airflow monitoring #10341

Closed wu-sheng closed 1 year ago

wu-sheng commented 1 year ago

Description

This is an open issue for new contributors. Apache Airflow is a widely used workflow scheduler. We are encouraging someone new to the community to add a new level catalog(workflow) for Airflow.

Metrics

Airflow exposes metrics through StatsD, https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html. We could use StatsD + the OpenTelemetry StatsD receiver (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/statsdreceiver/README.md) + the OpenTelemetry OTEL exporter to ship the metrics to the SkyWalking OTEL receiver. Then use MAL to build metrics as well as a dashboard for those metrics. Note that a new layer and a new UI menu should be added.
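
For readers unfamiliar with StatsD, here is a minimal Python sketch of the plain-text line format (`name:value|type`) that a StatsD client emits and a StatsD receiver parses. The helper names `format_statsd` and `parse_statsd` are hypothetical, and the `airflow` prefix is an assumption based on Airflow's default StatsD prefix; sampling rates and tags are ignored.

```python
def format_statsd(name, value, metric_type, prefix="airflow"):
    """Build one StatsD line, e.g. 'airflow.ti_successes:1|c'.

    metric_type is a StatsD type code: 'c' (counter), 'g' (gauge), 'ms' (timer).
    The prefix default is an assumption, not taken from the thread.
    """
    return f"{prefix}.{name}:{value}|{metric_type}"


def parse_statsd(line):
    """Split a StatsD line into (name, value, type), ignoring any '|@rate' suffix."""
    name, rest = line.split(":", 1)
    value, metric_type = rest.split("|")[:2]
    return name, float(value), metric_type


line = format_statsd("ti_successes", 1, "c")
print(line)
print(parse_statsd("airflow.dag_processing.processes:-1|c"))
```

A counter line carries only the change since the last report, which is why such metrics arrive at the receiver as deltas rather than as absolute values.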

Logging

Airflow supports Fluentd to ship logs, https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/logging-architecture.html. SkyWalking already has Fluentd setup support, so we should be able to receive and catalog the logs.

Additionally, Task Logs seem an interesting thing. We could use LAL (Log Analysis Language) to group the logs by task name (or ID) by treating tasks as endpoints (a SkyWalking concept).

Use case

Add more observability for Airflow server.

Related issues

No response

kezhenxu94 commented 1 year ago

I think about Delta, it should be converted as a gauge, no matter if it is monotonic or not. @kezhenxu94 What do you think?

@mufiye Would you like to try this first as a separate PR?

Sounds good to me.

mufiye commented 1 year ago

Is this different from histogram?

The summary shows count, sum and quantile of data. But histogram shows count, sum and bucket of data. I think we can use summary data.

AFAIK, MAL supports histogram, which could access counter and bucket to get avg or percentile. But summary is not supported, and it has less precision. If histogram works, we should never choose summary.
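
To make the histogram point concrete, here is a minimal Python sketch of how an average and an approximate percentile can be derived from bucket data. It is not MAL code; the function names and the `(upper_bound, count)` bucket layout are assumptions chosen for illustration.

```python
def average(total_sum, total_count):
    """Average from a histogram's sum and count fields."""
    return total_sum / total_count if total_count else None


def percentile_from_buckets(buckets, q):
    """Approximate the q-quantile (0 < q <= 1) from non-cumulative buckets.

    buckets: list of (upper_bound, count) sorted by upper_bound.
    Returns the upper bound of the bucket that contains the quantile,
    so precision is limited by the bucket width.
    """
    total = sum(count for _, count in buckets)
    if total == 0:
        return None
    target = q * total
    running = 0
    for upper, count in buckets:
        running += count
        if running >= target:
            return upper
    return buckets[-1][0]


buckets = [(10, 5), (50, 3), (100, 2)]  # e.g. task durations in ms
print(average(300, 10))
print(percentile_from_buckets(buckets, 0.5))
print(percentile_from_buckets(buckets, 0.9))
```

A summary, by contrast, ships quantiles computed in advance on the client, so they cannot be re-aggregated across instances or recomputed for a different percentile.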

The statsd receiver in the OTel Collector will transform the timer metric in Airflow to exponentialHistogram as the histogram type. But our SkyWalking otel-receiver OpenTelemetryMetricRequestProcessor#adaptMetrics cannot support this type. Should I make this exponentialHistogram type be supported, or use the summary type for the timer metric?

wu-sheng commented 1 year ago

Could you share what is exponentialHistogram? What does exponential mean?

mufiye commented 1 year ago

It is one kind of metric data in the OpenTelemetry protocol. As for what "exponential" means, sorry, I cannot explain it clearly; this is the doc of exponentialHistogram.

message Metric {
  reserved 4, 6, 8;

  string name = 1;

  string description = 2;

  string unit = 3;

  oneof data {
    Gauge gauge = 5;
    Sum sum = 7;
    Histogram histogram = 9;
    ExponentialHistogram exponential_histogram = 10;
    Summary summary = 11;
  }
}

wu-sheng commented 1 year ago

I just read https://opentelemetry.io/docs/reference/specification/metrics/data-model/#exponentialhistogram, it seems it is just the typical Prometheus Histogram setup in practice.

Back to your question

Should I make this exponentialHistogram type be supported or use summary type for timer metric?

We should transfer this to our histogram, I think. You need to get the bucket transfer correctly from exponentialHistogram to histogram.
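
As a rough illustration of the bucket transfer, here is a minimal Python sketch based on the OpenTelemetry data model, where `base = 2 ** (2 ** -scale)` and the bucket at index `i` covers values in `(base**i, base**(i+1)]`. The function name and the `(upper_bound, count)` output layout are assumptions, not SkyWalking's actual histogram API, and negative buckets are ignored on the assumption that timer values are non-negative.

```python
def exp_histogram_to_explicit(scale, offset, bucket_counts, zero_count=0):
    """Convert the positive buckets of an exponential histogram
    into explicit (upper_bound, count) pairs.

    scale:         resolution parameter; base = 2 ** (2 ** -scale)
    offset:        index of the first entry in bucket_counts
    bucket_counts: counts for consecutive bucket indexes starting at offset
    zero_count:    count of values recorded as zero
    """
    base = 2.0 ** (2.0 ** -scale)
    buckets = []
    if zero_count:
        buckets.append((0.0, zero_count))
    for j, count in enumerate(bucket_counts):
        # Bucket index (offset + j) covers (base**index, base**(index + 1)].
        upper_bound = base ** (offset + j + 1)
        buckets.append((upper_bound, count))
    return buckets


# scale=0 gives base 2, so the buckets are (1, 2], (2, 4], (4, 8].
print(exp_histogram_to_explicit(0, 0, [1, 2, 3]))
```

So "exponential" refers to the bucket boundaries growing by a constant factor (the base) rather than being listed explicitly, and a higher scale means a base closer to 1 and therefore finer buckets.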

mufiye commented 1 year ago

I just read https://opentelemetry.io/docs/reference/specification/metrics/data-model/#exponentialhistogram, it seems it is just the typical Prometheus Histogram setup in practice.

Back to your question

Should I make this exponentialHistogram type be supported or use summary type for timer metric?

We should transfer this to our histogram, I think. You need to get the bucket transfer correctly from exponentialHistogram to histogram.

I will try to do it later. And there are some other essential points that need to be discussed.

  1. There are some labels in Airflow metric names (for dag, pool, operator_name, and job), which represent the components in Airflow. Which level should I classify these components to? In Airflow concepts, a dag contains lots of tasks to be run, a pool is where tasks run, operator_name is one kind of task, and a job, I think, is a larger concept than a task because it also includes the scheduler job. Should we classify all these components as endpoints?
  2. As before, we transform the "delta counter metric" to a "gauge metric", so we can represent a monotonic delta metric as the value for one specific period of time. For example, the ti_successes metric describes "Overall task instances successes", so we can read the gauge metric as "the successful task instances in this period". But for some nonmonotonic delta counter metrics, such as "dag_processing.processes", which means the number of currently running DAG parsing processes, the value can be negative in one period in my test. How should we handle this kind of metric? Maybe just show them originally, to show the trend of the value.

wu-sheng commented 1 year ago

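About <1>, the easiest way is,

  • pool seems a running env, we could catalog it as an instance, naming through pool:xxx. Is the pool shared among tasks?
  • the job, dag, operator_name could be various endpoints as running processes. We could name them by following /job/xxxx, /dag/yyy. Does this make sense?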

wu-sheng commented 1 year ago

But for some nonmonotonic delta counter metric, such as "dag_processing.processes" which means number of currently running DAG parsing processes, it can be negative in one period in my test. How should we handle this kind of metric?

How could the process count be negative? What does "originally" mean? The number of currently running DAG parsing processes should be 0 or positive logically.

mufiye commented 1 year ago

But for some nonmonotonic delta counter metric, such as "dag_processing.processes" which means number of currently running DAG parsing processes, it can be negative in one period in my test. How should we handle this kind of metric?

How could the process count be negative? What does "originally" mean? The number of currently running DAG parsing processes should be 0 or positive logically.

Because the number of currently running DAG parsing processes is the sum of all reported values, a single delta value can be negative. By "originally" I mean we just show the gauge values as they are, whether they are negative or positive.

wu-sheng commented 1 year ago

Then, in this case, it seems we never get the absolute value, is it? Does it report absolute time somehow?

mufiye commented 1 year ago

Then, in this case, it seems we never get the absolute value, is it? Does it report absolute time somehow?

sorry, I can't get it, could you explain your perspective more?

wu-sheng commented 1 year ago

If a time-series value is delta, let's say (-5, 4, 3, 1, -4), unless we know the initial value is 10 (or any value), we cannot know the exact value of the process number (to use your example).

So, do we have that number or do we have the total of processes? If there isn't, we only could see the trend.
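
A minimal Python sketch of this point, using the delta series from the example: summing deltas alone gives only a relative trend, while the absolute process count needs a known starting value. The initial value 10 is the hypothetical one from the example, not something the metric reports.

```python
from itertools import accumulate

deltas = [-5, 4, 3, 1, -4]

# Without an initial value: only the shape of the curve is known.
trend = list(accumulate(deltas))

# With a known initial value (assumed to be 10 here): absolute counts.
absolute = list(accumulate(deltas, initial=10))

print(trend)     # [-5, -1, 2, 3, -1]
print(absolute)  # [10, 5, 9, 12, 13, 9]
```

The two series differ by a constant offset, so a receiver that starts listening midway, or one that drops a single report, can never recover the absolute value from deltas alone.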

mufiye commented 1 year ago

If a time-series value is delta, let's say (-5, 4, 3, 1, -4), unless we know the initial value is 10 (or any value), we cannot know the exact value of the process number (to use your example).

So, do we have that number or do we have the total of processes? If there isn't, we only could see the trend.

I think we can't get the total number of processes unless we add every delta value.

wu-sheng commented 1 year ago

There is no all concept. That is my point on delta issue, we never are able to find out the initial value.

Could you check how this works on statsd? Such as check and try https://github.com/apache/airflow/pull/29449?

mufiye commented 1 year ago

There is no all concept. That is my point on delta issue, we never are able to find out the initial value.

Could you check how this works on statsd? Such as check and try apache/airflow#29449?

You mean to check how Airflow collects metrics and sends out statsd data?

wu-sheng commented 1 year ago

I think about how they visualize this type, so I think we could try this on Prometheus/Grafana. AFAIK, we only could show this value as a trend, I don't know whether there is something we missed.

mufiye commented 1 year ago

I think about how they visualize this type, so I think we could try this on Prometheus/Grafana. AFAIK, we only could show this value as a trend, I don't know whether there is something we missed.

Ok, I get it. I will check how they use their metrics.

mufiye commented 1 year ago

About <1>, the easiest way is,

  • pool seems a running env, we could catalog it as an instance, naming through pool:xxx. Is the pool shared among tasks?
  • the job, dag, operator_name could be various endpoints as running processes. We could name them by following /job/xxxx, /dag/yyy. Does this make sense?

I think tasks and pools have an inclusion relation, but the others do not. Furthermore, from the metric name we cannot tell which task is in which pool. Maybe putting all these components at the same level is the only way.
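
The naming scheme proposed in the quoted bullets could be sketched as follows in Python. This is only an illustration of the convention (pool as an instance named pool:xxx, the rest as endpoints named /kind/xxx); the function `entity_for` and the returned tuple layout are hypothetical and do not correspond to any SkyWalking API.

```python
def entity_for(kind, name):
    """Map an Airflow component to a (SkyWalking concept, entity name) pair.

    kind: one of 'pool', 'job', 'dag', 'operator_name' (taken from the thread)
    name: the component's name as it appears in the metric name
    """
    if kind == "pool":
        return ("instance", f"pool:{name}")
    if kind in ("job", "dag", "operator_name"):
        return ("endpoint", f"/{kind}/{name}")
    raise ValueError(f"unknown component kind: {kind}")


print(entity_for("pool", "default_pool"))
print(entity_for("dag", "example_dag"))
```

Putting every component at the same level, as suggested above, would amount to returning "endpoint" for pools as well.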

wu-sheng commented 1 year ago

I don't know as much as you do. Pick a way you prefer, and we could discuss details when dashboards are out. Adjusting these is not hard. Don't worry. A PR always takes time.

mufiye commented 1 year ago

I think about how they visualize this type, so I think we could try this on Prometheus/Grafana. AFAIK, we only could show this value as a trend, I don't know whether there is something we missed.

I think it is because of the counter definition in Prometheus metrics. A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.
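
To show why that definition matters, here is a simplified Python sketch of how a total increase can be computed from cumulative counter samples when any drop is treated as a reset to zero. It is an illustration of the idea only, not Prometheus's actual implementation, and the function name is hypothetical.

```python
def counter_increase(samples):
    """Total increase over a series of cumulative counter samples.

    A sample lower than its predecessor is treated as a restart from zero,
    so the new value itself counts as the increase since the reset.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # counter was reset; it restarted from zero
    return total


print(counter_increase([1, 3, 6, 2, 4]))  # 2 + 3 + 2 + 2 = 9
```

This only works for values that never legitimately decrease. A metric such as a running-process count, which goes down in normal operation, would have every decrease misread as a reset.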

wu-sheng commented 1 year ago

If Prometheus could identify/use it as a counter, why can't we? We converted it to delta because it isn't cumulative. What is missed here?

mufiye commented 1 year ago

If Prometheus could identify/use it as a counter, why can't we? We converted it to delta because it isn't cumulative. What is missed here?

In my opinion, I think they always do the accumulation for the counter metrics whether they have been stored or not. But we can not do the accumulation for metrics that have been stored. I have not verified how Prometheus processes counter metrics because I have not learned Go yet. It's my future plan.

wu-sheng commented 1 year ago

In my opinion, I think they always do the accumulation for the counter metrics whether they have been stored or not. But we can not do the accumulation for metrics that have been stored.

If you could push a counter to OAP, we could work on that. Your previous context was that there is only a delta.

mufiye commented 1 year ago

In my opinion, I think they always do the accumulation for the counter metrics whether they have been stored or not. But we can not do the accumulation for metrics that have been stored.

If you could push a counter to OAP, we could work on that. Your previous context was that there is only a delta.

I think I can only push a "delta type counter" to the OAP through the OTel Collector. Maybe we can support accumulating the "delta type counter"? It may be complicated, but I can try to do it. Or I can just show the data trend with the "delta type counter" data and build the dashboard first.

wu-sheng commented 1 year ago

I think you need to check what a delta counter is. A counter is increasing or reset. How does delta apply to this case?

mufiye commented 1 year ago

I think you need to check what a delta counter is. A counter is increasing or reset. How does delta apply to this case?

I think this dag_processing.processes does not meet the Prometheus counter definition: it can decrease, and I'm sure because I tested it. This is the PR I found.

wu-sheng commented 1 year ago

That is my point of asking. Only focus on this metric, whether they show, how they show.

mufiye commented 1 year ago

That is my point of asking. Only focus on this metric, whether they show, how they show.

Ok, I get it.

wu-sheng commented 1 year ago

@mufiye Any update or block?

mufiye commented 1 year ago

@mufiye Any update or block?

I think I should block it here temporarily. I am preparing to find an internship now and have had no time to continue this issue in the last two weeks. You can unassign me from this issue. I think the next step is to add one MAL function to the meter analyzer, then write the MAL rule and build the dashboard. If anyone takes over this task, I can also provide support such as the config file of the OTel Collector. I'm sorry for this situation.

wu-sheng commented 1 year ago

Got it. Thanks for the feedback. Take your time for your own interest. That always matters primarily.