wu-sheng closed this issue 1 year ago.
I think, for Delta, it should be converted to a gauge, no matter whether it is monotonic or not. @kezhenxu94 What do you think? @mufiye Would you like to try this first as a separate PR?
Sounds good to me.
Is this different from histogram?
A summary shows the count, sum, and quantiles of the data, while a histogram shows the count, sum, and buckets. I think we can use summary data.
AFAIK, MAL supports histogram, which can access the count and buckets to get the avg or a percentile. But summary is not supported, and it has less precision. If histogram works, we should never choose summary.
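For illustration, a MAL rule consuming a histogram-typed metric might look like the sketch below. The metric name, labels, and rule layout here are hypothetical; `histogram()` and `histogram_percentile()` are the MAL functions that consume bucketed data, so check the actual metric names against what the collector exports.

```yaml
# Hypothetical MAL rule fragment: airflow_task_duration and its labels
# are made up for illustration. histogram() re-assembles the buckets
# and histogram_percentile() computes percentiles from them.
metrics:
  - name: airflow_task_duration_percentile
    exp: airflow_task_duration.sum(['le', 'service']).histogram().histogram_percentile([50,90,99])
```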
The statsd receiver in the otel collector will transform the timer metric in Airflow to `exponentialHistogram` as the histogram type. But our SkyWalking otel receiver's `OpenTelemetryMetricRequestProcessor#adaptMetrics` cannot support this type. Should I add support for the `exponentialHistogram` type, or use the summary type for the timer metric?
Could you share what `exponentialHistogram` is? What does `exponential` mean?
It is one kind of data in the OpenTelemetry protocol. As for "exponential", sorry, I cannot explain it clearly; it is in the doc of `exponentialHistogram`.
```proto
message Metric {
  reserved 4, 6, 8;

  string name = 1;
  string description = 2;
  string unit = 3;

  oneof data {
    Gauge gauge = 5;
    Sum sum = 7;
    Histogram histogram = 9;
    ExponentialHistogram exponential_histogram = 10;
    Summary summary = 11;
  }
}
```
I just read https://opentelemetry.io/docs/reference/specification/metrics/data-model/#exponentialhistogram; it seems it is just the typical Prometheus histogram setup in practice.

Back to your question:

> Should I make this exponentialHistogram type be supported or use summary type for timer metric?

We should transfer this to our histogram, I think. You need to get the bucket transfer from `exponentialHistogram` to `histogram` correct.
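For reference, the bucket boundaries of an OTLP exponential histogram are fully determined by its `scale` field, which is the essence of the conversion being discussed. A minimal sketch (function names are mine, the formula follows the OTLP data model):

```python
def exp_bucket_bounds(scale: int, index: int) -> tuple[float, float]:
    """Boundaries of exponential-histogram bucket `index` at `scale`.

    Per the OTLP metrics data model, base = 2 ** (2 ** -scale) and
    bucket i covers the interval (base**i, base**(i + 1)].
    """
    base = 2.0 ** (2.0 ** -scale)
    return (base ** index, base ** (index + 1))

# At scale 0 the base is 2, so buckets are powers of two.
print(exp_bucket_bounds(scale=0, index=3))   # (8.0, 16.0)

# Higher scale -> finer buckets: at scale 2, base = 2**(1/4) ~ 1.189.
print(2.0 ** (2.0 ** -2))
```

Converting to a fixed-boundary histogram would mean mapping each exponential bucket's count onto whichever explicit bucket its interval falls into, which loses some precision whenever the boundaries do not line up.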
I will try to do it later. And there are some other essential points that need to be discussed:

<1> `dag` contains lots of tasks to be run, `pool` is where tasks run in, `operator_name` is one kind of task, and `job`, I think, is a larger concept than a task because it also includes the scheduler job. Should we classify all these components as endpoints?

<2> The `ti_successes` metric describes "Overall task instances successes", so we can think of the gauge metric as "the successful task instances in this period". But some non-monotonic delta counter metrics, such as `dag_processing.processes`, which means the number of currently running DAG parsing processes, can be negative in one period in my test. How should we handle this kind of metric? Maybe just show them as they are, to show the trend of the value.

About <1>, the easiest way is:

- `pool` seems to be a running env; we could catalog it as an instance, naming it `pool:xxx`. Is the pool shared among tasks?
- `job`, `dag`, and `operation_name` could be various endpoints as running processes. We could name them following `/job/xxxx`, `/dag/yyy`. Does this make sense?

> But some non-monotonic delta counter metrics, such as `dag_processing.processes`, which means the number of currently running DAG parsing processes, can be negative in one period in my test. How should we handle this kind of metric?

How could the process count be negative? What does it mean originally? The number of currently running DAG parsing processes should logically be 0 or positive.
Because it is the total number, i.e. the sum of the delta values, that means "currently running DAG parsing processes". So a single delta value can be negative. By "originally" I mean we just show the value, whether it is negative or positive.
Then, in this case, it seems we never get the absolute value, do we? Does it report the absolute value somehow?
Sorry, I don't get it; could you explain your perspective more?
If a time-series value is delta, let's say (-5, 4, 3, 1, -4), then unless we know the initial value is 10 (or whatever it is), we can never know the exact value of the process number (to use your example).
So, do we have that number, or do we have the total of processes? If not, we can only see the trend.
I think we can't get the total number of processes unless we add up every delta value.
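A tiny numeric sketch of the point above: from deltas alone, any assumed initial value reproduces the same curve, just shifted by a constant, so only the trend is recoverable.

```python
from itertools import accumulate

def reconstruct(initial: int, deltas: list[int]) -> list[int]:
    """Running totals implied by `deltas` if the series started at `initial`."""
    return list(accumulate(deltas, initial=initial))[1:]

deltas = [-5, 4, 3, 1, -4]

# Same shape either way; only the vertical offset differs.
print(reconstruct(10, deltas))  # [5, 9, 12, 13, 9]
print(reconstruct(0, deltas))   # [-5, -1, 2, 3, -1]
```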
There is no "all" concept. That is my point on the delta issue: we are never able to find out the initial value.
Could you check how this works with statsd? Such as checking and trying https://github.com/apache/airflow/pull/29449?
You mean to check how Airflow collects metrics and sends out statsd data?
I'm thinking about how they visualize this type, so I think we could try this on Prometheus/Grafana. AFAIK, we can only show this value as a trend; I don't know whether there is something we missed.
Ok, I get it. I will check how they use their metrics.
I think tasks and pools have an inclusion relation, but the others do not. Furthermore, from the metric name we cannot tell which task is in which pool. Maybe making these components the same level is the only way.
I don't know as much as you do. Pick the way you prefer, and we can discuss details when the dashboards are out. Adjusting these is not hard, don't worry. A PR always takes time.
I think it is because of the counter definition in Prometheus metrics: "A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart."
If Prometheus can identify/use it as a counter, why can't we? We converted it to delta because it isn't cumulative. What are we missing here?
In my opinion, they always do the accumulation for counter metrics, whether the metrics have been stored or not. But we cannot do the accumulation for metrics that have already been stored. I have not verified the counter-handling process in Prometheus because I have not learned Go yet; that is in my future plans.
If you could push a counter to OAP, we could work on that. Your previous context was that there is only a delta.
I think I can only push a "delta type counter" to the OAP via the otel collector. Maybe we can add support for accumulating the "delta type counter"? It may be complicated, but I can try. Or I could just show the data trend from the "delta type counter" data and build the dashboard first.
I think you need to check what a delta counter is. A counter increases or resets. How does delta apply to this case?
I think this `dag_processing.processes` does not meet the Prometheus counter definition; it can decrease. I'm sure, because I tested it. It is the PR I found.
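For context on why a decreasing metric breaks the counter model: Prometheus-style counter handling assumes monotonic-increase-or-reset, so any drop is interpreted as a reset to zero. A minimal illustration of that assumption (my own sketch, not Prometheus code):

```python
def counter_increase(samples: list[float]) -> float:
    """Total increase of a cumulative counter, treating any drop as a reset.

    Mirrors the Prometheus assumption: a counter only grows or resets to
    zero, so when a sample is lower than its predecessor, the whole new
    value counts as fresh increase after the reset.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur if cur < prev else cur - prev
    return total

# Monotonic series: increase is simply last - first.
print(counter_increase([0, 5, 9, 12]))     # 12.0
# A reset at 12 -> 2: the 2 counts as fresh increase.
print(counter_increase([0, 5, 12, 2, 4]))  # 16.0
```

Applied to a metric that legitimately decreases, like `dag_processing.processes`, this logic would over-count, which is exactly why such a metric does not fit the counter definition.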
That is my point in asking. Focus only on this metric: whether they show it, and how they show it.
Ok, I get it.
@mufiye Any update or block?
I think I have to block it here temporarily. I am preparing to find an internship now and have had no time to continue this issue in the last two weeks. You can unassign me from this issue. I think the next step is to add one MAL function to the meter analyzer, then write the MAL rule and build the dashboard. If anyone takes over this task, I can also provide support, such as the config file for the otel collector. I'm sorry for this situation.
Got it. Thanks for the feedback. Take your time for your own interest. That always matters primarily.
Search before asking
Description
This is an open issue for new contributors. Apache Airflow is a widely used workflow scheduler. We encourage someone new to the community to add a new layer catalog (workflow) for Airflow.
Metrics
Airflow exposes metrics through StatsD, https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html. We could use StatsD + the OpenTelemetry StatsD receiver (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/statsdreceiver/README.md) + the OpenTelemetry OTLP exporter to ship the metrics to the SkyWalking OTEL receiver, then use MAL to build metrics as well as a dashboard for those metrics. Notice, a new layer and a new UI menu should be added.
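A rough sketch of the collector configuration for this pipeline. The endpoints and the timer mapping are assumptions to be checked against the statsd receiver README and the OAP deployment; in particular, `observer_type` controls whether timers become histograms, summaries, or gauges.

```yaml
# Sketch only: addresses and option values are illustrative.
receivers:
  statsd:
    endpoint: "0.0.0.0:8125"        # where Airflow's StatsD client sends
    aggregation_interval: 60s
    timer_histogram_mapping:
      - statsd_type: "timing"
        observer_type: "histogram"  # alternatives: "summary", "gauge"

exporters:
  otlp:
    endpoint: "oap-server:11800"    # assumed SkyWalking OTEL receiver address
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [statsd]
      exporters: [otlp]
```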
Logging
Airflow supports Fluentd to ship logs, https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/logging-architecture.html. SkyWalking already has Fluentd setup support, so we should be able to receive and catalog the logs.
Additionally, task logs seem an interesting thing. We could use LAL (Log Analysis Language) to group the logs by task name (or ID) by treating tasks as endpoints (a SkyWalking concept).
Use case
Add more observability for Airflow server.
Related issues
No response
Are you willing to submit a PR?
Code of Conduct