Closed fritz-clicktripz closed 3 years ago
@fritz-clicktripz thanks for the heads up, that was unexpected.
What would be the best way to deal with this?
Following the approach of https://github.com/epoch8/airflow-exporter/issues/59, it might be best to just remove it. I don't know if there's a way for users to turn functionality on or off in the plugin via something like environment variables or command-line arguments. Task IDs might be unique enough on their own; I haven't used other executors, so I'm not sure what benefit the hostname+task_id combination would give.
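For illustration only, a minimal sketch of what an opt-in toggle could look like, assuming a hypothetical `AIRFLOW_EXPORTER_INCLUDE_HOSTNAME` environment variable (the plugin has no such setting today):

```python
import os

# Hypothetical opt-in flag; not an existing airflow-exporter setting.
INCLUDE_HOSTNAME = os.environ.get(
    "AIRFLOW_EXPORTER_INCLUDE_HOSTNAME", "false"
).lower() in ("1", "true", "yes")


def task_labels(dag_id, task_id, hostname):
    """Build the label set for a task metric, adding hostname only when opted in."""
    labels = {"dag_id": dag_id, "task_id": task_id}
    if INCLUDE_HOSTNAME:
        labels["hostname"] = hostname
    return labels
```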
What was the point of having the hostname initially? Detecting that one specific worker (if using something like Celery) is misbehaving, maybe?
I don't think that this responsibility should fall on Airflow... There are tools to monitor hardware/VM health, and Airflow is not one of them. Also, Airflow deployments on Kubernetes using the KubernetesExecutor may quickly become more and more commonplace. Given the lack of means to configure the plugin, I would concur: please just remove the label.
I just ran across the hostname cardinality explosion as well running under Kubernetes, and I echo the sentiment for reverting this breaking change (or making it opt-in). Depending on conditions this change can 1) prevent Prometheus from pulling metrics due to timeouts, or 2) fill up Prometheus with large numbers of time series, which exhausts disk and/or memory, leading to crashes.
While I can empathize with the desire to include the hostname label, and have found such values useful for low-cardinality metrics, it does not scale for Airflow deployments with frequently running DAGs.
+1 for the need to make this opt-in or reverted. The growing cardinality is too big of an issue.
+1 the hostname cardinality seems to be an issue
Hi, everyone. Can somebody elaborate on a specific case where cardinality explosion is an issue?
I believe that everybody is using Prometheus to monitor all kinds of k8s-related resources: pods, jobs, etc. All of these have the same setup: pod or job names are autogenerated and unique. Pods and jobs come and go, producing the same kind of cardinality as the host label here.
Why is it bad? Or why is your Prometheus not already configured for that kind of data? I mean, aside from Airflow monitoring, the same pods are monitored by Prometheus anyway; do you have the same issues there?
It appears to me that it's "remembering" all of the old task runs and the pods in which they ran. When I look at the hostnames, they correspond to task runs (e.g. in the metric airflow_task_status): a new series appears for each run of the same task in the same DAG, because the hostname label is added and then kept, presumably until the Airflow system is restarted. I'm not familiar with this implementation, but I know that in the Go exporters, once you add a count/gauge it sort of lives forever unless you implement your own Collect function.
To borrow your analogy, it would be as if the Prometheus running in a Kubernetes cluster reported on every pod the cluster has ever seen for as long as the cluster has been running. Eventually something needs to let that data fall off when it ages out. And given your endpoint doesn't know the scrape_interval on the Prometheus side, it's hard to guess what's a correct amount of time to present the data.
My two cents, fully admit I'm a bit of an Airflow newbie.
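For what it's worth, the Python equivalent of overriding Collect is a custom collector in prometheus_client, roughly like the sketch below; `get_current_task_counts()` is a placeholder for whatever query returns the state of "now", not anything the plugin actually exposes:

```python
from prometheus_client.core import GaugeMetricFamily, REGISTRY


def get_current_task_counts():
    """Placeholder: return (dag_id, task_id, state, count) tuples for the current moment."""
    return [("example_dag", "example_task", "running", 1.0)]


class TaskStatusCollector:
    """Rebuilds the metric family on every scrape, so series that no longer
    exist simply stop being exported instead of living forever."""

    def collect(self):
        family = GaugeMetricFamily(
            "airflow_task_status",
            "Number of task instances per dag/task/state",
            labels=["dag_id", "task_id", "status"],
        )
        for dag_id, task_id, state, count in get_current_task_counts():
            family.add_metric([dag_id, task_id, state], count)
        yield family


REGISTRY.register(TaskStatusCollector())
```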
You seem to forget that the cardinality of the pod ID is not necessarily the problem in itself; it is that cardinality multiplied by all the other tag combinations that come along with it (a combinatorial explosion can happen there).
Another factor is that time is "elastic" with Airflow. When you backfill over a long time period, you may create quite a lot of pods in a short effective time period. Of course the effect may be more or less disastrous depending on the length of the backfill period, the granularity of your scheduling, the speed at which your tasks complete, etc. It may be perfectly okay for your use cases, but not for those of others. I think that, at the least, we should make this configurable.
Expanding on @justin-watkinson-sp's line of reasoning, I just re-enabled v1.3.0 in our dev cluster to see what the data looked like again. For our lower-volume dev cluster there are ~26,000 distinct metrics weighing in at 5 MB. Pulling all this data from Postgres and generating the Prometheus metrics hits the web process hard, makes it quite unresponsive (GIL, GC, other inefficient construction of metrics?), and probably causes a bit of load on the Prometheus server side as well.
I'm not sure if any filtering is applied to the list that is generated or if all task runs are emitted (we have a job that trims data older than 30 days, but that's an extra step that I'd guess many do not take). Perhaps only emitting data for pods from the last 1-24 hours would address the metric bloat.
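As a rough sketch of that cutoff idea, assuming Airflow 2.x's `provide_session` helper and the `TaskInstance` model (column names vary between Airflow versions), something like:

```python
from datetime import datetime, timedelta

from airflow.models import TaskInstance
from airflow.utils.session import provide_session


@provide_session
def recent_task_instances(window_hours=24, session=None):
    """Only consider task instances that started inside the window,
    instead of everything ever stored in the metadata DB."""
    cutoff = datetime.utcnow() - timedelta(hours=window_hours)
    return (
        session.query(TaskInstance)
        .filter(TaskInstance.start_date >= cutoff)
        .all()
    )
```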
Tangentially related - I'm wondering about the original intent of the PR. The field is hostname, but the value is actually the pod name. For me, I'd be more interested in the actual hostname (to know if a particular host is having issues) and not in metrics for any given short-lived pod, which I think would substantially reduce the cardinality and number of metrics emitted (though it still may not suffice for those with large worker pools). This data may not be available, however.
Exporters should mostly concern themselves with the state of "now" and expect that the Prometheus job is set up with a reasonable scrape interval that answers the user's question. That could be as little as 1s or as much as 1d. That's why it's generally a bad idea to try and guess at that at the exporter level, because you simply don't know. My payload size is currently around 58k lines and 11 MB, so we also have to consider scrape timeouts in the current strategy.
That said, if tracking data over a short span of time is necessary, there are ways to do that. In Go you can override the Collect function and work with the scrape using raw metric objects, as opposed to the built-in counters/gauges in the default registry. Then you can take advantage of things like adding timestamps to the Metric object, and Prometheus can be configured to use honor_timestamps in its job configuration.
It's important to note that once a timestamped value is emitted from a target, it cannot be updated (or at least it couldn't last time I tried; it's an error in the server), but generally time-series databases and time machines don't mix anyway 😄. If I recall correctly, you also can't emit metrics with timestamps older than your WAL (default 2h), because that chunk has already been compacted.
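For reference, the Python client supports the same trick: `add_metric` accepts an explicit timestamp (in seconds), which Prometheus only honors when the scrape job sets `honor_timestamps: true`. A toy sketch with a made-up metric name:

```python
import time

from prometheus_client.core import GaugeMetricFamily


def collect_with_timestamp(observed_value, observed_at=None):
    """Export a sample carrying its own timestamp (seconds since epoch).
    Prometheus only respects it when the scrape job enables honor_timestamps,
    and it rejects samples older than roughly the WAL window."""
    observed_at = observed_at if observed_at is not None else time.time()
    family = GaugeMetricFamily(
        "example_observed_value",
        "A sample with an explicit observation timestamp",
        labels=["source"],
    )
    family.add_metric(["example"], observed_value, timestamp=observed_at)
    return family
```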
Finally, I realised the essence of this issue.
Basically the problem is that by using airflow-exporter we report not only on currently running DAGs and tasks, but on everything that was ever stored in the DB. So there's no way for hostname cardinality to go down. This is different from monitoring currently running pods in k8s, because the set of currently running pods has a roughly constant size, and well-defined data rotation keeps the number of time series more or less constant.
It seems that a better mental model for airflow-exporter would be not "reporting on the state of physical tasks", but "reporting on the state of the scheduler". In that case, physical attributes should be stripped from task metrics.
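Concretely, that could mean aggregating purely by logical identity, something like the sketch below (assuming SQLAlchemy access to Airflow's `TaskInstance` model; this is not necessarily the plugin's actual query):

```python
from sqlalchemy import func

from airflow.models import TaskInstance
from airflow.utils.session import provide_session


@provide_session
def task_state_counts(session=None):
    """Counts per (dag_id, task_id, state) only, with no hostname/pod label,
    so cardinality is bounded by the DAG structure rather than by run history."""
    return (
        session.query(
            TaskInstance.dag_id,
            TaskInstance.task_id,
            TaskInstance.state,
            func.count(TaskInstance.task_id),
        )
        .group_by(TaskInstance.dag_id, TaskInstance.task_id, TaskInstance.state)
        .all()
    )
```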
I will review and accept the long-standing PR that deals with hostname in the following couple of days.
Everyone, sorry for the inconvenience caused.
I released 1.3.2, which removes hostname from task status metrics.
The recent addition of the hostname label, when applied to tasks run with the Kubernetes executor, has created a breaking change. We are generating unlimited cardinality: in our case each hostname is a pod with a unique identifier suffix.