vDMG opened this issue 2 years ago
Thanks for opening your first issue here! Be sure to follow the issue template!
please assign to me
Well, in my opinion we have two ways:
Also, it's strange, but I found a duplicate of executor.running_tasks: scheduler.tasks.running. It seems we should delete it, because it's always zero:
num_tasks_in_executor = 0
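# num_tasks_in_executor is never updated anywhere else, so the gauge below always reports 0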
Stats.gauge("scheduler.tasks.running", num_tasks_in_executor)
It never changes, and there are only two usages of the variable num_tasks_in_executor.
What do you think? @potiuk @eladkal
By option 1 I mean that we would have to change the scheduler heartbeat logic, because for now each scheduler sends it independently.
But we could have a separate process that reads the heartbeats from the DB and sends the metric; that way seems cleaner and better to me.
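To make this concrete, here is a minimal sketch of such a standalone process, assuming the jobs table still records scheduler heartbeats (Job / job_type / latest_heartbeat, as in current Airflow; it was BaseJob in older versions) and using a hypothetical metric name scheduler.heartbeats.alive. This is an illustration, not existing Airflow code:

import time
from datetime import timedelta

from airflow.jobs.job import Job  # airflow.jobs.base_job.BaseJob in older Airflow versions
from airflow.stats import Stats
from airflow.utils import timezone
from airflow.utils.session import create_session


def publish_scheduler_heartbeat_metric(poll_interval_seconds: int = 30) -> None:
    # One process (rather than each scheduler) counts schedulers with a recent
    # heartbeat and publishes a single aggregated gauge.
    while True:
        threshold = timezone.utcnow() - timedelta(seconds=60)
        with create_session() as session:
            alive = (
                session.query(Job)
                .filter(Job.job_type == "SchedulerJob")
                .filter(Job.latest_heartbeat > threshold)
                .count()
            )
        Stats.gauge("scheduler.heartbeats.alive", alive)  # hypothetical metric name
        time.sleep(poll_interval_seconds)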
From my perspective, each scheduler should have its own metric, as it is a separate process. Summing/aggregating might hide an underlying issue. To my knowledge, most "metric display/tracker" tools can aggregate on their side, so I'd prefer to provide the data as raw as we can. However, this is not my area of expertise, so let's hear what others think. cc @ferruzzi WDYT?
That would be complex and brittle, and it goes against the "distributed" system architecture: we would have to either elect a "leader" (if we want one of the schedulers to publish it) or add a new entity to publish such metrics. That sounds complicated and smells of ZooKeeper (https://zookeeper.apache.org/), which is something we definitely want to avoid.
The problem is that currently we have no unique scheduler id that we can rely on. The schedulers are "identical" copies of the same process and we neither enumerate them, nor have another way of distinguishing them. So far at least.
Yes. Option 2 is better, but the fact that we cannot distinguish them makes it difficult both to generate the metrics and, even more so, to aggregate them. The problem is that when we restart a scheduler, the id (possibly) should not change, otherwise the function aggregating the metrics will see it as "another" scheduler. This might or might not be a problem, depending on the metrics.
We have a few options there:
1) use a UUID or similar, generated at scheduler start, as the name - we have a few UUID candidates (https://docs.python.org/3/library/uuid.html), but none of them guarantees that the id will remain unchanged after a restart (especially in a container environment). It could be random or hardware-based (or both).
2) similarly to workers, use the hostname callable to get a unique id for the scheduler when it starts. This is not fool-proof either, because the hostname/IP might also (and often does) change after a restart.
The problem is that option 2) and non-random UUID generation do not guarantee uniqueness if a user chooses to run several schedulers on the same hardware without containerisation - two schedulers running on the same hardware will have the same id (unless randomness is involved). This is not very likely, but it is a possible scenario, because running multiple schedulers on the same hardware is generally a good idea (such schedulers will be faster due to Python multi-processing/GIL limitations). The UUID calculation and hostname should probably also include the port number on which the scheduler exposes its healthcheck, or something like that.
3) add a flag to specify the scheduler ID when starting it (this might override 1) or 2) in cases where the ID is expected to remain the same for a specific deployment).
I think the metrics in question should be fine to aggregate if the UUID is randomly allocated (I have not looked at those that are involved, so it needs to be checked metric-by-metric). However, it will result in UUIDs changing over time as schedulers run and get restarted. I am not sure how statsd monitoring software / aggregation will handle that - something to be investigated by whoever makes such a change. Option 3) might be a way to limit the impact of that, but at the expense of complicating deployment. Currently we add a new scheduler by just starting another scheduler process. If we also add the option to specify the scheduler id at start, our Helm Chart should likely be updated with that option, and documentation should be written to cover it.
Generally doable, but I just wanted to make everyone in the discussion aware that this is quite a bit more complex than "just send metrics with a unique id". It would be simpler if we had the id, but we don't currently have it, and there are implications of introducing it. A rough sketch of how such an id could be derived is below.
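To illustrate, a rough sketch combining the options above in one helper. This is not existing Airflow code: get_hostname() mirrors the worker hostname_callable mechanism, the scheduler_health_check_server_port setting only exists in newer Airflow versions (the fallback value is a guess), and the explicit_id argument stands in for the hypothetical flag from option 3:

import uuid
from typing import Optional

from airflow.configuration import conf
from airflow.utils.net import get_hostname


def scheduler_instance_id(explicit_id: Optional[str] = None) -> str:
    if explicit_id:  # option 3: id passed via a flag, stable across restarts
        return explicit_id
    hostname = get_hostname()  # option 2: hostname callable, may change after a restart
    # Mixing in the healthcheck port guards against two schedulers on the same host.
    port = conf.getint("scheduler", "scheduler_health_check_server_port", fallback=8974)
    random_part = uuid.uuid4().hex[:8]  # option 1: random component, changes on every restart
    return f"{hostname}-{port}-{random_part}"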
Okay, I will investigate the 2nd option.
Thanks for the tag, @eladkal. Yeah, I agree with how the conversation has progressed for sure. Separate in name (and add the name to the metric tags) seems like it would be ideal. @potiuk's response about the difficulties in that definitely makes it sound like a challenge.
Since I mentioned taking on the OTel integration, I've had one or two people tell me that not all of the metrics listed on our docs page work, so I planned to double-check as many as I can once I have OTel working and update that page or the code as appropriate, but that's a bit of a longer-term goal at this point. Thanks for pointing this one out, @vDMG, and thanks for taking it on, @Yaro1 :+1:
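For concreteness, a hedged sketch of the "separate in name / add the name to the metric tags" idea, assuming a scheduler_id is available (see the id discussion above). emit_running_tasks is a hypothetical helper, and the tags= keyword only takes effect on tag-aware backends (DataDog statsd or the OTel integration) in newer Airflow versions:

from airflow.stats import Stats


def emit_running_tasks(scheduler_id: str, running: int) -> None:
    # Variant A: scheduler id baked into the metric name (works with plain statsd).
    Stats.gauge(f"executor.running_tasks.{scheduler_id}", running)
    # Variant B: scheduler id as a tag (requires a tag-aware statsd/OTel backend).
    Stats.gauge("executor.running_tasks", running, tags={"scheduler_id": scheduler_id})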
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Kindly recheck the report against the latest Airflow version and let us know if the issue is reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.
This issue has been closed because it has not received response from the issue author.
@ferruzzi Worth looking at this again as part of the workstream where you guys are looking to audit all metrics
Apache Airflow version
Other Airflow 2 version
What happened
When I run multiple schedulers (>1), the statsd exporter does not correctly sum the running tasks from the different schedulers; it just takes the metric from one scheduler and displays it as the value.
The metric name:
airflow_executor_running_tasks
What you think should happen instead
The Airflow statsd exporter should sum up the running tasks from all existing and running schedulers to calculate
airflow_executor_running_tasks
How to reproduce
Run an airflow cluster with 2 schedulers and with statsd enabled.
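For reference, statsd can be enabled via airflow.cfg (or the equivalent Helm chart values); the host, port and prefix below are illustrative:

[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow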
Create enough tasks to balance them between the 2 schedulers. Using the UI, you can check that the displayed value and the
airflow_executor_running_tasks
metric exposed by statsd do not show the same number.
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==3.4.0
apache-airflow-providers-celery==2.1.4
apache-airflow-providers-cncf-kubernetes==4.0.2
apache-airflow-providers-docker==2.7.0
apache-airflow-providers-elasticsearch==3.0.3
apache-airflow-providers-ftp==2.1.2
apache-airflow-providers-google==7.0.0
apache-airflow-providers-grpc==2.0.4
apache-airflow-providers-hashicorp==2.2.0
apache-airflow-providers-http==2.1.2
apache-airflow-providers-imap==2.2.3
apache-airflow-providers-microsoft-azure==3.9.0
apache-airflow-providers-mysql==2.2.3
apache-airflow-providers-odbc==2.0.4
apache-airflow-providers-postgres==4.1.0
apache-airflow-providers-redis==2.0.4
apache-airflow-providers-sendgrid==2.0.4
apache-airflow-providers-sftp==2.6.0
apache-airflow-providers-slack==4.2.3
apache-airflow-providers-sqlite==2.1.3
apache-airflow-providers-ssh==2.4.4
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct