vDMG opened this issue 2 years ago
Thanks for opening your first issue here! Be sure to follow the issue template!
please assign to me
Well, in my opinion we have two ways:
Also, it's strange, but I found a duplicate of executor.running_tasks: scheduler.tasks.running. It seems we should delete it, because it's always zero:
num_tasks_in_executor = 0
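# num_tasks_in_executor is never updated anywhere else, so the gauge below always reports 0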
Stats.gauge("scheduler.tasks.running", num_tasks_in_executor)
It never changes, and there are only two usages of the variable num_tasks_in_executor.
What do you think? @potiuk @eladkal
By option 1 I mean that we would have to change the scheduler heartbeat logic, because for now each scheduler sends it independently.
But we could have a separate process that reads the heartbeats from the DB and sends the metric; that way seems cleaner and better to me.
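To make this concrete, here is a minimal sketch of such a standalone process, assuming the jobs table still records scheduler heartbeats (Job / job_type / latest_heartbeat, as in current Airflow; it was BaseJob in older versions) and using a hypothetical metric name scheduler.heartbeats.alive. This is an illustration, not existing Airflow code:

import time
from datetime import timedelta

from airflow.jobs.job import Job  # airflow.jobs.base_job.BaseJob in older Airflow versions
from airflow.stats import Stats
from airflow.utils import timezone
from airflow.utils.session import create_session


def publish_scheduler_heartbeat_metric(poll_interval_seconds: int = 30) -> None:
    # One process (rather than each scheduler) counts schedulers with a recent
    # heartbeat and publishes a single aggregated gauge.
    while True:
        threshold = timezone.utcnow() - timedelta(seconds=60)
        with create_session() as session:
            alive = (
                session.query(Job)
                .filter(Job.job_type == "SchedulerJob")
                .filter(Job.latest_heartbeat > threshold)
                .count()
            )
        Stats.gauge("scheduler.heartbeats.alive", alive)  # hypothetical metric name
        time.sleep(poll_interval_seconds)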
From my perspective, each scheduler should have its own metric, as it is a separate process. Summing/aggregating might hide an underlying issue. To my knowledge, most "metric display/tracker" tools can aggregate on their side, so I'd prefer to provide the data as raw as we can. However, this is not my area of expertise, so let's hear what others think. cc @ferruzzi WDYT?
That would be complex and brittle, and it goes against the "distributed" system architecture: we would have to either elect a "leader" (if we want one of the schedulers to publish it) or add a new entity to publish such metrics. That sounds complicated and smells of ZooKeeper (https://zookeeper.apache.org/), which is something we definitely want to avoid.
The problem is that currently we have no unique scheduler id that we can rely on. The schedulers are "identical" copies of the same process and we neither enumerate them, nor have another way of distinguishing them. So far at least.
Yes. Option 2 is better, but the fact that we cannot distinguish them makes it difficult both to generate the metrics and, even more so, to aggregate them. The problem is that when we restart a scheduler, the id (possibly) should not change, otherwise the function aggregating the metrics will see it as "another" scheduler. This might or might not be a problem, depending on the metrics.
We have a few options there:
1) use a UUID or similar, generated at scheduler start, as the name - we have a few UUID candidates (https://docs.python.org/3/library/uuid.html), but none of them guarantees that the id will remain unchanged after a restart (especially in a container environment). It could be random or hardware-based (or both).
2) similarly to workers, use the hostname callable to get a unique id for the scheduler when it starts. This is not fool-proof either, because the hostname/IP might also (and often does) change after a restart.
The problem is that option 2) and non-random UUID generation do not guarantee uniqueness if a user chooses to run several schedulers on the same hardware without containerisation - two schedulers running on the same hardware will have the same id (unless randomness is involved). This is not very likely, but it is a possible scenario, because running multiple schedulers on the same hardware is generally a good idea (such schedulers will be faster due to Python multi-processing/GIL limitations). The UUID calculation and hostname should probably also include the port number on which the scheduler exposes its healthcheck, or something like that.
3) add a flag to specify the scheduler ID when starting it (this might override 1) or 2) in cases where the ID is expected to remain the same for a specific deployment).
I think the metrics in question should be fine to aggregate if the UUID is randomly allocated (I have not looked at those that are involved, so it needs to be checked metric-by-metric). However, it will result in UUIDs changing over time as schedulers run and get restarted. I am not sure how statsd monitoring software / aggregation will handle that - something to be investigated by whoever makes such a change. Option 3) might be a way to limit the impact of that, but at the expense of complicating deployment. Currently we add a new scheduler by just starting another scheduler process. If we also add the option to specify the scheduler id at start, our Helm Chart should likely be updated with that option, and documentation should be written to cover it.
Generally doable, but I just wanted to make everyone in the discussion aware that this is quite a bit more complex than "just send metrics with a unique id". It would be simpler if we had the id, but we don't currently have it, and there are implications of introducing it. A rough sketch of how such an id could be derived is below.
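To illustrate, a rough sketch combining the options above in one helper. This is not existing Airflow code: get_hostname() mirrors the worker hostname_callable mechanism, the scheduler_health_check_server_port setting only exists in newer Airflow versions (the fallback value is a guess), and the explicit_id argument stands in for the hypothetical flag from option 3:

import uuid
from typing import Optional

from airflow.configuration import conf
from airflow.utils.net import get_hostname


def scheduler_instance_id(explicit_id: Optional[str] = None) -> str:
    if explicit_id:  # option 3: id passed via a flag, stable across restarts
        return explicit_id
    hostname = get_hostname()  # option 2: hostname callable, may change after a restart
    # Mixing in the healthcheck port guards against two schedulers on the same host.
    port = conf.getint("scheduler", "scheduler_health_check_server_port", fallback=8974)
    random_part = uuid.uuid4().hex[:8]  # option 1: random component, changes on every restart
    return f"{hostname}-{port}-{random_part}"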
Okay, I will investigate the 2nd option.
Thanks for the tag, @eladkal. Yeah, I agree with how the conversation has progressed for sure. Separate in name (and add the name to the metric tags) seems like it would be ideal. @potiuk's response about the difficulties in that definitely makes it sound like a challenge.
Since I mentioned taking on the OTel integration, I've had one or two people tell me that not all of the metrics listed on our docs page work, so I planned to double-check as many as I can once I have OTel working and update that page or the code as appropriate, but that's a bit of a longer-term goal at this point. Thanks for pointing this one out, @vDMG, and thanks for taking it on, @Yaro1 :+1:
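For concreteness, a hedged sketch of the "separate in name / add the name to the metric tags" idea, assuming a scheduler_id is available (see the id discussion above). emit_running_tasks is a hypothetical helper, and the tags= keyword only takes effect on tag-aware backends (DataDog statsd or the OTel integration) in newer Airflow versions:

from airflow.stats import Stats


def emit_running_tasks(scheduler_id: str, running: int) -> None:
    # Variant A: scheduler id baked into the metric name (works with plain statsd).
    Stats.gauge(f"executor.running_tasks.{scheduler_id}", running)
    # Variant B: scheduler id as a tag (requires a tag-aware statsd/OTel backend).
    Stats.gauge("executor.running_tasks", running, tags={"scheduler_id": scheduler_id})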
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Kindly recheck the report against the latest Airflow version and let us know if the issue is reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.
This issue has been closed because it has not received response from the issue author.
@ferruzzi Worth looking at this again as part of the workstream where you guys are looking to audit all metrics
Apache Airflow version
Other Airflow 2 version
What happened
When I run multiple schedulers (>1), the statsd exporter does not correctly sum the running tasks from the different schedulers; it just takes the metric from one scheduler and displays it as the value.
The metric name:
airflow_executor_running_tasks
What you think should happen instead
The Airflow statsd exporter should sum up the running tasks from all existing and running schedulers to calculate
airflow_executor_running_tasks
How to reproduce
Run an airflow cluster with 2 schedulers and with statsd enabled.
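For reference, statsd can be enabled via airflow.cfg (or the equivalent Helm chart values); the host, port and prefix below are illustrative:

[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow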
Create enough tasks to balance them between the 2 schedulers. Using the UI, you can check that the displayed value and the
airflow_executor_running_tasks
metric exposed by statsd do not show the same number.
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==3.4.0
apache-airflow-providers-celery==2.1.4
apache-airflow-providers-cncf-kubernetes==4.0.2
apache-airflow-providers-docker==2.7.0
apache-airflow-providers-elasticsearch==3.0.3
apache-airflow-providers-ftp==2.1.2
apache-airflow-providers-google==7.0.0
apache-airflow-providers-grpc==2.0.4
apache-airflow-providers-hashicorp==2.2.0
apache-airflow-providers-http==2.1.2
apache-airflow-providers-imap==2.2.3
apache-airflow-providers-microsoft-azure==3.9.0
apache-airflow-providers-mysql==2.2.3
apache-airflow-providers-odbc==2.0.4
apache-airflow-providers-postgres==4.1.0
apache-airflow-providers-redis==2.0.4
apache-airflow-providers-sendgrid==2.0.4
apache-airflow-providers-sftp==2.6.0
apache-airflow-providers-slack==4.2.3
apache-airflow-providers-sqlite==2.1.3
apache-airflow-providers-ssh==2.4.4
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct