Closed: raphaelauv closed this issue 9 months ago
I believe that just shutting down the task in this case will not solve the problem, and retrying by Celery is a good strategy, accounting for a possibly intermittent cause of the problem.
Have you checked whether there is a Celery configuration that can behave differently here? Will it retry infinitely, or does the result backend you chose have a configuration option to set a maximum number of retries? Have you exhausted all the Celery configuration options for that?
If there is an easy way to recover via Celery itself, by setting max retries or similar (for the chosen backend), then the solution is to use that configuration.
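For example, recent Celery versions expose result backend retry settings such as `result_backend_always_retry` and `result_backend_max_retries`, and Airflow lets you pass a custom Celery config dict through `[celery] celery_config_options`. A minimal sketch of wiring that in follows; the module name `custom_celery_config` is hypothetical, and the `DEFAULT_CELERY_CONFIG` import path may differ depending on your Airflow/Celery provider version:

```python
# custom_celery_config.py -- hypothetical module referenced from airflow.cfg:
#   [celery]
#   celery_config_options = custom_celery_config.CELERY_CONFIG
# The DEFAULT_CELERY_CONFIG import path below matches recent celery provider
# versions; adjust it to your installation.
from airflow.providers.celery.executors.default_celery import DEFAULT_CELERY_CONFIG

CELERY_CONFIG = {
    **DEFAULT_CELERY_CONFIG,
    # Celery settings controlling retries against the result backend; check the
    # Celery docs for the exact semantics in your Celery version.
    "result_backend_always_retry": True,
    "result_backend_max_retries": 10,
}
```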
Also, I think the issue is wrongly stated. If the problem is of the nature you described, none of the processes on the same worker will have a connection to the DB for any of the tasks they are running, and the same holds for any other forked processes on that worker. So there is a much bigger problem that should be dealt with at the deployment level.
I think there is no expectation that Airflow components (like the Celery worker) will handle all such exceptions on their own and self-heal from any situation like that. Just stopping a worker will not solve the REAL issue, which is that the deployment of the Celery worker component stopped working properly.
Do you have healthcheck/liveness probes running that would automatically detect such situations and shut down/restart such a faulty component in place?
Other than that, I think it's quite unreasonable to expect that an application (a Python process) will recover from situations where, suddenly during execution, some of the basic assumptions the process had (the ability to connect uninterruptedly to the metadata DB) no longer hold. Airflow on its own is not able to self-heal the environment; this is why many deployments have healthcheck and liveness probes to check whether the software they control is still "alive" and can react in those cases (most of the time by killing the process that shows problems and failing it over to another machine).
Would love to hear from you what kind of deployment-level handling you have in place for that - explaining it here might also help others see what "deployment managers" should look at when experiencing similar problems and what their monitoring should include.
Airflow is deployed with the Apache Airflow Helm chart, but with a restriction on pgbouncer (it only accepts a precise source port range for connections), while the forked processes in the Airflow worker use a "random" port range. The solution is to relax the source port range limitation on pgbouncer.
(The health check of the worker was not failing because the worker itself was able to connect to pgbouncer.)
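Since the default check only proves that the parent worker process can reach pgbouncer, a deeper probe would need to fork a child and open a fresh DB connection from it, the same way the forked task processes do. A minimal sketch, assuming Airflow's `create_session` helper is available on the worker; the script name `db_fork_probe.py` is hypothetical:

```python
#!/usr/bin/env python
# db_fork_probe.py -- hypothetical extra liveness check: fork a child process
# (as the worker does for tasks) and verify that the child can reach the
# metadata DB. Exits non-zero if the child cannot connect.
import os
import sys

from sqlalchemy import text

from airflow.utils.session import create_session


def check_db_from_child() -> int:
    pid = os.fork()
    if pid == 0:
        # Child: open a fresh connection (new source port) and run a trivial query.
        try:
            with create_session() as session:
                session.execute(text("SELECT 1"))
            os._exit(0)
        except Exception:
            os._exit(1)
    # Parent: wait for the child and report its exit code.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    sys.exit(check_db_from_child())
```

A check like this could be added to the worker's liveness probe so that a worker whose forked children cannot reach the DB gets restarted instead of silently accumulating stuck tasks.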
Since it's a network-policy-specific thing and, as you explained, we don't want Airflow to manage that case, I'm closing the issue.
Thank you for your explanation and your responsiveness on the issue :+1:
Apache Airflow version
2.8.1
If "Other Airflow 2 version" selected, which one?
No response
What happened?
In CeleryExecutor,
if the DB is not available to the forked process that is running the task "in" the Airflow worker process (for any invalid network configuration, like a restriction on the source ports of connections),
tasks that save an XCom at the end of the run are never failed by the worker and stay in the running state.
What you think should happen instead?
The worker should fail the tasks and not retry indefinitely.
How to reproduce
Break the connection between the worker and the DB (or pgbouncer).
Tasks that need to save an XCom will stay in the running state.
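Any task that returns a value exercises this path, since the return value is pushed to XCom after the task body finishes. A minimal repro DAG sketch (names are illustrative):

```python
# xcom_repro_dag.py -- minimal DAG sketch: the task returns a value, so the
# worker's forked task process must write an XCom (a metadata DB write) at the
# end of the run. With the DB unreachable from the fork, the task hangs in
# "running" instead of failing.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def xcom_repro():
    @task
    def push_value():
        return "anything"  # return value is pushed to XCom after execution

    push_value()


xcom_repro()
```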
Operating System
ubuntu 22.04
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
Code of Conduct