Does the container get restarted because of the probes? Could you share the logs of the worker?
It says up for retry if I set retries in the DAG; otherwise it just fails the job.
Hi Junaid
This behaviour could be expected to some extent. By default the readiness probe starts 30 seconds after pod startup, and the main container may not be ready at the very beginning. In your case, the long-running DAG may be causing the readiness probe to take longer than expected. It might be worth increasing `worker.readinessProbe.initialDelaySeconds`.
From your comment above:
> It says up for retry if I set retries in the DAG; otherwise it just fails the job.
I am assuming that your worker pod is not restarting, but the DAG job is. Could you share your configuration and how long it takes?
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
@fmulero Were you able to solve this? I am having the same issue on versions 2.3.2 and 2.3.3.
@bdsoha No, it's still an issue. I think what's happening is that the k8s executor doesn't need an HTTP endpoint to report the status of the job, so we will need to modify the readiness probe to use some kubectl command instead. I haven't gotten around to doing that yet.
You can increase the initial delay for the readiness probe by setting the value `worker.readinessProbe.initialDelaySeconds` in your values file, or by adding it with `--set` in your `helm` command.
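For example (the release name `my-airflow` and the 300-second delay below are placeholders; adjust them to your deployment):

```yaml
# values.yaml (excerpt): give the worker more time before the first readiness check
worker:
  readinessProbe:
    initialDelaySeconds: 300
```

```bash
# or set it directly on the helm command line
helm upgrade --install my-airflow bitnami/airflow \
  --set worker.readinessProbe.initialDelaySeconds=300
```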
I needed to completely disable both the readiness and liveness probes from my chart.
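In case it helps others, a minimal sketch of that fix, assuming the chart exposes the usual Bitnami-style `enabled` toggles for the worker probes (check the chart's values.yaml for the exact keys in your version):

```yaml
# values.yaml (excerpt): turn off both worker probes entirely
worker:
  livenessProbe:
    enabled: false
  readinessProbe:
    enabled: false
```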
> I needed to completely disable both the readiness and liveness probes from my chart.
I use the same fix for now. In my case, I have a PythonOperator task that runs for more than 5 minutes.
The current readiness and liveness probes are TCP probes on the worker logs port (8793 by default). I am not sure but my understanding of the airflow doc (https://airflow.apache.org/docs/apache-airflow/2.3.3/logging-monitoring/logging-tasks.html#serving-logs-from-workers) is that the log server is not started when using the k8s executor.
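For reference, the probes the chart currently renders on the worker container look roughly like this (an illustrative sketch of the tcpSocket probes described above, not the exact chart output):

```yaml
# Illustrative: both probes target the worker logs port, which nothing serves
# when tasks run under the KubernetesExecutor
readinessProbe:
  tcpSocket:
    port: 8793
livenessProbe:
  tcpSocket:
    port: 8793
```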
Any update on this? It was moved from "In progress" to "Solved"?
I assume that there is nothing providing any liveness/readiness/health check for a pod running a task in kubernetes when we use KubernetesExecutor and PythonOperator?
> I assume that there is nothing providing any liveness/readiness/health check for a pod running a task in kubernetes when we use KubernetesExecutor and PythonOperator?
So just disable the `readinessProbe` and `livenessProbe` of the workers in the deployment?
When we use `CeleryKubernetesExecutor`, we expect that:

- celery-worker-pods have a `readinessProbe` and a `livenessProbe`
- kubernetes-worker-pods running with `LocalExecutor` have no `readinessProbe` and `livenessProbe`, or respond to the `readinessProbe` and `livenessProbe`

Maybe we can customize the `pod_template.yaml` to implement the above. But for convenience, we just disable all `readinessProbe` and `livenessProbe` settings, to support pods running long-running tasks.
Hi @mujiannan
You can set liveness and readiness probes for worker nodes with these values: https://github.com/bitnami/charts/blob/bb98b43256eb4479bbc2623835e6aca070daa6bb/bitnami/airflow/values.yaml#L716-L745
Please note that the liveness and readiness logic depends on the executor you chose.
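For anyone who prefers tuning rather than disabling, the block linked above exposes the standard probe knobs; a sketch with placeholder numbers (tune them to your workload, and check the linked values.yaml for the exact keys in your chart version):

```yaml
# values.yaml (excerpt): relax the worker probes instead of disabling them
worker:
  livenessProbe:
    enabled: true
    initialDelaySeconds: 180
    periodSeconds: 20
    timeoutSeconds: 5
    failureThreshold: 6
  readinessProbe:
    enabled: true
    initialDelaySeconds: 60
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 6
```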
My company's k8s clusters had strict rules requiring all containers to have a `readinessProbe` and `livenessProbe` enabled, so disabling was not an option. We resolved this by modifying the `pod_template.yaml` to use an `exec` probe instead of a `tcpSocket` probe, which I've recreated in my personal fork: https://github.com/GavinColwell/charts/commit/66898a700a5f40b25d1a93c9f445d184397bdba6

You can also probably resolve it without modifying the chart directly by setting custom probes in your values file:
```yaml
worker:
  customStartupProbe:
    exec:
      command:
        - airflow
        - jobs
        - check
        - --local
        - --job-type
        - LocalTaskJob
  customReadinessProbe:
    exec:
      command:
        - airflow
        - jobs
        - check
        - --local
        - --job-type
        - LocalTaskJob
  customLivenessProbe:
    exec:
      command:
        - airflow
        - jobs
        - check
        - --local
        - --job-type
        - LocalTaskJob
```
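If you want to sanity-check the command before wiring it into the probes, you can run it by hand inside a running worker pod (the pod name below is a placeholder); my understanding is that `airflow jobs check` exits non-zero when no matching alive job is found:

```bash
kubectl exec -it <worker-pod-name> -- airflow jobs check --local --job-type LocalTaskJob
```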
Name and Version
bitnami/airflow 12.3.0
Additional information
Is it expected behaviour? Should liveness and readiness probes on worker pods be disabled?