Closed Jagadeepkosuri closed 2 years ago
It seems it is not an issue related to the Bitnami Airflow container image or Helm chart but about how the application or environment is being used/configured.
For information regarding the application itself, customization of the content within the application, or questions about the use of technology or infrastructure, we highly recommend checking the forums and user guides made available by the project behind the application or the technology.
That said, we will keep this ticket open until the stale bot closes it just in case someone from the community adds some valuable info.
@Jagadeepkosuri check whether CoreDNS (or whichever DNS service your cluster uses) is running in Kubernetes. If the DNS service is running, try spinning up a busybox pod and pinging the other services to test whether they are reachable.
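Since the failures are sporadic, a single ping may not catch them. As a minimal sketch (not something from the chart or from Airflow itself), the same check can be repeated from inside a pod with Python's standard library. The function name `probe_dns`, the default port, and the attempt and delay values are assumptions chosen for illustration.

```python
import socket
import time


def probe_dns(hostname, attempts=20, delay=0.5, port=5432,
              resolver=socket.getaddrinfo):
    """Resolve `hostname` repeatedly and collect the attempts that fail.

    Returns a list of (attempt_index, error_message) tuples, one per
    failed lookup. `resolver` is injectable so the loop can be exercised
    without a real DNS server.
    """
    failures = []
    for attempt in range(attempts):
        try:
            resolver(hostname, port)
        except socket.gaierror as exc:
            # gaierror is what the socket module raises for lookup
            # failures such as "Temporary failure in name resolution".
            failures.append((attempt, str(exc)))
        if delay:
            time.sleep(delay)
    return failures
```

Calling `probe_dns("<your-postgresql-service-name>")` from a pod in the same namespace as the workers and printing the result shows how many lookups failed and with which message. If lookups fail intermittently here too, that suggests the problem lies with cluster DNS rather than with Airflow.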
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
Name and Version
bitnami/airflow 10.3.1
What steps will reproduce the bug?
Install Airflow along with a PostgreSQL database on Kubernetes. (I'm using Azure Kubernetes Service.)
Are you using any custom parameters or values?
What is the expected behavior?
Airflow DAGs shouldn't fail with PostgreSQL name resolution errors. I assumed that running the Kubernetes Executor with high parallelism was opening too many database connections or overwhelming CoreDNS, so I switched to running 30 Celery worker pods, but I am still facing the same issue.
What do you see instead?
I see the error below sporadically. Initially, it happened only in Model jobs, i.e., DAGs that trigger other child DAGs. The task in the Model job fails with the error below, but the child DAGs triggered by the failed task run without failing. Now it happens in individual DAGs as well.
Additional information
Can someone help me solve this issue quickly, as this is failing our jobs in production?