bitnami / charts

Bitnami Helm Charts
https://bitnami.com

psycopg2.OperationalError: could not translate host name "airflow-release-postgresql" to address: Temporary failure in name resolution #10890

Closed Jagadeepkosuri closed 2 years ago

Jagadeepkosuri commented 2 years ago

Name and Version

bitnami/airflow 10.3.1

What steps will reproduce the bug?

Install Airflow along with the PostgreSQL database on Kubernetes (I'm using Azure Kubernetes Service).

Are you using any custom parameters or values?

AIRFLOW__CORE__DAGBAG_IMPORT_ERROR_TRACEBACK_DEPTH: '10' 
AIRFLOW__CORE__ENABLE_XCOM_PICKLING: 'True' 
AIRFLOW__CORE__PARALLELISM: '512' 
AIRFLOW__CORE__DAG_CONCURRENCY: '100'
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG: '1'
AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT: '3000.0'
AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT: '5000'
AIRFLOW__CORE__STORE_DAG_CODE: 'True'
AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE: '30'
AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW: '60'
AIRFLOW__CELERY__WORKER_CONCURRENCY: '16'
AIRFLOW__WEBSERVER__EXPOSE_CONFIG: 'True'
AIRFLOW__WEBSERVER__RELOAD_ON_PLUGIN_CHANGE: 'True'
AIRFLOW__WEBSERVER__WORKERS: '16'
AIRFLOW__WEBSERVER__WORKER_CLASS: 'gevent'
AIRFLOW__WEBSERVER__PAGE_SIZE: '50'
AIRFLOW__WEBSERVER__SHOW_RECENT_STATS_FOR_COMPLETED_RUNS: 'False'
AIRFLOW__SCHEDULER__CATCHUP_BY_DEFAULT: 'False'
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: '60'
AIRFLOW__SCHEDULER__PARSING_PROCESSES: '8'
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL: '100'
AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC: '120'
AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD: '360'
AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE: '1024'

What is the expected behavior?

Airflow DAGs shouldn't fail with PostgreSQL name resolution errors. I assumed that running the Kubernetes Executor with high parallelism was opening too many database connections or overwhelming CoreDNS, so I switched to running 30 Celery worker pods, but I still face the same issue.

What do you see instead?

I see the error below sporadically. Initially, it happened only in Model jobs, i.e., DAGs that trigger other child DAGs. The task in the Model job fails with the error below, but the child DAGs triggered by the failed task run without failing. Now it happens in individual DAGs as well.

[2022-06-27 14:38:30,574] {taskinstance.py:1501} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
    return fn()
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 364, in connect
    return _ConnectionFairy._checkout(self)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
    rec = pool._do_get()
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/sqlalchemy/pool/impl.py", line 241, in _do_get
    return self._create_connection()
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
    return _ConnectionRecord(self)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
    self.__connect(first_connect_check=True)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
    connection = pool._invoke_creator(self)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
    return dialect.connect(*cargs, **cparams)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 508, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/opt/bitnami/airflow/venv/lib/python3.8/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not translate host name "airflow-release-postgresql" to address: Temporary failure in name resolution

Additional information

Can someone help me solve this issue quickly, as this is failing our jobs in production?

carrodher commented 2 years ago

This does not seem to be an issue with the Bitnami Airflow container image or Helm chart, but rather with how the application or environment is being used or configured.

For information regarding the application itself, customization of the content within the application, or questions about the use of technology or infrastructure, we highly recommend checking the forums and user guides made available by the project behind the application or the technology.

That said, we will keep this ticket open until the stale bot closes it just in case someone from the community adds some valuable info.

rc-coderepo commented 2 years ago

@Jagadeepkosuri Check whether your CoreDNS pod (or any other DNS pod in the cluster) is running. If the DNS service is running, try spinning up a busybox pod and pinging other services to test whether they are reachable.
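
Because the error is sporadic, a single ping may well succeed, so resolving the name many times from inside a pod may be more telling. Below is a minimal Python sketch of such a check, using only the standard library. The helper name `probe_dns` is hypothetical, and port 5432 is an assumption (the PostgreSQL default); the host name is the one from the error message.

```python
import socket
import time


def probe_dns(host, attempts=20, delay=0.5, resolver=socket.getaddrinfo):
    """Resolve `host` repeatedly and return (successes, failures).

    `resolver` defaults to socket.getaddrinfo and is a parameter only so
    that the check can be exercised without a real DNS server.
    """
    successes, failures = 0, 0
    for i in range(attempts):
        try:
            # Port 5432 is assumed here (PostgreSQL default); it does not
            # affect name resolution itself.
            resolver(host, 5432)
            successes += 1
        except socket.gaierror as exc:
            # "Temporary failure in name resolution" surfaces as gaierror.
            failures += 1
            print(f"attempt {i + 1}: resolution failed: {exc}")
        if delay and i < attempts - 1:
            time.sleep(delay)
    return successes, failures
```

Calling `probe_dns("airflow-release-postgresql")` from a pod in the same namespace as Airflow returns the number of successful and failed lookups. A non-zero failure count would point toward the cluster DNS rather than toward Airflow or PostgreSQL.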

github-actions[bot] commented 2 years ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] commented 2 years ago

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.