getsentry / airflow-metrics

Metrics for airflow
Apache License 2.0

Scheduler crashing #44

Open maxcountryman opened 5 years ago

maxcountryman commented 5 years ago

It seems like our scheduler has been crashing since we added this plugin.

Here's a truncated traceback:

...
2019-09-05T15:29:42.109934+00:00 app[scheduler.1]: File "/app/plugins/airflow_metrics/airflow_metrics/patch_thread.py", line 56, in wrapped
2019-09-05T15:29:42.109935+00:00 app[scheduler.1]: func(utc.localize(since))
2019-09-05T15:29:42.109936+00:00 app[scheduler.1]: File "/usr/local/lib/python3.6/site-packages/airflow/utils/db.py", line 72, in wrapper
2019-09-05T15:29:42.109937+00:00 app[scheduler.1]: with create_session() as session:
2019-09-05T15:29:42.109938+00:00 app[scheduler.1]: File "/usr/local/lib/python3.6/contextlib.py", line 81, in __enter__
2019-09-05T15:29:42.109939+00:00 app[scheduler.1]: return next(self.gen)
2019-09-05T15:29:42.109940+00:00 app[scheduler.1]: File "/usr/local/lib/python3.6/site-packages/airflow/utils/db.py", line 42, in create_session
2019-09-05T15:29:42.109941+00:00 app[scheduler.1]: session = settings.Session()
2019-09-05T15:29:42.109942+00:00 app[scheduler.1]: TypeError: 'NoneType' object is not callable
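
For context on the final TypeError: in Airflow 1.10.x, airflow.settings.Session starts out as None and is only replaced with a SQLAlchemy sessionmaker once settings.configure_orm() has run, so anything that calls the patched code before the ORM is configured (or after that attribute has been reset) hits exactly this error. A minimal sketch of the failure mode, using simplified stand-ins rather than Airflow's real modules:

# Minimal sketch of the TypeError above; "settings" here is a stand-in for
# airflow.settings, and create_session mirrors airflow.utils.db.create_session
# as shown in the traceback.
from contextlib import contextmanager

class settings:
    Session = None  # configure_orm() would normally replace this with a sessionmaker

@contextmanager
def create_session():
    session = settings.Session()  # calling None -> TypeError: 'NoneType' object is not callable
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

with create_session() as session:  # reproduces the crash from the traceback
    pass

The open question is why settings.Session is None by the time the wrapper in patch_thread.py runs.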
mikeclarke commented 5 years ago

@maxcountryman Hm, we haven't seen this. Which version of Airflow are you running? I assume the problem goes away when airflow_metrics isn't on the plugin path?

maxcountryman commented 5 years ago

We're using the latest Airflow: 1.10.5. And yes, removing the plugin seems to have stopped the scheduler crashes.

Zylphrex commented 5 years ago

That's peculiar. The only line that touches settings is here: https://github.com/getsentry/airflow-metrics/blob/2f56cc6d0eb125a113f6511153d439f40ddbc50b/airflow_metrics/airflow_metrics/patch_stats.py#L34 and that shouldn't cause settings.Session to become None.

Would you mind posting a more verbose stack trace to hopefully shed some more light on this issue?

Edit: I also found this issue, https://issues.apache.org/jira/browse/AIRFLOW-2363, and in that case it seems like a different error caused settings.Session to become None.
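
For reference, the line linked above presumably just swaps airflow.settings.Stats for a Datadog-backed stats client, which is separate state from settings.Session. A rough sketch of that kind of patch, with a hypothetical DatadogStatsLogger stand-in rather than the plugin's actual class names:

# Rough sketch, assuming patch_stats.py replaces airflow.settings.Stats with a
# Datadog-backed logger; the class below is an illustrative stand-in, not the
# plugin's real code.
from airflow import settings

class DatadogStatsLogger:
    def incr(self, stat, count=1, rate=1, tags=None):
        print("datadog incr:", stat, count, rate, tags)

    def gauge(self, stat, value, rate=1, delta=False, tags=None):
        print("datadog gauge:", stat, value, rate, delta, tags)

# Only the Stats attribute is reassigned; settings.Session is never touched here,
# which is why this patch alone shouldn't explain the NoneType error above.
settings.Stats = DatadogStatsLogger()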

maxcountryman commented 5 years ago

Unfortunately that's the only traceback I have, and now that we've removed the plugin the scheduler isn't crashing.

maxcountryman commented 4 years ago

We tried adding this again and consistently saw it reintroduce the scheduler crashes. This is a complete nonstarter for us, so it would be great if someone could look into this more.

Also, frustratingly, there are no logs coming out of the scheduler process; it just dies.

Zylphrex commented 4 years ago

If there are no logs, would you mind providing some more details about how your Airflow is set up? If you can reproduce the crash in an environment I can run, that'd be even better. As it stands, there's no way for me to know what the problem is.

grohan2002 commented 4 years ago

Hello team, we are facing a similar issue. We deployed Airflow on a Kubernetes cluster and added this plugin, after which the Airflow scheduler stopped working.

grohan2002 commented 4 years ago

<TaskInstance: POSTGRESQL_S3_DATA_SYNC_DAG.postgresql_s3_data_sync_task 2020-01-07 01:10:00+00:00 [scheduled]>
[2020-01-08 14:51:55,438] {{scheduler_job.py:1112}} INFO - Setting the following 1 tasks to queued state: <TaskInstance: POSTGRESQL_S3_DATA_SYNC_DAG.postgresql_s3_data_sync_task 2020-01-07 01:10:00+00:00 [queued]>
[2020-01-08 14:51:55,438] {{scheduler_job.py:1148}} INFO - Sending ('POSTGRESQL_S3_DATA_SYNC_DAG', 'postgresql_s3_data_sync_task', datetime.datetime(2020, 1, 7, 1, 10, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>), 1) to executor with priority 1 and queue default
[2020-01-08 14:51:55,438] {{base_executor.py:59}} INFO - Adding to queue: ['airflow', 'run', 'POSTGRESQL_S3_DATA_SYNC_DAG', 'postgresql_s3_data_sync_task', '2020-01-07T01:10:00+00:00', '--local', '--pool', 'default_pool', '-sd', '/usr/local/airflow/dags/data_pipeline_dags.zip']
[2020-01-08 14:51:55,656] {{scheduler_job.py:1288}} INFO - Executor reports execution of POSTGRESQL_S3_DATA_SYNC_DAG.postgresql_s3_data_sync_task execution_date=2020-01-07 01:05:00+00:00 exited with status success for try_number 1
[2020-01-08 14:51:56,195] {{patch_requests.py:33}} WARNING - Found blacklisted domain: api.datadoghq.com
[2020-01-08 14:51:56,199] {{api_client.py:166}} INFO - 202 POST https://api.datadoghq.com/api/v1/series (6.3708ms)
[2020-01-08 14:51:56,428] {{datadog_logger.py:41}} INFO - datadog gauge: task.state 1 1 False {'state': 'queued'}
[2020-01-08 14:51:56,428] {{datadog_logger.py:41}} INFO - datadog gauge: task.state 2 1 False {'state': 'running'}
[2020-01-08 14:51:56,429] {{datadog_logger.py:41}} INFO - datadog gauge: task.state 14 1 False {'state': 'success'}
[2020-01-08 14:51:57,531] {{datadog_logger.py:31}} INFO - datadog incr: scheduler_heartbeat 1 1 None
[2020-01-08 14:52:03,559] {{datadog_logger.py:31}} INFO - datadog incr: scheduler_heartbeat 1 1 None
[2020-01-08 14:52:06,203] {{patch_requests.py:33}} WARNING - Found blacklisted domain: api.datadoghq.com
[2020-01-08 14:52:06,210] {{api_client.py:166}} INFO - 202 POST https://api.datadoghq.com/api/v1/series (10.215ms)
[2020-01-08 14:52:06,446] {{datadog_logger.py:41}} INFO - datadog gauge: task.state 3 1 False {'state': 'running'}
[2020-01-08 14:52:06,446] {{datadog_logger.py:41}} INFO - datadog gauge: task.state 14 1 False {'state': 'success'}
[2020-01-08 14:52:09,575] {{datadog_logger.py:31}} INFO - datadog incr: scheduler_heartbeat 1 1 None
[2020-01-08 14:52:15,560] {{datadog_logger.py:31}} INFO - datadog incr: scheduler_heartbeat 1 1 None
[2020-01-08 14:52:16,213] {{patch_requests.py:33}} WARNING - Found blacklisted domain: api.datadoghq.com

maxcountryman commented 4 years ago

@mikeclarke @Zylphrex it would be great if someone could take a look; it seems like others are having similar issues.