dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

dagster-daemon Exception: Stopping dagster-daemon ... threads are no longer sending heartbeats: ['SENSOR', 'BACKFILL'] #4021

Open WesRoach opened 3 years ago

WesRoach commented 3 years ago

Summary

Launching dagster-daemon as follows on Linux in the same directory as my workspace.yaml:

nohup dagster-daemon run &

The daemon appears to be shutting itself down. In at least one case it died after 2 minutes; in others it stayed alive for days before producing this traceback:

Traceback:

```bash
2021-04-09 10:33:01 - dagster-daemon - INFO - instance is configured with the following daemons: ['BackfillDaemon', 'SchedulerDaemon', 'SensorDaemon']
2021-04-09 10:34:56 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-09 10:34:59 - SensorDaemon - INFO - Checking for new runs for sensor: mvp_pipeline_sensor
2021-04-09 10:34:59 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-09 10:35:00 - SensorDaemon - INFO - Sensor returned false for mvp_pipeline_sensor, skipping: No un-run trigger found.
Traceback (most recent call last):
  File "/u02/users/wroach/.conda/envs/hcenv/bin/dagster-daemon", line 11, in <module>
    sys.exit(main())
  File "/u02/users/wroach/.conda/envs/hcenv/lib/python3.7/site-packages/dagster/daemon/cli/__init__.py", line 139, in main
    cli(obj={})  # pylint:disable=E1123
  File "/u02/users/wroach/.conda/envs/hcenv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/u02/users/wroach/.conda/envs/hcenv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/u02/users/wroach/.conda/envs/hcenv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/u02/users/wroach/.conda/envs/hcenv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/u02/users/wroach/.conda/envs/hcenv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/u02/users/wroach/.conda/envs/hcenv/lib/python3.7/site-packages/dagster/daemon/cli/__init__.py", line 51, in run_command
    controller.check_daemons()
  File "/u02/users/wroach/.conda/envs/hcenv/lib/python3.7/site-packages/dagster/daemon/controller.py", line 108, in check_daemons
    failed_daemons=failed_daemons
Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SENSOR', 'BACKFILL']
```

Additional Info about Your Environment

Using a Conda Environment including these Dagster packages:

dagster                   0.11.2           py37h89c1867_1    conda-forge
dagster-graphql           0.11.2             pyhd8ed1ab_1    conda-forge
dagster-pyspark           0.11.2             pyhd8ed1ab_1    conda-forge
dagster-spark             0.11.2             pyhd8ed1ab_1    conda-forge

Message from the maintainers:

Impacted by this bug? Give it a 👍. We factor engagement into prioritization.

gibsondan commented 3 years ago

Hi @WesRoach - we actually added some more logging around this in the 0.11.4 release that went out yesterday. If upgrading is an option I'd be curious what some updated logs look like (in particular, this would tell us if this happened because a thread started hanging, or because the daemon process stopped writing heartbeats to the database for some reason).

Is it possible that you're running into some kind of resource limit on your machine that might be causing threads to shut down or hang? In the absence of some kind of error in the threads that are no longer running, that would be my first guess.

While we should definitely figure out the root cause here, you may also want to set up your daemon process so that it automatically restarts when it stops (if you're running locally, something like supervisord can help with this) - it's designed to be able to catch up where it left off if it stops for any reason.
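
For reference, a minimal supervisord program entry along these lines would restart the daemon automatically. This is only a sketch: the file location, paths, and log file names are placeholders, not taken from this issue.

```ini
; Hypothetical supervisord entry, e.g. /etc/supervisor/conf.d/dagster-daemon.conf
[program:dagster-daemon]
; run from the directory that contains workspace.yaml
directory=/path/to/project
command=/path/to/conda/env/bin/dagster-daemon run
autostart=true
; restart the daemon whenever the process exits
autorestart=true
stdout_logfile=/var/log/dagster-daemon.out.log
stderr_logfile=/var/log/dagster-daemon.err.log
```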

WesRoach commented 3 years ago

@gibsondan Thanks for the quick response. I'll have to wait until 0.11.4 lands on conda-forge, but I'll test it out once that version pops up (latest available is 0.11.3).

That's possible - it's a multi-user server. It's not uncommon for CPU utilization to sit close to 100% for hours at a time. However, the traceback I shared was from today while I was monitoring this specific issue, and with near 0% CPU utilization.

supervisord looks interesting - I will look into it.

gibsondan commented 3 years ago

Got it - the other possibility is that for some reason the heartbeat database writes aren't happening (every 30 seconds or so each thread should be writing to your storage and saying they're still running). I'd expect there to be some kind of error in the logs if that was the problem though.
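
To illustrate the mechanism described above (a minimal sketch of the general pattern, not Dagster's actual code): each daemon thread records a heartbeat after every iteration of its work loop, and the controller raises if any heartbeat becomes too stale - which is exactly what a hung iteration would produce.

```python
# Minimal sketch of a heartbeat watchdog, for illustration only --
# this is not Dagster's implementation, just the pattern described above.
import threading
import time

HEARTBEAT_INTERVAL = 30      # each thread writes a heartbeat roughly this often
HEARTBEAT_TOLERANCE = 300    # controller gives up if a heartbeat is older than this

heartbeats = {}              # daemon name -> timestamp of last heartbeat
lock = threading.Lock()

def daemon_thread(name, do_work):
    while True:
        do_work()                        # e.g. evaluate sensors; if this hangs,
        with lock:                       # the heartbeat below is never written
            heartbeats[name] = time.time()
        time.sleep(HEARTBEAT_INTERVAL)

def check_heartbeats(names):
    now = time.time()
    with lock:
        stale = [n for n in names if now - heartbeats.get(n, 0) > HEARTBEAT_TOLERANCE]
    if stale:
        raise Exception(
            "Stopping daemon process since the following threads are "
            f"no longer sending heartbeats: {stale}"
        )
```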

NakulK48 commented 3 years ago

I'm getting this issue as well, but with just SENSOR and consistently about four minutes after starting the daemon process.

It's version 0.11.4 but I don't see any extra logs.

2021-04-16 08:27:01 - dagster-daemon - INFO - instance is configured with the following daemons: ['BackfillDaemon', 'SchedulerDaemon', 'SensorDaemon']
2021-04-16 08:27:01 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:27:01 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:27:02 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
Polling.
2021-04-16 08:27:02 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping
2021-04-16 08:27:16 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
Polling.
2021-04-16 08:27:16 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping
2021-04-16 08:27:26 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
Polling.
2021-04-16 08:27:26 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping
2021-04-16 08:27:31 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:27:31 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:27:36 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
Polling.
2021-04-16 08:27:36 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping
2021-04-16 08:27:46 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
Polling.
2021-04-16 08:27:46 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping
2021-04-16 08:27:56 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
Polling.
2021-04-16 08:27:56 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping
2021-04-16 08:28:01 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:28:01 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:28:07 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
2021-04-16 08:28:31 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:28:31 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:29:01 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:29:01 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:29:31 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:29:31 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:30:01 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:30:01 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:30:31 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:30:31 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:31:01 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:31:01 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
Polling.
2021-04-16 08:31:09 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping
Traceback (most recent call last):
  File "/home/nkhanna/miniconda/envs/redpanda/bin/dagster-daemon", line 8, in <module>
    sys.exit(main())
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/cli/__init__.py", line 126, in main
    cli(obj={})  # pylint:disable=E1123
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/cli/__init__.py", line 39, in run_command
    controller.check_daemon_loop()
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/controller.py", line 149, in check_daemon_loop
    self.check_daemon_heartbeats()
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/controller.py", line 130, in check_daemon_heartbeats
    raise Exception(
Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SENSOR']

Note how the sensor check at 08:28:07 appears to hang - a few minutes later the daemon times out and crashes.

NakulK48 commented 3 years ago

Interestingly: if I explicitly yield a SkipReason, this happens:

2021-04-16 08:46:07 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
2021-04-16 08:46:32 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:46:32 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:47:02 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:47:02 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:47:32 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:47:32 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:48:02 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:48:02 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:48:32 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:48:32 - BackfillDaemon - INFO - No backfill jobs requested.
Polling.
2021-04-16 08:49:10 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping: No messages received from Kafka.
Traceback (most recent call last):
  File "/home/nkhanna/miniconda/envs/redpanda/bin/dagster-daemon", line 8, in <module>
    sys.exit(main())
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/cli/__init__.py", line 126, in main
    cli(obj={})  # pylint:disable=E1123
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/cli/__init__.py", line 39, in run_command
    controller.check_daemon_loop()
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/controller.py", line 149, in check_daemon_loop
    self.check_daemon_heartbeats()
  File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/controller.py", line 130, in check_daemon_heartbeats
    raise Exception(
Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SENSOR']

(namely: it yields that SkipReason immediately before crashing)

gibsondan commented 3 years ago

Hi @NakulK48 - would it be possible to share the code of foo_kafka_sensor? It seems like something is causing the thread to hang.

NakulK48 commented 3 years ago
# imports inferred from the snippet; KafkaConsumer assumed to be kafka-python's
from pathlib import Path

import yaml
from dagster import RunRequest, SkipReason, sensor
from kafka import KafkaConsumer

consumer = KafkaConsumer("nakul1", group_id="group-nk")

PARENT_DIR = Path(__file__).parent

@sensor("foo_pipeline", minimum_interval_seconds=10)
def foo_kafka_sensor(context):

    with open(PARENT_DIR / "pipeline.yaml") as yaml_file:
        base_config = yaml.safe_load(yaml_file)
    print_word_inputs = base_config["solids"]["print_word"]["inputs"]

    poll_result = consumer.poll()
    print("Polling.")
    messages = next(iter(poll_result.values()), [])

    if not messages:
        yield SkipReason("No messages received from Kafka.")

    for message in messages:
        inner_message = message.value.decode("utf8")
        print_word_inputs["word"] = {"value": inner_message}
        request = RunRequest(run_key=inner_message, run_config=base_config)
        yield request

What's odd is that it works fine the first 5-6 times and then stops.

[n.b. if there's a better way of triggering from Kafka do let me know!]

gibsondan commented 3 years ago

Hm, is it possible that consumer.poll() is hanging for some reason? If you add a log before the poll() as well, it could give a clue about where it's getting stuck.

If that's the case, the daemon clearly needs to have a timeout on the sensor call (rather than crashing whenever an individual sensor takes too long to execute)
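
As a concrete sketch of both suggestions - assuming the kafka-python client, whose KafkaConsumer.poll() accepts a timeout_ms argument - the sensor could log before polling and bound the poll so it cannot block indefinitely (the run_config plumbing from the original sensor is omitted here):

```python
# Sketch combining both suggestions; kafka-python client assumed,
# run_config handling from the original sensor omitted for brevity.
from dagster import RunRequest, SkipReason, sensor
from kafka import KafkaConsumer

consumer = KafkaConsumer("nakul1", group_id="group-nk")

@sensor("foo_pipeline", minimum_interval_seconds=10)
def foo_kafka_sensor(context):
    print("About to poll Kafka.")                   # log *before* the call that may hang
    poll_result = consumer.poll(timeout_ms=5000)    # bound the poll so it cannot block forever
    print("Poll finished.")
    messages = next(iter(poll_result.values()), [])
    if not messages:
        yield SkipReason("No messages received from Kafka.")
        return
    for message in messages:
        yield RunRequest(run_key=message.value.decode("utf8"))
```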

gibsondan commented 3 years ago

I was able to reproduce this behavior with a sensor that hangs/runs in an infinite loop but never finishes. We'll add some better handling for that case in the daemon, but ultimately the hanging behavior is likely in the sensor code itself.

NakulK48 commented 3 years ago

Yes, it does look like the Kafka poll itself is what hangs... I'll look into that, but yup, it would be good if Dagster explicitly logged that sensor execution had timed out.

NakulK48 commented 3 years ago

Also - context.log doesn't seem to exist on SensorExecutionContext - that might be handy to have available.
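
Until that exists, one possible stopgap (a sketch using only the standard library, not a Dagster API) is an ordinary Python logger in the sensor module; its records should end up in the daemon process's console output, the same place the print calls above appear:

```python
import logging

from dagster import SkipReason, sensor

# Stdlib logging as a stand-in for the missing context.log; with basicConfig,
# records go to the daemon process's stderr alongside its own log lines.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("foo_kafka_sensor")

@sensor("foo_pipeline", minimum_interval_seconds=10)
def foo_kafka_sensor(context):
    logger.info("Checking Kafka for new messages")
    yield SkipReason("Logging-only sketch; real polling logic goes here.")
```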

likethebourbon commented 3 years ago

I am having the same issue, except I am not using any sensors. It appears that SensorDaemon stops writing heartbeats, and dagster-daemon crashes about 4 minutes later. I have looked through the logs and don't see anything about the SensorDaemon.

I am using Dagster 0.11.6, and running everything locally on my MacBook Pro 16".

Here's a recent traceback:

```
2021-04-27 13:54:10 - SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2021-04-27 13:54:15 - SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2021-04-27 13:54:20 - SensorDaemon - INFO - Not checking for any runs since no sensors have been started.
2021-04-27 13:58:50 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-27 13:58:51 - SchedulerDaemon - INFO - Checking for new runs for the following schedules: email_trent_schedule, etl_schedule, email_profs_schedule
2021-04-27 13:58:53 - SchedulerDaemon - INFO - No new runs for email_trent_schedule
Traceback (most recent call last):
  File "/Users/ew648/Documents/ml_projects/qr_codes/.venv/bin/dagster-daemon", line 8, in <module>
    sys.exit(main())
  File "/Users/ew648/Documents/ml_projects/qr_codes/.venv/lib/python3.8/site-packages/dagster/daemon/cli/__init__.py", line 126, in main
    cli(obj={})  # pylint:disable=E1123
  File "/Users/ew648/Documents/ml_projects/qr_codes/.venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/ew648/Documents/ml_projects/qr_codes/.venv/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/ew648/Documents/ml_projects/qr_codes/.venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/ew648/Documents/ml_projects/qr_codes/.venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/ew648/Documents/ml_projects/qr_codes/.venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/ew648/Documents/ml_projects/qr_codes/.venv/lib/python3.8/site-packages/dagster/daemon/cli/__init__.py", line 39, in run_command
    controller.check_daemon_loop()
  File "/Users/ew648/Documents/ml_projects/qr_codes/.venv/lib/python3.8/site-packages/dagster/daemon/controller.py", line 192, in check_daemon_loop
    self.check_daemon_heartbeats()
  File "/Users/ew648/Documents/ml_projects/qr_codes/.venv/lib/python3.8/site-packages/dagster/daemon/controller.py", line 173, in check_daemon_heartbeats
    raise Exception(
Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SENSOR', 'SCHEDULER']
```

mboogerd commented 2 years ago

I'm experiencing the same issue here on a MacBook Pro 16" (macOS 11.4, 2.3 GHz 8-core i9, 32 GB RAM) with Dagster 0.12.13. From the timing of the last successful checks for new runs, I suspect this may have to do with macOS going into hibernation. I will monitor for a bit. @likethebourbon, could this be the case for you as well?

likethebourbon commented 2 years ago

Yeah, macOS hibernating was the problem for me. I ended up running Dagster through supervisord as recommended above, which was good enough for my needs.