Open WesRoach opened 3 years ago
Hi @WesRoach - we actually added some more logging around this in the 0.11.4 release that went out yesterday. If upgrading is an option I'd be curious what some updated logs look like (in particular, this would tell us if this happened because a thread started hanging, or because the daemon process stopped writing heartbeats to the database for some reason).
Is it possible that you're running into some kind of resource limit on your machine that might be causing threads to shut down or hang? In the absence of some kind of error in the threads that are no longer running, that would be my first guess.
While we should definitely figure out the root cause here, you may also went to set up your daemon process so that it automatically restarts when it stops (if you're running locally, something like supervisord can help with this) - it's designed to be able to catch up where it left off if it stops for any reason.
@gibsondan Thanks for quick response. I'll have to wait until 0.11.4 lands on conda-forge, but I'll test it out once that version pops up (latest available is 0.11.3).
That's possible - it's a multi-user server. It's not uncommon for CPU utilization to sit close to 100% for hours at a time. However, the traceback I shared was from today while I was monitoring this specific issue, and with near 0% CPU utilization.
supervisord looks interesting - I will look into it.
Got it - the other possibility is that for some reason the heartbeat database writes aren't happening (every 30 seconds or so each thread should be writing to your storage and saying they're still running). I'd expect there to be some kind of error in the logs if that was the problem though.
I'm getting this issue as well, but with just SENSOR
and consistently about four minutes after starting the daemon process.
It's version 0.11.4
but I don't see any extra logs.
2021-04-16 08:27:01 - dagster-daemon - INFO - instance is configured with the following daemons: ['BackfillDaemon', 'SchedulerDaemon', 'SensorDaemon']
2021-04-16 08:27:01 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:27:01 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:27:02 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
Polling.
2021-04-16 08:27:02 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping
2021-04-16 08:27:16 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
Polling.
2021-04-16 08:27:16 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping
2021-04-16 08:27:26 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
Polling.
2021-04-16 08:27:26 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping
2021-04-16 08:27:31 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:27:31 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:27:36 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
Polling.
2021-04-16 08:27:36 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping
2021-04-16 08:27:46 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
Polling.
2021-04-16 08:27:46 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping
2021-04-16 08:27:56 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
Polling.
2021-04-16 08:27:56 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping
2021-04-16 08:28:01 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:28:01 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:28:07 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
2021-04-16 08:28:31 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:28:31 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:29:01 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:29:01 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:29:31 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:29:31 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:30:01 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:30:01 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:30:31 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:30:31 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:31:01 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:31:01 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
Polling.
2021-04-16 08:31:09 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping
Traceback (most recent call last):
File "/home/nkhanna/miniconda/envs/redpanda/bin/dagster-daemon", line 8, in <module>
sys.exit(main())
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/cli/__init__.py", line 126, in main
cli(obj={}) # pylint:disable=E1123
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/cli/__init__.py", line 39, in run_command
controller.check_daemon_loop()
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/controller.py", line 149, in check_daemon_loop
self.check_daemon_heartbeats()
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/controller.py", line 130, in check_daemon_heartbeats
raise Exception(
Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SENSOR']
Note how the final sensor check at 08:28:07
seems to fail - a few minutes later it times out and crashes.
Interestingly: if I explicitly yield a SkipReason, this happens:
2021-04-16 08:46:07 - SensorDaemon - INFO - Checking for new runs for sensor: foo_kafka_sensor
2021-04-16 08:46:32 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:46:32 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:47:02 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:47:02 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:47:32 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:47:32 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:48:02 - BackfillDaemon - INFO - No backfill jobs requested.
2021-04-16 08:48:02 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:48:32 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started.
2021-04-16 08:48:32 - BackfillDaemon - INFO - No backfill jobs requested.
Polling.
2021-04-16 08:49:10 - SensorDaemon - INFO - Sensor returned false for foo_kafka_sensor, skipping: No messages received from Kafka.
Traceback (most recent call last):
File "/home/nkhanna/miniconda/envs/redpanda/bin/dagster-daemon", line 8, in <module>
sys.exit(main())
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/cli/__init__.py", line 126, in main
cli(obj={}) # pylint:disable=E1123
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/cli/__init__.py", line 39, in run_command
controller.check_daemon_loop()
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/controller.py", line 149, in check_daemon_loop
self.check_daemon_heartbeats()
File "/home/nkhanna/miniconda/envs/redpanda/lib/python3.8/site-packages/dagster/daemon/controller.py", line 130, in check_daemon_heartbeats
raise Exception(
Exception: Stopping dagster-daemon process since the following threads are no longer sending heartbeats: ['SENSOR']
(namely: it yields that skipreason immediately before crashing)
Hi @NakulK48 - would it be possible to share the code of foo_kafka_sensor? It seems like something is causing the thread to hang.
consumer = KafkaConsumer("nakul1", group_id="group-nk")
PARENT_DIR = Path(__file__).parent
@sensor("foo_pipeline", minimum_interval_seconds=10)
def foo_kafka_sensor(context):
with open(PARENT_DIR / "pipeline.yaml") as yaml_file:
base_config = yaml.safe_load(yaml_file)
print_word_inputs = base_config["solids"]["print_word"]["inputs"]
poll_result = consumer.poll()
print("Polling.")
messages = next(iter(poll_result.values()), [])
if not messages:
yield SkipReason("No messages received from Kafka.")
for message in messages:
inner_message = message.value.decode("utf8")
print_word_inputs["word"] = {"value": inner_message}
request = RunRequest(run_key=inner_message, run_config=base_config)
yield request
What's odd is that it works fine the first 5-6 times and then stops.
[n.b. if there's a better way of triggering from Kafka do let me know!]
Hm, is it possible that consumer.poll() is hanging for some reason? If you add a log before the poll() as well, it could give a clue about where it's getting stuck.
If that's the case, the daemon clearly needs to have a timeout on the sensor call (rather than crashing whenever an individual sensor takes too long to execute)
I was able to reproduce this behavior with a sensor that hangs/runs in an infinite loop but never finishes. We'll add some better handling for that case in the daemon, but ultimately the hanging behavior is likely in the sensor code itself.
Yes, it does look like the kafka
poll itself is what hangs... I'll look into that, but yup, it would be good if dagster
explicitly logged that sensor execution had timed out.
Also - context.log
doesn't seem to exist on SensorExecutionContext
- that might be handy to have available.
I am having the same issue, except I am not using any sensors. It appears that SensorDaemon stops writing heartbeats, and dagster-daemon crashes about 4 minutes later. I have looked through the logs and don't see anything about the SensorDaemon.
I am using Dagster 0.11.6, and running everything locally on my MacBook Pro 16". Specs:
I'm experiencing the same issue here on a Macbook Pro 16" (macOS 11.4, 2,3ghz 8 core i9, 32GB ram) with Dagster 0.12.13. From the timing of the last successful checks for new runs, I suspect this may have to do with OSx going to hibernate. I will monitor for a bit. @likethebourbon, could this be the case as well for you?
Yeah, macOS hibernating was the problem for me. I ended up running Dagster through supervisord as recommended above, which was good enough for my needs.
Summary
Launching
dagster-daemon
as follows on Linux in the same directory as myworkspace.yaml
:The daemon appears to be shutting itself down? In at least one edge case it died after 2 minutes, in others it stays alive for days before producing this traceback:
Traceback
```bash 2021-04-09 10:33:01 - dagster-daemon - INFO - instance is configured with the following daemons: ['BackfillDaemon', 'SchedulerDaemon', 'SensorDaemon'] 2021-04-09 10:34:56 - SchedulerDaemon - INFO - Not checking for any runs since no schedules have been started. 2021-04-09 10:34:59 - SensorDaemon - INFO - Checking for new runs for sensor: mvp_pipeline_sensor 2021-04-09 10:34:59 - BackfillDaemon - INFO - No backfill jobs requested. 2021-04-09 10:35:00 - SensorDaemon - INFO - Sensor returned false for mvp_pipeline_sensor, skipping: No un-run trigger found. Traceback (most recent call last): File "/u02/users/wroach/.conda/envs/hcenv/bin/dagster-daemon", line 11, inAdditional Info about Your Environment
Using a Conda Environment including these Dagster packages:
Message from the maintainers:
Impacted by this bug? Give it a 👍. We factor engagement into prioritization.