dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Add a way to run each individual daemon in its own pod when using Dagster in production #9623

Open gibsondan opened 2 years ago

gibsondan commented 2 years ago

What's the use case?

Better isolation between schedules/sensors/backfills/etc.

Ideas of implementation

Add a flag to the Helm chart that spins up multiple pods for the daemon (one for schedulers, one for sensors, etc.)
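
A rough sketch of what that flag could look like in values.yaml (purely hypothetical keys, nothing like this exists in the chart today):

```yaml
# Hypothetical values.yaml sketch - these keys do not exist in the Helm chart yet;
# they only illustrate the proposed per-daemon pod split.
dagsterDaemon:
  splitDaemons:
    enabled: true          # when true, render one Deployment/pod per daemon type
    daemons:
      - scheduler
      - sensor
      - backfill
      - queued_run_coordinator
```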

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

sryza commented 2 years ago

If the goal is performance, I wonder whether there's a better way to split up daemons - e.g. a daemon per code location or something.

b4dboi commented 1 year ago

Hi, this is actually something I see as an imperfection in the current Dagster deployment schema. If I understood the architecture correctly, and my tests confirm it, the dagster-daemon is a potential single point of failure. The single instance can simply be stopped, blocked by a defective sensor, affected by network issues, and so on. I tried to overcome this by scaling the daemon deployment, but when I run multiple instances of the daemon, I get duplicate sensor runs. I can imagine that isolation at the user-code level (dedicated, isolated pods) plus multiple daemon instances (which would need synchronization) would make the solution both more robust and more performant. Any progress on this topic?

b4dboi commented 1 year ago

Ok, I ran some more tests to decode the execution model of sensors. I have to correct myself: the sensor is actually executed in the user-code container, which can be scaled. However, a deployment with a single dagster-daemon still doesn't seem to be a robust solution.

b4dboi commented 1 year ago

> If the goal is performance, I wonder whether there's a better way to split up daemons - e.g. a daemon per code location or something.

I don't know the details of the daemon execution model, but my stress tests show that it badly affects scalability and overall performance. With just dozens of sensors, the real evaluation interval grows from 5 seconds to minutes. Using the QueuedRunCoordinator to protect the cluster from an excessive number of jobs is, in practice, limited to roughly 60 submitted jobs per minute. I would definitely vote for splitting the daemon into several pieces, most likely in the way mentioned above - e.g. a daemon per code location. The execution of schedule/sensor code itself actually works smoothly, since it can simply be scaled as a user-code deployment accessible via gRPC and a Kubernetes service. This could be the way to scale the daemon as well.
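
As a side note, there seem to be some existing threading knobs in dagster.yaml that mitigate these symptoms without any architectural split. A minimal sketch, assuming a reasonably recent Dagster version (exact field names and module paths may differ between releases):

```yaml
# dagster.yaml (sketch) - evaluate sensor/schedule ticks in a thread pool instead
# of strictly sequentially, and dequeue runs with multiple workers.
sensors:
  use_threads: true
  num_workers: 8

schedules:
  use_threads: true
  num_workers: 8

run_coordinator:
  module: dagster.core.run_coordinator   # path may vary by Dagster version
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 25
    dequeue_interval_seconds: 5
    dequeue_use_threads: true
    dequeue_num_workers: 4
```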

I did some research and collected all the performance-related issues I was able to identify here: https://dagster.slack.com/archives/C01U954MEER/p1670853819921249?thread_ts=1669205719.997049&cid=C01U954MEER

b4dboi commented 1 year ago

I like the way Prefect splits its execution model into work queues and agents. Different work queues are filled (even pre-filled) with tasks, and agents (which are highly scalable) simply pick up a task and execute it. I know there is already a queue in Dagster, but that one is used as a buffer before a run request is executed. I can imagine a setup where work queues are filled by the daemon (minimal blocking) for sensors, schedules, etc., and agents take care of the tasks - call the sensor gRPC API, call the scheduler gRPC API, and so on.

However, this might be over-engineering; simply using an asynchronous approach could help as well :)
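
To make the idea a bit more concrete, here is a purely hypothetical sketch of how such a split could be exposed in the Helm values (none of these keys exist today; the producers fill per-type queues with ticks and the agents call the user-code gRPC servers):

```yaml
# Purely hypothetical values.yaml sketch - illustrating the queue/agent idea only,
# not an existing or planned Dagster feature.
dagsterDaemon:
  tickProducers:
    replicas: 1            # enqueue sensor/schedule ticks; minimal blocking work
  tickAgents:
    replicas: 4            # horizontally scalable; pop a tick and call the code server's gRPC API
    daemonTypes:
      - sensor
      - schedule
      - backfill
```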