Open Oliver-Sellwood opened 3 years ago
Hi, I ran some benchmarks to verify how the QueuedRunCoordinator behaves after #11113 was merged and released in 1.1.7. The setup I used for the first test was the following:
dequeueIntervalSeconds: 1
dequeueUseThreads: true
dequeueNumWorkers: 16
I varied the number of concurrently scheduled jobs (triggered every minute) and measured the delay between a run being enqueued and being dequeued. As the following diagram shows, the delay increases with the offered load, as one would expect. The question is whether the increase is acceptable given the relatively low number of enqueued runs. The concurrency limits were not reached during the test. I would expect the delay to become noticeable at hundreds of runs, not dozens.
I also ran the test with a constant number of scheduled runs (20) while varying the number of dequeue workers. As the diagram below shows, the number of workers has a negligible impact on the delay, at least at this number of concurrent jobs.
Summary
From these observations it seems that each call to K8s to launch a run takes more than 1 s. Since the launches are executed in a simple for loop, this imposes an artificial limit on the number of jobs that can be started in a given period of time.
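To make the arithmetic behind this limit concrete, here is a rough back-of-the-envelope model. It is not Dagster's code: the function name `launch_delays`, the `parallelism` parameter, and the 1.0 s launch time (taken as a lower bound from the ">1s" observation above) are all assumptions made for illustration.

```python
import math


def launch_delays(num_runs, launch_seconds, parallelism=1):
    """Delay in seconds until each queued run has been launched.

    Hypothetical model: runs are launched in batches of `parallelism`,
    and every batch blocks for `launch_seconds`. With parallelism=1 this
    is a plain sequential for loop over the queue.
    """
    return [
        math.ceil(position / parallelism) * launch_seconds
        for position in range(1, num_runs + 1)
    ]


# 20 queued runs, each launch blocking for an assumed 1.0 s.
sequential = launch_delays(num_runs=20, launch_seconds=1.0)
print(max(sequential))                    # 20.0 (last run waits 20 s)
print(sum(sequential) / len(sequential))  # 10.5 (mean wait)

# For comparison only: the same queue if 16 launches could overlap.
overlapped = launch_delays(num_runs=20, launch_seconds=1.0, parallelism=16)
print(max(overlapped))                    # 2.0
```

Under these assumptions, a sequential loop makes the last of 20 runs wait about 20 s, which would be consistent with delays becoming visible at dozens of runs rather than hundreds.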
Reproduction
Include these excerpts in your
dagster-values.yml
and