dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

K8sRunLauncher and QueuedRunCoordinator don't play well together #4311

Open Oliver-Sellwood opened 3 years ago

Oliver-Sellwood commented 3 years ago

Summary

From observation, each call to the Kubernetes API to launch a run takes more than a second. Since the run coordinator launches these runs in a simple for loop, this imposes an artificial limit on the number of jobs that can be started in a given period of time.
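
To make the arithmetic concrete, here is a minimal sketch (not the actual daemon code; launch_run is a hypothetical stand-in for the K8sRunLauncher call) of why a sequential launch loop caps throughput: at roughly 1 s per Kubernetes Job creation, 300 queued runs take on the order of 300 s to clear, regardless of max_concurrent_runs.

    import time

    def launch_run(run_id: str) -> None:
        """Hypothetical stand-in for the K8sRunLauncher call that creates a Kubernetes Job."""
        time.sleep(1.0)  # simulate the observed >1 s Kubernetes API round trip

    queued_runs = [f"run-{i}" for i in range(300)]

    start = time.monotonic()
    # The coordinator dequeues runs in a simple for loop, so each launch
    # blocks the next one even though max_concurrent_runs is 300.
    for run_id in queued_runs:
        launch_run(run_id)
    print(f"Cleared backlog in {time.monotonic() - start:.0f}s")  # ~300 s, i.e. ~5 minutes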

Reproduction

  1. Include these excerpts in your dagster-values.yml

    dagsterDaemon:
      enabled: true
      queuedRunCoordinator:
        enabled: true
        config:
          max_concurrent_runs: 300

    and

    runLauncher:
      type: K8sRunLauncher
      config:
        k8sRunLauncher:
          jobNamespace: ~
          loadInclusterConfig: true
          kubeconfigFile: ~
          envConfigMaps: []
          envSecrets: []
  2. Schedule 300 runs of a noop pipeline (a sketch of such a pipeline and the submission loop follows this list)
  3. Notice that it takes ~5 minutes to clear the backlog even though max_concurrent_runs allows all 300 runs to execute at once
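
For step 2, here is a sketch of what the noop pipeline and the 300 submissions could look like. It uses the current @op/@job API and the DagsterGraphQLClient; the host name and port are placeholders for whatever your Dagit/webserver service exposes, and the original report predates this API, so treat it as illustrative only.

    from dagster import job, op
    from dagster_graphql import DagsterGraphQLClient

    @op
    def noop():
        """Do nothing; only the launch/dequeue overhead matters here."""

    @job
    def noop_job():
        noop()

    if __name__ == "__main__":
        # "dagit.dagster.svc.cluster.local" and port 80 are placeholders for
        # your in-cluster Dagit/webserver service.
        client = DagsterGraphQLClient("dagit.dagster.svc.cluster.local", port_number=80)
        for _ in range(300):
            client.submit_job_execution("noop_job")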

Message from the maintainers:

Impacted by this bug? Give it a 👍. We factor engagement into prioritization.

b4dboi commented 1 year ago

Hi, I ran some benchmarks to verify how the QueuedRunCoordinator behaves after #11113 was merged and released in 1.1.7. The setup I used for the first test was the following:

dequeueIntervalSeconds: 1
dequeueUseThreads: true
dequeueNumWorkers: 16
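
Conceptually (a simplified sketch, not the daemon's actual implementation), dequeueUseThreads with dequeueNumWorkers: 16 replaces the sequential launch loop with a pool of workers, so the per-launch Kubernetes latency overlaps instead of accumulating:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def launch_run(run_id: str) -> None:
        time.sleep(1.0)  # simulate the ~1 s Kubernetes Job creation

    queued_runs = [f"run-{i}" for i in range(300)]

    start = time.monotonic()
    # With 16 dequeue workers the launches overlap:
    # ~300 runs * 1 s / 16 workers ≈ 19 s instead of ~300 s sequentially.
    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(launch_run, queued_runs))
    print(f"Cleared backlog in {time.monotonic() - start:.0f}s")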

I varied the number of jobs scheduled concurrently (every minute) and observed the delay between a run being enqueued and being dequeued. As you can see in the following diagram, the delay increases with offered load, as one would expect. The question is whether the increase is acceptable given the relatively low number of enqueued runs; the concurrency limits were not reached during the test. I would expect the increase in delay to become noticeable at hundreds of runs, not dozens.

[Figure: dequeue delay vs. number of scheduled jobs (test_queue_delays_scaling_queue_delay)]
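
For reference, one way the enqueue-to-dequeue delay could be approximated from the instance event log; this is a sketch based on my reading of the DagsterInstance API (all_logs, RUN_ENQUEUED/RUN_DEQUEUED events), not the harness used for the benchmark above.

    from dagster import DagsterEventType, DagsterInstance

    def dequeue_delay_seconds(instance: DagsterInstance, run_id: str) -> float:
        """Seconds between a run's RUN_ENQUEUED and RUN_DEQUEUED events."""
        timestamps = {}
        for entry in instance.all_logs(run_id):
            event = entry.dagster_event
            if event and event.event_type in (
                DagsterEventType.RUN_ENQUEUED,
                DagsterEventType.RUN_DEQUEUED,
            ):
                timestamps[event.event_type] = entry.timestamp
        return timestamps[DagsterEventType.RUN_DEQUEUED] - timestamps[DagsterEventType.RUN_ENQUEUED]

    # Hypothetical usage: average delay across all runs on the instance.
    instance = DagsterInstance.get()
    delays = [dequeue_delay_seconds(instance, run.run_id) for run in instance.get_runs()]
    print(f"mean dequeue delay: {sum(delays) / len(delays):.2f}s")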

I also ran the test with a constant number of scheduled runs (20) while varying the number of dequeue workers. As the diagram below shows, the overall impact of the number of workers (at least at this level of concurrency) is negligible.

[Figure: dequeue delay vs. number of dequeue workers, 20 jobs scheduled per minute (test_queue_delays_scaling_20_jobs_1min_queue_delay_workers)]