dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

K8sRunLauncher and QueuedRunCoordinator don't play well together #4311

Open Oliver-Sellwood opened 3 years ago

Oliver-Sellwood commented 3 years ago

Summary

From observation, each call to the Kubernetes API to launch a run takes more than a second. Since the run coordinator launches these runs in a simple for loop, this imposes an artificial limit on the number of jobs that can be started in a given period of time.
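
To make the arithmetic concrete, here is a minimal sketch (not the actual daemon code; launch_run is a hypothetical stand-in for the K8sRunLauncher call) of why a sequential launch loop caps throughput: at roughly 1 s per Kubernetes Job creation, 300 queued runs take on the order of 300 s to clear, regardless of max_concurrent_runs.

    import time

    def launch_run(run_id: str) -> None:
        """Hypothetical stand-in for the K8sRunLauncher call that creates a Kubernetes Job."""
        time.sleep(1.0)  # simulate the observed >1 s Kubernetes API round trip

    queued_runs = [f"run-{i}" for i in range(300)]

    start = time.monotonic()
    # The coordinator dequeues runs in a simple for loop, so each launch
    # blocks the next one even though max_concurrent_runs is 300.
    for run_id in queued_runs:
        launch_run(run_id)
    print(f"Cleared backlog in {time.monotonic() - start:.0f}s")  # ~300 s, i.e. ~5 minutes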

Reproduction

  1. Include these excerpts in your dagster-values.yml

    dagsterDaemon:
      enabled: true
      queuedRunCoordinator:
        enabled: true
        config:
          max_concurrent_runs: 300

    and

    runLauncher:
      type: K8sRunLauncher
      config:
        k8sRunLauncher:
          jobNamespace: ~
          loadInclusterConfig: true
          kubeconfigFile: ~
          envConfigMaps: []
          envSecrets: []
  2. Schedule 300 runs of a noop pipeline (a sketch of such a pipeline and the submission loop follows this list)
  3. Notice that it takes ~5 minutes to clear the backlog even though max_concurrent_runs allows all 300 runs to execute at once
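
For step 2, here is a sketch of what the noop pipeline and the 300 submissions could look like. It uses the current @op/@job API and the DagsterGraphQLClient; the host name and port are placeholders for whatever your Dagit/webserver service exposes, and the original report predates this API, so treat it as illustrative only.

    from dagster import job, op
    from dagster_graphql import DagsterGraphQLClient

    @op
    def noop():
        """Do nothing; only the launch/dequeue overhead matters here."""

    @job
    def noop_job():
        noop()

    if __name__ == "__main__":
        # "dagit.dagster.svc.cluster.local" and port 80 are placeholders for
        # your in-cluster Dagit/webserver service.
        client = DagsterGraphQLClient("dagit.dagster.svc.cluster.local", port_number=80)
        for _ in range(300):
            client.submit_job_execution("noop_job")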

Message from the maintainers:

Impacted by this bug? Give it a 👍. We factor engagement into prioritization.

b4dboi commented 1 year ago

Hi, I ran some benchmarks to verify how the QueuedRunCoordinator behaves after #11113 was merged and released in 1.1.7. The setup I used for the first test was the following:

dequeueIntervalSeconds: 1
dequeueUseThreads: true
dequeueNumWorkers: 16
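
Conceptually (a simplified sketch, not the daemon's actual implementation), dequeueUseThreads with dequeueNumWorkers: 16 replaces the sequential launch loop with a pool of workers, so the per-launch Kubernetes latency overlaps instead of accumulating:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def launch_run(run_id: str) -> None:
        time.sleep(1.0)  # simulate the ~1 s Kubernetes Job creation

    queued_runs = [f"run-{i}" for i in range(300)]

    start = time.monotonic()
    # With 16 dequeue workers the launches overlap:
    # ~300 runs * 1 s / 16 workers ≈ 19 s instead of ~300 s sequentially.
    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(launch_run, queued_runs))
    print(f"Cleared backlog in {time.monotonic() - start:.0f}s")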

I varied the number of jobs scheduled concurrently (every minute) and observed the delay between a run being enqueued and being dequeued. As you can see in the following diagram, the delay increases with offered load, as one would expect. The question is whether the increase is acceptable given the relatively low number of enqueued runs; the concurrency limits were not reached during the test. I would expect the increase in delay to become noticeable at hundreds of runs, not dozens.

[Figure: dequeue delay vs. number of scheduled jobs (test_queue_delays_scaling_queue_delay)]
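
For reference, one way the enqueue-to-dequeue delay could be approximated from the instance event log; this is a sketch based on my reading of the DagsterInstance API (all_logs, RUN_ENQUEUED/RUN_DEQUEUED events), not the harness used for the benchmark above.

    from dagster import DagsterEventType, DagsterInstance

    def dequeue_delay_seconds(instance: DagsterInstance, run_id: str) -> float:
        """Seconds between a run's RUN_ENQUEUED and RUN_DEQUEUED events."""
        timestamps = {}
        for entry in instance.all_logs(run_id):
            event = entry.dagster_event
            if event and event.event_type in (
                DagsterEventType.RUN_ENQUEUED,
                DagsterEventType.RUN_DEQUEUED,
            ):
                timestamps[event.event_type] = entry.timestamp
        return timestamps[DagsterEventType.RUN_DEQUEUED] - timestamps[DagsterEventType.RUN_ENQUEUED]

    # Hypothetical usage: average delay across all runs on the instance.
    instance = DagsterInstance.get()
    delays = [dequeue_delay_seconds(instance, run.run_id) for run in instance.get_runs()]
    print(f"mean dequeue delay: {sum(delays) / len(delays):.2f}s")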

I also ran the test with a constant number of scheduled runs (20) while varying the number of dequeue workers. As the diagram below shows, the overall impact of the number of workers (at least at this level of concurrency) is negligible.

[Figure: dequeue delay vs. number of dequeue workers, 20 jobs scheduled per minute (test_queue_delays_scaling_20_jobs_1min_queue_delay_workers)]