databrickslabs / cicd-templates

Manage your Databricks deployments and CI with code.

dbx launch --trace gets stuck for an hour #52

Closed chinwobble closed 3 years ago

chinwobble commented 3 years ago

We are trying the Azure DevOps flavour of the CI/CD pipeline.

When we get to the stage that runs the integration tests, it hangs for over an hour:

- script: |
    dbx launch --job=cicd_demo_1-sample-integration-test --trace
  displayName: 'Launch integration on test'

Actual behaviour: The message below is repeated for over an hour until the Azure pipeline times out (the default timeout is 1 hour):

Skipping this run because the limit of 1 maximum concurrent runs has been reached.

Expected behaviour: If there is already a concurrent run of the integration test job, then dbx launch --trace should either fail immediately or retry within a configurable window (e.g. retry every minute for 5 minutes) and then exit with an error.
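The requested retry behavior could be sketched like this. This is a hypothetical wrapper, not part of dbx; `launch_once` stands in for a single launch attempt that reports whether the run was accepted or skipped due to the concurrency limit.

```python
# Hypothetical sketch of the requested behavior: retry on a schedule
# (e.g. every minute for 5 minutes), then exit with an error.
# `launch_once` is an illustrative callable, NOT a real dbx API.
import time

def launch_with_retry(launch_once, retries=5, interval=60.0, sleep=time.sleep):
    """Return True if a launch attempt succeeds within `retries` tries.

    `launch_once` should return True on success and False when the run is
    skipped because the concurrency limit has been reached.
    """
    for _ in range(retries):
        if launch_once():
            return True
        sleep(interval)  # wait before the next attempt
    raise RuntimeError("run still skipped after all retries")
```

A launch that succeeds within the window returns True; otherwise the wrapper raises, so the CI step fails fast instead of polling until the pipeline timeout.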

renardeinside commented 3 years ago

Hi @chinwobble .

There are 3 behaviors when --trace is enabled:

 --existing-runs [pass|wait|cancel]  Strategy to handle existing active job runs (default: pass).

The pass option (the default) will create a new run and trace its status. The wait option will wait for the current run to finish and only launch afterwards, and the cancel option will cancel the current run and start a new one.

In your case, it seems that you're using the default pass option, and in your job config you have either set max_concurrent_runs to 1 or not specified it at all, so it falls back to the default (also 1). To fix this issue, simply increase the allowed number of concurrent runs in the job config:

"max_concurrent_runs ": 10

However, I agree that a more flexible wait/retry schedule should be provided in the launch command.

chinwobble commented 3 years ago

Thanks for the detailed reply. As you described we are using the following flags:

dbx launch --job=my_job --trace --existing-runs pass

Currently the sequence of events is this:

  1. An existing job run is active and "max_concurrent_runs" is set to 1.
  2. We run dbx launch --job=my_job --trace --existing-runs pass to create a new run.
  3. The Databricks API accepts the request but immediately skips the run to prevent unwanted concurrency.
  4. dbx launch with --trace then continuously reports that the run has been skipped.

During step 4, there is nothing to trace, since no amount of waiting will change the status of that run. Skipped is a terminal state in the run's state machine, so the dbx utility should exit immediately when it sees the skipped run status.
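The fix amounts to treating skipped as terminal in the polling loop. A minimal sketch (not dbx's actual implementation); the state names follow the Databricks Runs API life_cycle_state values, and `get_life_cycle_state` stands in for a `runs/get` call:

```python
# Sketch of a trace loop that stops on any terminal life-cycle state,
# including SKIPPED, instead of polling until the CI timeout.
import time

# life_cycle_state values from the Databricks Runs API that cannot change
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}

def trace_run(get_life_cycle_state, poll_interval=5.0, sleep=time.sleep):
    """Poll until the run reaches a terminal state and return that state.

    `get_life_cycle_state` is an illustrative callable standing in for
    GET /api/2.0/jobs/runs/get, returning state.life_cycle_state.
    """
    while True:
        state = get_life_cycle_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_interval)
```

With this shape, a run skipped for concurrency reasons makes the loop return immediately, and the caller can map SKIPPED to a non-zero exit code.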

renardeinside commented 3 years ago

Thanks for the detailed explanation, now I understand the problem. It seems we need to add some additional checks to the --trace behavior.

renardeinside commented 3 years ago

Hi @chinwobble, please use the dbx package from the latest release.