databricks / cli

Databricks CLI

`mode: production` target appears bugged during GitHub action deployment for v0.222 (`poetry` build) #1542

Open arcaputo3 opened 3 months ago

arcaputo3 commented 3 months ago

Describe the issue

Deploying via DABs through GitHub Actions fails for a production target with CLI v0.222.0.

Configuration

Workflow file:

# This workflow validates, deploys, and runs the specified bundle
# within a production target named "prod".
name: "Prod deployment"

# Trigger this workflow whenever a pull request is pushed to the repo's
# main branch.
on:
  push:
    branches:
      - main

jobs:
  deploy:
    name: "Deploy bundle"
    runs-on: ubuntu-latest

    steps:
      # Check out this repo, so that this workflow can access it.
      - uses: actions/checkout@v4

      # Download the Databricks CLI.
      # See https://github.com/databricks/setup-cli
      - uses: databricks/setup-cli@main

      - name: Set up Python 3.10
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install poetry
          poetry install --all-extras
      # Deploy the bundle to the "prod" target as defined
      # in the bundle's settings file.
      - run: databricks bundle deploy --debug
        working-directory: .
        env:
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
          DATABRICKS_BUNDLE_ENV: prod

Steps to reproduce the behavior

  1. Run databricks bundle deploy -t <prod-target> via a GitHub action using the above workflow file.

Expected Behavior

Deployment should execute properly for any mode.

Actual Behavior

Deployment fails for mode: production and works for mode: development.

OS and CLI version

CLI: v0.222.0; OS: Ubuntu 22.04.4

Running on Windows 11 locally works fine for CLI v0.222.0.

Is this a regression?

Yes, using databricks/setup-cli@v0.221.1 fixes the issue.

Debug Logs

Run databricks bundle deploy --debug
22:25:48  INFO start pid=1662 version=0.222.0 args="databricks, bundle, deploy, --debug"
22:25:48 DEBUG Found bundle root at /home/runner/work/tjc-databricks/tjc-databricks (file /home/runner/work/tjc-databricks/tjc-databricks/databricks.yml) pid=1662
22:25:48 DEBUG Apply pid=1662 mutator=load
22:25:48  INFO Phase: load pid=1662 mutator=load
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=EntryPoint
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=scripts.preinit
22:25:48 DEBUG No script defined for preinit, skipping pid=1662 mutator=load mutator=seq mutator=scripts.preinit
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=ProcessRootIncludes
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=ProcessRootIncludes mutator=seq
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=ProcessRootIncludes mutator=seq mutator=ProcessInclude(workflows/Excel_files/dabs_xlsx_job.yml)
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=ProcessRootIncludes mutator=seq mutator=ProcessInclude(workflows/PDF_files/pdf_processing_dabs_job.yml)
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=ProcessRootIncludes mutator=seq mutator=ProcessInclude(workflows/amazon_reviews/resources.yml)
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=ProcessRootIncludes mutator=seq mutator=ProcessInclude(workflows/capiq/resources.yml)
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=ProcessRootIncludes mutator=seq mutator=ProcessInclude(workflows/capital_markets/resources.yml)
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=ProcessRootIncludes mutator=seq mutator=ProcessInclude(workflows/data_rooms/resources.yml)
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=ProcessRootIncludes mutator=seq mutator=ProcessInclude(workflows/deal_database/resources.yml)
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=ProcessRootIncludes mutator=seq mutator=ProcessInclude(workflows/firehose/resources.yml)
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=ProcessRootIncludes mutator=seq mutator=ProcessInclude(workflows/investor_relations/resources.yml)
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=ProcessRootIncludes mutator=seq mutator=ProcessInclude(workflows/isg/resources.yml)
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=ProcessRootIncludes mutator=seq mutator=ProcessInclude(workflows/omg/resources.yml)
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=ProcessRootIncludes mutator=seq mutator=ProcessInclude(workflows/tax/resources.yml)
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=VerifyCliVersion
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=EnvironmentsToTargets
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=InitializeVariables
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=DefineDefaultTarget(default)
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=LoadGitDetails
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=PythonMutator(load)
22:25:48 DEBUG Apply pid=1662 mutator=load mutator=seq mutator=SelectTarget(prod)
22:25:48 ERROR Error: cannot merge int with string pid=1662 mutator=load mutator=seq mutator=SelectTarget(prod)
22:25:48 ERROR Error: cannot merge int with string pid=1662 mutator=load mutator=seq
22:25:48 ERROR Error: cannot merge int with string pid=1662 mutator=load
Error: cannot merge int with string
22:25:48 ERROR failed execution pid=1662 exit_code=1 error="cannot merge int with string"
Error: Process completed with exit code 1.
arcaputo3 commented 3 months ago

Note that destroying the target and/or manually deleting the .bundle and retrying yields the same issue.

pietern commented 3 months ago

Thanks for reporting this issue. Can you share (a snippet of) your bundle configuration?

We didn't change the merge logic that affects these code paths, so I suspect an upstream issue.

Being able to reproduce this would be very helpful.

arcaputo3 commented 3 months ago

Sure, please see our databricks.yml below. Our sub-YAML files just define the workflows.

# This is a Databricks asset bundle definition for tjc_databricks.
# See [REDACTED] for documentation.
bundle:
  name: tjc-databricks
  git:
    origin_url: [REDACTED]
    # branch: main

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

include:
  - workflows/*/*.yml

variables:
  environment:
    description: The environment of the workflow
    default: dev
  principal_user:
    description: The principal user to run in production
    default: [REDACTED]
  tjc_excelsior_version:
    description: The version of `tjc-excelsior` to use
    default: 1.0.8
  tika_ocr_version:
    description: The version of `tika-ocr` to use
    default: 0.1.6
  pause_status:
    description: The status of scheduling for jobs. Only unpauses for prod.
    default: PAUSED
  pause_status_file_sync:
    description: The status of allowing file notifications. Only pauses for dev.
    default: UNPAUSED
  limit:
    description: The limit to use for testing
    default: 10

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: [REDACTED]
    variables:
      environment: dev
      pause_status_file_sync: PAUSED

  test:
    mode: development
    workspace:
      host: [REDACTED]
      root_path: /Users/${var.principal_user}/.bundle/${bundle.name}/${bundle.target}
    run_as:
      user_name: ${var.principal_user}
    variables:
      environment: test

  prod:
    mode: production
    workspace:
      host: [REDACTED]
      root_path: /Users/${var.principal_user}/.bundle/${bundle.name}/${bundle.target}
    run_as:
      user_name: ${var.principal_user}
    variables:
      environment: prod
      pause_status: UNPAUSED
      limit: "-1"
pietern commented 3 months ago

Thanks for providing the config. I'm able to reproduce.

The underlying problem is that we changed how we store variable values. All values used to be cast into a string, so you could use YAML strings, integers, and bools interchangeably and it would work. We changed this to accommodate complex-valued variables and now they can assume any type. Mixing types at the YAML level is what's causing the issue here.

We'll investigate further and figure out how to support this better.

In the meantime, you can work around the issue by making all variable values explicit strings:

variables:
  # ...
  limit:
    description: The limit to use for testing
    default: "10"

Note the quotes around the value 10.
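
To make the mismatch concrete: in the config above, the default for `limit` is a YAML integer while the `prod` override is a string, which is exactly the int/string merge the error reports. A minimal sketch (trimmed from the original config, not the full file):

```yaml
# Minimal illustration of the type mismatch: the default is a YAML
# integer while the prod override is a string, so the CLI tries to
# merge an int onto a string-typed value and fails.
variables:
  limit:
    default: 10        # int

targets:
  prod:
    variables:
      limit: "-1"      # string -> "cannot merge int with string"

# Workaround: use the same YAML type in both places,
# e.g. quote both values ("10" and "-1").
```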

blood-onix commented 2 months ago

@pietern also catching this error with setting up job timeouts via variable.

I've tried putting 7200 in quotes, but I still get the error "cannot merge int with string". Any workarounds?

Thanks!

targets:
  qa:
    mode: production
    workspace:
      host: http....
      root_path: /Workspace/TEST/.bundle/${bundle.name}/${bundle.target}
    variables:
      timeout_seconds: 7200
      warning_seconds: 5400
    run_as:
      service_principal_name: ${var.spn}
    resources:
      jobs:
        example_ingest:
          timeout_seconds: ${var.timeout_seconds}
          health:
            rules:
            - metric: RUN_DURATION_SECONDS
              op: GREATER_THAN
              value: ${var.warning_seconds}
blood-onix commented 2 months ago

OK, the workaround for my case is to remove timeout_seconds from the job template and set it only in the main bundle deployment file. In that case it works even without quotes.

pietern commented 2 months ago

@blood-onix This sounds like a different issue.

Did you hard-code timeout_seconds: <some integer> in your base definition?

blood-onix commented 2 months ago

If you mean the job template: yes, the job was created manually via the UI and exported with databricks bundle generate, so timeout_seconds is set in the job template. The idea is to override the value for other targets, while dev uses the default value from the job template.

../resources/example_ingest.yml

resources:
  jobs:
    example_ingest:
      name: 'Example test ingest'
      email_notifications:
        on_duration_warning_threshold_exceeded:
        - redacted@email.com
        no_alert_for_skipped_runs: false
      webhook_notifications: {}
      timeout_seconds: 3200
      max_concurrent_runs: 1
      tasks:
      - task_key: Ingest
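
Given pietern's diagnosis, the hard-coded integer `timeout_seconds: 3200` in this base template likely clashes with the variable-based override in the target. One way to keep the types consistent (a sketch, assuming the variable is declared at the top level of the bundle) is to reference the variable in the base template as well, instead of hard-coding the integer:

```yaml
# Sketch: avoid mixing a hard-coded int with a per-target variable
# override by using the variable everywhere and giving it a default.
variables:
  timeout_seconds:
    description: Job timeout in seconds; overridden per target
    default: 3200

resources:
  jobs:
    example_ingest:
      name: 'Example test ingest'
      timeout_seconds: ${var.timeout_seconds}  # no hard-coded 3200 here
```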
arcaputo3 commented 2 months ago

This is helpful, thank you. Out of curiosity, is there a known reason why this only affects our prod target, and only via GitHub Actions? Our dev and test CI/CD works, and locally on Windows 11 I can successfully run databricks bundle deploy -t prod.

pietern commented 2 months ago

@arcaputo3 If the configuration you provided is complete, it's because only the prod target overrides the variable value (with an incompatible type). The other targets use the default provided at the top level directly, so no merge takes place.
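
In other words, using the reported config as the reference: dev and test never touch `limit`, so the int default is used as-is and no merge occurs; only prod layers a string onto the int default. A minimal sketch:

```yaml
variables:
  limit:
    default: 10      # int; dev and test use this directly -> no merge, no error

targets:
  dev: {}            # no override of limit -> deploy succeeds
  prod:
    variables:
      limit: "-1"    # string override merged onto an int default -> fails
```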