databricks / cli

Environments and serverless compute do not find new version of library #1621

Closed: leonardovida closed this issue 3 weeks ago

leonardovida commented 1 month ago

Describe the issue

Databricks environments using serverless compute cannot find the new version of my library wheel built with poetry. The same setup was working perfectly with cluster compute.

Unfortunately, I cannot find any decent documentation or guide around serverless compute (the databricks/bundle-examples repo is good, but too simple).

As you will read below, the artifact is correctly uploaded to "/Workspace${workspace.root_path}/artifacts/.internal/test_databricks-0.0.1-py3-none-any.whl" (i.e. the file is updated), but it is not picked up by the serverless environment. This also happens when I destroy all workflows and re-deploy them.

Configuration

resources:
  jobs:
    brouwers-afas:
      name: brouwers-afas

      schedule:
        quartz_cron_expression: ${var.brouwers_afas_schedule}
        timezone_id: Europe/Amsterdam

      webhook_notifications:
        on_failure:
          - id: ${var.notification_channel}

      notification_settings:
        no_alert_for_canceled_runs: true

      parameters:
      - name: "environment"
        default: "${bundle.target}"
      - name: "target"
        default: "brouwers-afas"
      - name: "full_refresh"
        default: "false"

      description: Ingestion of Brouwers AFAS API
      tags:
        environment: "${bundle.target}"
        label: "brouwers"
        source: "afas"
        type: "incremental"
      timeout_seconds: ${var.timeout_seconds}
      max_concurrent_runs: 1

      environments:
      - environment_key: afas_environment
        spec:
          client: "1"
          dependencies:
            - "/Workspace${workspace.root_path}/artifacts/.internal/test_databricks-0.0.1-py3-none-any.whl"
            - "/Volumes/${bundle.target}/commons/commons/libraries/afas-0.1.0-py3-none-any.whl"
            - "${var.pypi_pendulum}"

      tasks:
        - task_key: ingest-full-refresh
          environment_key: afas_environment
          spark_python_task:
            python_file: ${workspace.file_path}/test_databricks/notebooks/brouwers/afas/full_refresh.py
            parameters:
              - "environment=${bundle.target}"
              - "target=brouwers-afas"
              - "full_refresh=false"

and my databricks.yml is

# yaml-language-server: $schema=bundle_config_schema.json
bundle:
  name: pia_databricks
  deployment:
    fail_on_active_runs: true
    lock:
      enabled: true
      force: false

include:
  - resources/*.yml
  - resources/**/*.yml
  - resources/**/**/*.yml

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

Steps to reproduce the behavior

Expected Behavior

OS and CLI version

Serverless compute

Is this a regression?

Yes, on a DBR 15.3 cluster this was not happening.

andrewnester commented 1 month ago

Thanks for reaching out! I believe it might be related to serverless environments not loading / updating wheel files if the version stays the same. What we recommend is to increase the wheel version on every build, for example by including a timestamp in the version; see https://github.com/databricks/bundle-examples/blob/main/default_python/setup.py#L20

Could you give it a try and see if it helps?
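
For reference, here is a minimal sketch along the lines of the linked example, using the package name from the wheel path in this issue (test_databricks); with poetry, the same effect can be achieved by bumping the version before the build, as discussed further down in this thread:

# Sketch: append a build timestamp as a local version suffix so every build
# produces a uniquely versioned wheel and serverless environments never
# reuse a cached copy.
import datetime

from setuptools import setup, find_packages

setup(
    name="test_databricks",  # assumed name, matching the wheel in this issue
    version="0.0.1+" + datetime.datetime.utcnow().strftime("%Y%m%d.%H%M%S"),
    packages=find_packages(),
)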

leonardovida commented 1 month ago

Hey Andrew, thanks for the answer! I tried, but I'm getting errors because I cannot get environment dependencies to accept globs. I'm using * as these dependent libraries change very often. Let me know if there is a better/other way to define them. I previously tried with notebooks, but realized that with serverless I would need to add the %pip install within the notebook itself.

environments:
      - environment_key: afas_environment
        spec:
          client: "1"
          dependencies:
            - "/Workspace${workspace.root_path}/artifacts/.internal/*.whl" **<- is this correct?**
            - "/Volumes/${bundle.target}/commons/commons/libraries/afas-0.1.0-py3-none-any.whl"
            - "${var.pypi_pendulum}"

with error:

[...]
Library installation failed: Notebook environment installation failed:
WARNING: Requirement '/Workspace/Shared/.bundle/pia_databricks/testing/artifacts/.internal/*.whl' looks like a filename, but the file does not exist
ERROR: *.whl is not a valid wheel filename.
[...]

Also, is there a way to force-skip the wheel version check? For a number of internal reasons, I would prefer not to be obliged to always increase the wheel version.

andrewnester commented 1 month ago

Glob expansion works only for local paths. In the dependencies section, instead of a remote path you can specify a local path to your wheel files and it will be correctly expanded and interpolated into a remote path:

dependencies:
  - "./your/local/path/*.whl"
  - "/Volumes/${bundle.target}/commons/commons/libraries/afas-0.1.0-py3-none-any.whl"
andrewnester commented 1 month ago

Also, is there a way to force-avoid the wheel version check? I would like not to be obliged to always increase the wheel version for a number of internal reasons

This is something that happens on the platform side and not in DABs; I don't know the exact details of this behaviour. Maybe @lennartkats-db can chime in with some comments?

leonardovida commented 1 month ago

@andrewnester thanks, [but it seems the environment is still caching the wheel] <- correction: it seems it's working! I'm wondering if it's possible to force a reset of the environment attached to a specific job on workflow deploy?

I have a follow-up question given that I have your attention; please let me know if I should open a new ticket. If you look at the config I attached to this ticket, you'll see that I use parameters to define workflow-level params to be passed to the job (e.g. dates). With clusters these parameters would be pushed down to / overwrite the individual notebook/task's parameters. With serverless and spark_python_task the workflow-level parameters are not pushed down and I seem to have to repeat them for each task. Is there a relevant workaround or documentation for this that I'm missing?

Thank you!

dgarridoa commented 1 month ago

I ran into the same issue recently. I don't know what changed on the platform side, but it seems that a previously deployed wheel is being cached even though it has been overwritten. The problem persists even after destroying and deploying again. I solved it by changing the Python wheel name, but I don't think that is a good solution. Is there a way to deploy the Python wheel to a random subdirectory in the artifacts/.internal directory (as happened with dbx) without adding code, just using the YAML file?

EDIT: A partial solution is to use the git commit as part of the artifact path where the Python wheel is deployed.

workspace:
  artifact_path: /${workspace.root_path}/artifacts/${bundle.git.commit}
leonardovida commented 1 month ago

@dgarridoa thanks for your workaround! I actually prefer it much more than having to uniquely name each build, especially as we are calling the main library from hundreds of workflows. Please remember to comment in this thread if you find a way to force-delete the cache on the platform side. My hunch is that environments: should have been much further along in development by now, but something went wrong there.

Still, I would highly encourage the CLI team to bring serverless at least to feature parity with what clusters offered before, and to release complete documentation with many examples of complex workflows (or best practices) from Databricks on how to best manage workflows with DABs and serverless. You already have the bundle-examples repository; it would be great if you could continue expanding it, especially now that serverless is GA.

andrewnester commented 1 month ago

We have an example of a serverless job (https://github.com/databricks/bundle-examples/tree/main/knowledge_base/serverless_job), but it indeed does not cover the use case for wheel files. We accept contributions and would appreciate any help with the bundle-examples knowledge base, so please feel free to contribute any examples you find useful.


Is there a way to deploy the python wheel to a random subdirectory in in the artifacts/.internal directory

@dgarridoa It's interesting that this solves the issue for you. The way it works (or at least used to work) is that a library won't be updated if it has the same version, and the path does not really matter. There might indeed have been some changes on the platform side; I will verify with the corresponding team and let you know.


I'm wondering if it's possible to force the reset of the environment attached to a specific job on workflow deploy?

@leonardovida could you elaborate a bit on what you mean by that? For example, that every time the job is restarted, it starts with a clean environment and all libraries are installed again?


With serverless and spark_python_task the workflow-level parameters are not pushed down and I seem to be having to repeat them for each task, is there a relevant workaround or documentation for this that I'm missing?

This is likely due to the Spark Python task being used, not serverless. Spark task parameters work a bit differently; here are the docs with some details: https://docs.databricks.com/en/workflows/jobs/create-run-jobs.html#task-parameters https://community.databricks.com/t5/data-engineering/retrieve-job-level-parameters-in-spark-python-task-not-notebooks/td-p/75324

In particular this part

To pass job parameters to tasks that are not configured with key-value parameters such as JAR or Spark Submit tasks, format arguments as {{job.parameters.[name]}}, replacing [name] with the key that identifies the parameter.
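
Applied to the job at the top of this issue, the task parameters could reference the job-level parameters as in the sketch below (the parameter names match the parameters block in your original config), instead of repeating literal values per task:

tasks:
  - task_key: ingest-full-refresh
    environment_key: afas_environment
    spark_python_task:
      python_file: ${workspace.file_path}/test_databricks/notebooks/brouwers/afas/full_refresh.py
      parameters:
        # {{job.parameters.[name]}} is resolved at run time to the value of
        # the corresponding job-level parameter.
        - "environment={{job.parameters.environment}}"
        - "target={{job.parameters.target}}"
        - "full_refresh={{job.parameters.full_refresh}}"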


My hint is that environments: should have been much further in development by now, but something went wrong there.

The environments feature is still in Public Preview, so we welcome any feedback on how to make it better. Thank you!

leonardovida commented 1 month ago

@andrewnester sure! Quick context to clarify where I'm coming from. At all companies I have worked for or consulted, the usual approach to deploying workflows (previously with dbx, now using the CLI) has been similar to the following:

Now, given that many people work on these repos and there are many changes to the library, in the "testing" environment(s) the library is not versioned, while it usually is in acc/prod. With clusters, a new change would simply be picked up after databricks bundle deploy -t [ENV here], without any need to version it, since it overwrote the previous file in .bundle for that version in that environment. Suddenly, with environments this no longer happens and the environment, in a non-explicit way, caches the previously installed libraries for that workflow (or at a deployment level for each workflow individually?).

The same "problem" happens with other internal libraries as well. Let's assume that our workflow needs a second library: this library suddenly needs to 1) be versioned and 2) have the new version referenced in all the workflows that use it (sure, I can use ${} variables to speed up the change). I think it would be great if you could let us decide whether we want to follow this pattern or not. I'm not saying your approach is not correct, but the sudden enforcement of this pattern, without much documentation, made me + team pretty frustrated - so sorry for my tone!

So TLDR: it would be great to have a way to guarantee a refresh of the environment across deployments of the bundle, without the previous libraries being cached and taking priority over the "new one", even if the wheel has the same name.

Re parameters + bundle examples: thanks for the resources; I plan to make a PR to the bundle-examples repo.

Thank you!

dgarridoa commented 1 month ago

@andrewnester I was having this issue yesterday with Python wheel tasks on Job Compute using DBR 14.3 ML, where I was deploying new fixes to a failing workflow. I guess that because the workflow does not change and all its parameters have the same values as the previous version, it keeps a cached workflow definition. Deploying the Python wheel to a different path changes the workflow definition, because the dependent libraries value changes for those tasks.

EDIT: It does not always happen, so it might be hard to reproduce.

andrewnester commented 1 month ago

but the sudden enforcement of this pattern, without much documentation, made me + team pretty frustrated

thanks @leonardovida for the detailed feedback. I will pass it on to the team owning serverless and environment functionality. cc @lennartkats-db

In the meantime, versioning libraries is the right way to go for 2 reasons:

  1. Caching. With versioning, new versions are not cached and are therefore loaded into environments. Indeed, as you pointed out, this can also be addressed by generating a unique path, but that does not actually work consistently due to point 2.
  2. Installing the same package with the same version (or no version) works non-deterministically and/or does not update the library to the newer version. This is the way it works by design on the Databricks platform side, and this is what @dgarridoa might be running into. Due to this, unique paths were removed from bundles: https://github.com/databricks/cli/pull/1015

To summarise, at the moment the best way forward is to use unique versions. In the meantime, we'll reach out to the team owning serverless, environments and the library-installation experience to see if / how we can make this better.

andrewnester commented 1 month ago

@dgarridoa do you use cluster compute or serverless?

dgarridoa commented 1 month ago

@dgarridoa do you use cluster compute or serverless?

Cluster compute, specifically Job Compute.

dgarridoa commented 1 month ago

A workaround for those who use poetry and don't want to add additional code to their CI/CD is to modify the Python wheel version by adding a local version identifier, such as a commit hash, as follows:

artifacts:
  default:
    type: whl
    build: poetry version $(poetry version --short)+${bundle.git.commit} && poetry build
    path: .
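
With a local version identifier appended this way, every commit produces a wheel with a different version and filename (something like test_databricks-0.0.1+<commit>-py3-none-any.whl), so the platform no longer treats it as the same already-installed library.
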
lennartkats-db commented 1 month ago

To add to the workaround above, please take a look at the timestamp that's added to the wheel version in the default DABs Python template: https://github.com/databricks/bundle-examples/blob/38a9eb001344121885e530351e4c4bafc01f6ca7/default_python/setup.py#L20. That's another way to work around any kind of caching.

kiwi-niels commented 1 month ago

@andrewnester, I currently have a similar setup with poetry and asset bundles and would like to move to serverless. However, we have many notebook tasks and it seems that environments can't be attached to notebook tasks. I'm not really in favor of using %pip magic commands. Is there some other way to do this?

leonardovida commented 1 month ago

@andrewnester, I currently have a similar setup with poetry and asset bundles and would like to move to serverless. However, we have many notebook tasks and it seems that environments can't be attached to notebook tasks. I'm not really in favor of using %pip magic commands. Is there some other way to do this?

FWIW: none that I could find gave me what we had before with clusters. We decided to migrate all notebook workloads to spark_python tasks for this reason. But @lennartkats-db and/or @andrewnester can likely shed better light on this (or give us a glimpse of what's coming? :) )!

lennartkats-db commented 1 month ago

@kiwi-niels

I'm not really in favor of using %pip magic commands. Is there some other way to do this?

Yeah, we'd like to offer something nicer at the platform level. Until then, %pip is your best bet if you want to use notebook tasks. But you can also use a spark_python_task or a python_wheel_task, as seen in the default template and the example at the top of this ticket. With those task types you can use environments to configure the environment.
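
For completeness, a python_wheel_task with an environment would look roughly like the sketch below; package_name and entry_point are placeholders for illustration (the entry point must be declared in the wheel's metadata), and afas_environment refers to an environment defined under environments in the job:

tasks:
  - task_key: ingest
    environment_key: afas_environment
    python_wheel_task:
      package_name: test_databricks  # placeholder: distribution name of the wheel
      entry_point: main              # placeholder: a console-script entry point in the wheel
      parameters:
        - "environment=${bundle.target}"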

cc @leonardovida

leonardovida commented 3 weeks ago

Closing this as everything was answered