Closed: @leonardovida closed this issue 3 weeks ago
Thanks for reaching out! I believe it might be related to serverless environments not loading/updating wheel files if the version stays the same. What we recommend is to increase the wheel version on every build, for example by including a timestamp as a version parameter; see https://github.com/databricks/bundle-examples/blob/main/default_python/setup.py#L20
Could you give it a try and see if it helps?
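To make the timestamp suggestion concrete, here is a minimal sketch (the function name and base version are placeholders, not from the linked file) of how a `setup.py` can produce a unique wheel version per build:

```python
import datetime


def timestamped_version(base: str) -> str:
    """Append a UTC timestamp as a PEP 440 local version segment.

    pip treats each distinct "+<stamp>" suffix as a different build,
    which defeats same-version caching on the platform side.
    """
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d.%H%M%S")
    return f"{base}+{stamp}"


# In setup.py you would pass this to setup(version=timestamped_version("0.0.1")).
print(timestamped_version("0.0.1"))
```

Each `databricks bundle deploy` would then ship a wheel with a version like `0.0.1+20240101.120000`, forcing the environment to install it fresh.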
Hey Andrew, thanks for the answer! I tried, but I'm getting errors because I cannot get environment `dependencies` to accept globs (I'm using `*`, as these dependent libraries change very often). Let me know if there is a better or other way to define them. I previously tried with notebooks, but realized that with the serverless version I'd need to add the `%pip install` within the notebook itself.
```yaml
environments:
  - environment_key: afas_environment
    spec:
      client: "1"
      dependencies:
        - "/Workspace${workspace.root_path}/artifacts/.internal/*.whl" # <- is this correct?
        - "/Volumes/${bundle.target}/commons/commons/libraries/afas-0.1.0-py3-none-any.whl"
        - "${var.pypi_pendulum}"
```
with this error:

```
[...]
Library installation failed: Notebook environment installation failed:
WARNING: Requirement '/Workspace/Shared/.bundle/pia_databricks/testing/artifacts/.internal/*.whl' looks like a filename, but the file does not exist
ERROR: *.whl is not a valid wheel filename.
[...]
```
Also, is there a way to force-skip the wheel version check? For a number of internal reasons I'd prefer not to be obliged to increase the wheel version on every build.
Glob extension of paths works only for local paths. In the `dependencies` section you can specify a local path to wheel files instead of a remote path, and it will be correctly extended and interpolated to a remote path:
```yaml
dependencies:
  - "./your/local/path/*.whl"
  - "/Volumes/${bundle.target}/commons/commons/libraries/afas-0.1.0-py3-none-any.whl"
```
> Also, is there a way to force-skip the wheel version check? I would like not to be obliged to always increase the wheel version for a number of internal reasons
This is something happening on the platform side, not in DABs. I don't know the exact details of this behaviour; maybe @lennartkats-db can chime in with some comments?
@andrewnester thanks! (I initially thought the environment was still caching the wheel, but correction: it seems it's working!) I'm wondering if it's possible to force a reset of the environment attached to a specific job on workflow deploy?
I have a follow-up question while I have your attention; please let me know if I should open a new ticket. If you look at the config I attached to this ticket, you'll see that I use `parameters` to define workflow-level params passed to the job (e.g. dates). With clusters, these parameters would be pushed down and overwrite the individual notebook/task parameters. With serverless and `spark_python_task`, the workflow-level parameters are not pushed down, and I seem to have to repeat them for each task. Is there a relevant workaround or documentation for this that I'm missing?
Thank you!
I got the same issue recently. I don't know what changed on the platform side, but it seems it is caching a previously deployed wheel even though it has been overwritten. The problem persists even after destroying and deploying again. I solved it by changing the Python wheel name, but I don't think that is a good solution. Is there a way to deploy the Python wheel to a random subdirectory in the artifacts/.internal directory (as happened with dbx) without adding code, just using the YAML file?
EDIT: A partial solution is to use the git commit as part of the artifact path where the Python wheel is deployed:
```yaml
workspace:
  artifact_path: /${workspace.root_path}/artifacts/${bundle.git.commit}
```
@dgarridoa thanks for your workaround! I actually prefer it much more than having to uniquely name each build, especially as we are calling the main library from hundreds of workflows.
Please remember to comment in this thread if you find a way to force-delete the cache on the platform side. My hunch is that `environments:` should have been much further along in development by now, but something went wrong there.
Still, I would highly encourage the CLI team to bring serverless at least to feature parity with what clusters offered before, and to release complete documentation with many examples of complex workflows (or best practices) from Databricks on how to best manage workflows with DABs and serverless. You have the bundle-examples repository; it would be great if you could continue expanding it, especially now that serverless is GA.
We have an example of a serverless job (https://github.com/databricks/bundle-examples/tree/main/knowledge_base/serverless_job), but it indeed does not cover the wheel-file use case. We accept contributions and would appreciate any help with the bundle-examples knowledge base, so please feel free to contribute any examples you find useful.
> Is there a way to deploy the python wheel to a random subdirectory in the artifacts/.internal directory
@dgarridoa It's interesting: why does that solve the issue for you? The way it works (or at least used to) is that the library won't be updated if it has the same version, and the path does not really matter. There might indeed be some changes on the platform side; I will verify with the corresponding team and let you know.
> I'm wondering if it's possible to force the reset of the environment attached to a specific job on workflow deploy?
@leonardovida could you elaborate a bit on what you mean by that? E.g. every time the job is restarted, it starts with a clean environment and all libraries are installed again?
> With serverless and spark_python_task the workflow-level parameters are not pushed down and I seem to be having to repeat them for each task, is there a relevant workaround or documentation for this that I'm missing?
This is likely due to the Spark Python task being used, not serverless. Spark task parameters work a bit differently; here are the docs with some details: https://docs.databricks.com/en/workflows/jobs/create-run-jobs.html#task-parameters and https://community.databricks.com/t5/data-engineering/retrieve-job-level-parameters-in-spark-python-task-not-notebooks/td-p/75324
In particular this part:
> To pass job parameters to tasks that are not configured with key-value parameters such as JAR or Spark Submit tasks, format arguments as {{job.parameters.[name]}}, replacing [name] with the key that identifies the parameter.
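As a concrete illustration of that doc snippet, here is a hypothetical bundle fragment (the job name, parameter name, and file path are invented for this sketch) showing a job-level parameter referenced explicitly from a `spark_python_task`:

```yaml
# Hypothetical fragment: job-level parameters are not auto-pushed to
# spark_python_task, so reference them per task via a dynamic value.
resources:
  jobs:
    my_job:
      parameters:
        - name: run_date
          default: "2024-01-01"
      tasks:
        - task_key: ingest
          spark_python_task:
            python_file: ./src/ingest.py
            parameters:
              - "--run-date"
              - "{{job.parameters.run_date}}"
```

The script then receives `--run-date 2024-01-01` (or whatever was passed at run time) as ordinary command-line arguments.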
> My hint is that environments: should have been much further in development by now, but something went wrong there.
The environments feature is still in Public Preview, so we welcome any feedback on how to make it better. Thank you!
@andrewnester sure! Quick context to clarify where I'm coming from. In all the companies I have worked at or consulted for, the usual approach to deploying workflows (before with dbx, now with the CLI) has been similar to the following: a `.yml` file somewhere that gets picked up by `databricks.yml` (`resources: jobs: [...]`), and then your tasks, which are usually either notebooks or `spark_python`.

Now, given that many people work on these repos and there are many changes to the library, in the "testing" environment(s) the library is not versioned, while it usually is in acc/prod. With clusters, a new change would simply be picked up after `databricks bundle deploy -t [ENV here]`, without any need to version it, as it overwrote the previous file in `.bundle` for that version in that environment. Suddenly, with `environments` this does not happen anymore, and the environment caches, in a non-explicit way, the previously installed libraries for that workflow (or at a deployment level for each workflow individually?). The same "problem" happens with other internal libraries as well. Assume our workflow needs a second library: this library suddenly needs to 1) be versioned and 2) have its new version referenced in all the workflows that use it (sure, I can use `${}` variables to speed up the change). I think it would be great if you could let us decide whether we want to follow this pattern or not. I'm not saying your approach is incorrect, but the sudden enforcement of this pattern, without much documentation, made me and my team pretty frustrated (sorry for my tone!). TLDR: it would be great to have a way to guarantee a refresh of the environment across deployments of the bundle, without having the previous libraries be cached and take priority over the new one even when the wheel has the same name.
Re parameters + bundle examples: thanks for the resources; I plan on making a PR to the bundle-examples repo.
Thank you!
@andrewnester I was having this issue yesterday with Python wheel tasks on job compute using DBR 14.3 ML, while deploying new fixes to a failing workflow. I guess that since the workflow does not change (all its parameters have the same values as its previous version), it keeps a cached workflow definition. Deploying the Python wheel to a different path changes the workflow definition, because the dependent-libraries parameter value changes for those tasks.
EDIT: It does not always happen, so it might be hard to reproduce.
@dgarridoa do you use cluster compute or serverless?
Cluster compute, specifically Job Compute.
> but the sudden enforcement of this pattern, without much documentation, made me + team pretty frustrated

Thanks @leonardovida for the detailed feedback. I will pass it on to the team owning the serverless and environment functionality. cc @lennartkats-db
In the meantime, versioning libraries is the right way to go, for two reasons:
- Caching. With versioning, new versions are not cached and are therefore loaded into environments. Indeed, as you pointed out, this can also be addressed by generating a unique path, but that does not actually work consistently due to point 2.
- Installing the same package with the same version (or no version) works non-deterministically and/or does not update the library to the newer version. This is how it works by design on the Databricks platform side, and this is what @dgarridoa might be running into. Because of this, unique paths were removed from bundles: Do not add wheel content hash in uploaded Python wheel path #1015
To summarise, at the moment the best way forward is to use unique versions. In the meantime we'll reach out to the team owning serverless, environments and the library installation experience to see if and how we can make this better.
A workaround for those who use poetry and don't want to add additional code to their CI/CD is to modify the Python wheel version by adding a local version identifier, such as a commit hash, as follows:
```yaml
artifacts:
  default:
    type: whl
    build: poetry version $(poetry version --short)+${bundle.git.commit} && poetry build
    path: .
```
To add to the workaround above, please take a look at the timestamp that's added to the wheel version in the default DABs Python template: https://github.com/databricks/bundle-examples/blob/38a9eb001344121885e530351e4c4bafc01f6ca7/default_python/setup.py#L20. That's another way to work around any kind of caching.
@andrewnester, I currently have a similar setup with poetry and asset bundles and would like to move to serverless. However, we have many notebook tasks, and it seems that environments can't be attached to notebook tasks. I'm not really in favor of using %pip magic commands. Is there some other way to do this?
FWIW: none that I could find gave me what we had before with clusters. We decided to migrate all notebook workloads to `spark_python` tasks for this reason. But likely @lennartkats-db and/or @andrewnester can shed better light on this (or give us a glimpse of what's coming? :) )!
@kiwi-niels

> I'm not really in favor of using %pip magic commands. Is there some other way to do this?

Yeah, we'd like to offer something nicer at the platform level. Until then, `%pip` is your best bet if you want to use notebook tasks. But you can also use a `spark_python` task or a `python_wheel_task`, as seen in the default template and the example at the top of this ticket. With those task types you can use `environments` to configure the environment.
cc @leonardovida
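For anyone sticking with notebook tasks in the meantime, a minimal sketch of the `%pip` approach (the wheel path below is a placeholder for wherever your bundle uploads the artifact):

```python
# First cell of the notebook task (Databricks notebook magic, not plain Python).
# The path is illustrative; substitute your bundle's artifact location.
%pip install /Workspace/Shared/.bundle/my_bundle/dev/artifacts/.internal/my_project-0.1.0-py3-none-any.whl

# Restart the Python process so the freshly installed package is importable.
dbutils.library.restartPython()
```

Note that `%pip` will reinstall whatever wheel is at that path on each run, which sidesteps the environment-level caching discussed above.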
Closing this as everything was answered
Describe the issue
Databricks environments using serverless compute cannot find the new version of my library wheel built with poetry. A similar setup was working perfectly with clusters.
Unfortunately I cannot find any decent documentation or guide for serverless compute (the databricks/bundle-examples repo is good but too simple).
As you will read below, the artifact is being correctly loaded at "/Workspace${workspace.root_path}/artifacts/.internal/test_databricks-0.0.1-py3-none-any.whl" (i.e. the file is updated), but it's not being picked up by the serverless environment. This also happens when I destroy all workflows and re-deploy them.
Configuration
and my databricks.yml is
Steps to reproduce the behavior
Expected Behavior
OS and CLI version
Serverless compute
Is this a regression?
Yes; on a 15.3 cluster this was not happening.