Are you wanting to deploy these bundles independently?
> Are you wanting to deploy these bundles independently?
Yes. Currently we have close to 100 jobs/pipelines (and growing fast). Deploying them all as an atomic unit slows down both the development and deployment process.
My plan is to create a bundle file for each atomic deployable unit, then create a release bundle that includes the units to deploy. Then I'll deploy the release bundle. Each release bundle will have a filename matching the release number, which is why I need to be able to specify the filename.
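For illustration, a rough sketch of what I have in mind (file and job names here are made up, and a release-numbered filename like this is exactly what isn't supported today):

```yaml
# 1.4.2.yml -- hypothetical release bundle named after the release
bundle:
  name: release-1-4-2

# Pull in only the units that ship in this release
include:
  - units/ingest_job.yml
  - units/scoring_pipeline.yml
```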
In the short term until this fix, you could make that release number a directory and write the corresponding `bundle.yml` underneath that. I'm not saying it's pretty by any means, but it might get you on your way for now. I agree there should be a `-f, --file` argument that defaults to `bundle.yml`.
> you could make that release number a directory and write the corresponding `bundle.yml` underneath that.
Yeah, I thought about doing it that way, but we currently have a CLI built on `dbx` that has the functionality we need, so it's not urgent right now. I just need something that would let me migrate, and this is likely a better approach generally.
As a side note, we have rollback functionality that lets you roll back a specific pipeline by looking for the pipeline in a prior release and redeploying from that tag. This would be hard to build when you need to be in the bundle root, particularly as a temporary solution.
Sort of related -- you can specify the `BUNDLE_ROOT` environment variable to point the CLI at the bundle root if it cannot be derived from your working directory. It still needs a `databricks.yml` / `bundle.yml` file though, so it doesn't solve your problem entirely.
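For example, in CI you could export it before invoking the CLI (a hypothetical GitHub Actions step; the workflow layout and path are illustrative):

```yaml
# Illustrative CI step: point the CLI at the bundle root explicitly
- name: Deploy bundle
  run: databricks bundle deploy
  env:
    BUNDLE_ROOT: ${{ github.workspace }}/my_bundle
```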
One possible solution is to use a directory named after the version number instead of a file name, and then place the config there.
> One possible solution is to use a directory named after the version number instead of a file name, and then place the config there.
Thanks. I understand that you can do it this way, but it would add a lot of unnecessary folder structure, and the `include` paths would become more complicated. And if you want to do it in CI/CD, then you need to add environment variables to reach the bundle file based on your release number.
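Roughly what I mean (paths invented for illustration; I'm not even sure relative includes that climb out of the bundle directory are allowed at all):

```yaml
# releases/1.4.2/bundle.yml -- with the bundle file nested under a
# release directory, every include has to climb back out of it
include:
  - ../../jobs/etl_job.yml
  - ../../pipelines/scoring.yml
```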
Is there a strong reason to require a single `bundle.yml` file with no other name, as opposed to using that as a default and allowing the user to specify the bundle file they want to use (regardless of the name or other YAML files)?
It seems unnecessarily restrictive in how DABs can be used. I've been working with my team for a while now to refine our release process, and there are a lot of really great features in DABs that I think will help us even more. But this restriction offsets those benefits because it forces you to set up the process in an inefficient way for what looks to be no real benefit (compared with keeping the default but allowing the user to specify a file).
I'm going to wait to migrate from `dbx` until this is implemented. I would work on implementing it myself, but I'm not much of a Go developer.
Thanks for raising this and the additional context. I'm contemplating the implications of doing this w.r.t. relative paths, for both the bundle root and the includes. Today, we anchor paths on the bundle root. If the entry point can be specified, we have to deal with 1) implications for paths specified under this bundle root (ambiguity: relative to the bundle root, or relative to the entry point file?), and 2) possible ambiguity if there are multiple bundle files defined at the same path. All in all, it adds complexity, which is why I'm hesitant. I'll have a chat with the team, let this ruminate, and come back to this.
Yeah, that definitely makes sense regarding the additional complexity. I can provide some additional context to help the discussion:

With `dbx`, if I have a `deployment.yml` that has `{% include ... %}`, then I'm writing the path relative to the `deployment.yml` file, which I think makes sense conceptually. If you're adding an include statement, then the filepath should be relative to where that file lives. I can see this becoming more complex if you have nested `include`s, but if you take a recursive approach to building the file, then it shouldn't be too much more complex than a single nesting level (as long as it's a true DAG with no circular references).

Hope that helps!
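To make the convention concrete, a sketch of the two-level case (file names invented; this is the `dbx`-style Jinja include, not DABs syntax):

```yaml
# conf/deployment.yml -- the include path is written relative to conf/
environments:
  default:
    workflows:
      {% include 'workflows/etl.yml' %}

# conf/workflows/etl.yml -- a nested include would in turn be written
# relative to conf/workflows/, e.g.:
#   {% include 'tasks/ingest.yml' %}
```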
> Thanks for raising this and the additional context. I'm contemplating the implications of doing this w.r.t. relative paths, for both the bundle root and the includes. Today, we anchor paths on the bundle root. If the entry point can be specified, we have to deal with 1) implications for paths specified under this bundle root (ambiguity: relative to the bundle root, or relative to the entry point file?), and 2) possible ambiguity if there are multiple bundle files defined at the same path. All in all, it adds complexity, which is why I'm hesitant. I'll have a chat with the team, let this ruminate, and come back to this.
As I work with this a bit more, I'm trying to understand the benefit of anchoring paths to the bundle root. Doing this means you have to put the bundle at the repo root anyway, since otherwise it would not have access to your app code. If that's true, then why not just anchor relative paths to the repo root? It would be a lot more intuitive and would allow you to maintain separate folders for bundle files. Using the repo root to anchor paths is also the more common approach in most libraries, including `dbx`.
> I'm trying to understand the benefit of anchoring paths to the bundle root. Doing this means you have to put the bundle at the repo root anyway, since otherwise it would not have access to your app code.
That depends on the layout of the repository. If you ship your common code as a wheel file, then a repo can contain multiple independent bundles and still reuse that code. No files outside those independent bundle directories will then be visible to bundles in adjacent directories. This can be beneficial for huge monorepos where either 1) synchronizing more than you need, or 2) having to deal with (deeply?) nested paths in your deployed bundle, is not optimal.
You're trying to achieve the inverse of this, where the intention is to share all files in a repo between multiple bundles and have these bundles be entry points into that code (e.g. bundle with ETL for domain A, bundle with ETL for domain B, where perhaps they only differ in a single parameter, and all their code is shared). Then they continue to be independently deployable, even though they share all code.
The crux of the problem is the (current) implied equivalence between the bundle root path and the path where `databricks.yml` is specified. If we enable decoupling these, you'd be able to use your repository root as the bundle root for all your bundles and have a different root configuration file (`databricks.yml`-equivalent) for each of them.
We discussed this and are planning to address this by adding the option to set `DATABRICKS_BUNDLE_FILE` to override the file to look for. Then you can set that to `./foo/bar/databricks.yml` and it would pick up that file as the root configuration file, while retaining `.` as the bundle root path. Includes specified in that file need to be relative to the directory of that file. Because decoupling this challenges existing assumptions, we have to be careful in implementing it.
Awesome! Sounds like exactly what I'd need. I'm looking forward to it.
> We discussed this and are planning to address this by adding the option to set `DATABRICKS_BUNDLE_FILE` to override the file to look for. Then you can set that to `./foo/bar/databricks.yml` and it would pick up that file as the root configuration file, while retaining `.` as the bundle root path. Includes specified in that file need to be relative to the directory of that file. Because decoupling this challenges existing assumptions, we have to be careful in implementing it.
@pietern any update on the timeline for this feature? I don't need anything precise, but it would be good to know whether it will be days, weeks, or months, to allow for planning. I need to plan potential migration timelines and how much to support our current `dbx`-based tools.
@pietern is this on the current roadmap? You mentioned that it's being worked on; just wondering what the timeline will be.
@ericfeunekes It is, though the specifics are tbd.
In the comments above we were talking about the bundle root and the path where `databricks.yml` is specified. One consideration we didn't take into account yet is that the bundle root is also where we place a cache directory for ephemeral state for a (bundle, target) pair. If we let multiple bundles use the same root, we need to disambiguate that as well. An alternative worth considering is to define the "bundle root" as the path to the `databricks.yml` file, and a "sync root" as the path that we end up synchronizing to the workspace. That way, we can keep the cache directory relative to the "bundle root" and have it guaranteed to be associated with the directory where `databricks.yml` is located.

The net result is the same, with the caveat that it requires one directory per bundle, i.e. it is not possible to have multiple bundle root files in a single directory. The configuration in `databricks.yml` could then refer to the sync root similar to:
```yaml
sync:
  root: ../..  # Repository root; defaults to "."
  include:
    - ./common
    - ./something
    - [...]
```
@pietern I think I understand what you're saying, but I don't think there's a need to reinvent anything; other tools have this same issue, and the typical approach is to define a cache folder in the root location and store ephemeral files there.
For example, when you build a Python library you get the `dist` folder in your root; granted, in this case the root is defined by the `pyproject.toml`, but if you're able to define a different repo root, then that cache folder could go in the same place. The other example is `dbt`, which compiles the models into a `target` directory that maintains the folder structure of the model definitions.
What you're proposing is better than what we have now, which is no ability to manage multiple bundles in a single release; but if you could define a separate cache folder at the defined repo root, that would allow for even more flexibility in how you work with the bundle files.
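For reference, `dbt` exposes that location as a setting in `dbt_project.yml` (a real dbt key, shown as a minimal sketch):

```yaml
# dbt_project.yml -- compiled artifacts land in this directory,
# resolved relative to the project root
target-path: "target"
```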
@pietern have you landed on an approach to this? I've been testing out bundles, and your last suggestion makes sense. My main problem right now is figuring out how to import helper functions from a sibling folder of the bundle root, and I think the solution you proposed would do that.
Update: we have implemented this (albeit slightly differently) in #1694.
To include directories outside the bundle root directory, you can add a `paths` section to the `sync` block like so:
```yaml
sync:
  paths:
    - "../common"
    - "."
```
This configuration will sync the `common` directory located one level above the bundle root directory, as well as the bundle root directory itself. If the bundle root directory is named `my_bundle`, then the tree in the Databricks workspace will look like this:
```
common/
  common_file.txt
my_bundle/
  databricks.yml
  src/
    ...
```
The PR has been merged and this is slated for release in v0.227.0.
This has been released as part of v0.227.0.
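For completeness, a minimal `databricks.yml` sketch using the released `sync.paths` setting (the bundle name, target, and host are placeholders):

```yaml
bundle:
  name: my_bundle

sync:
  paths:
    - "../common"  # sibling directory, synced alongside the bundle
    - "."          # the bundle root itself

targets:
  dev:
    workspace:
      host: https://example.cloud.databricks.com
```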
Running bundle commands from a folder that holds a `bundle.yml` (i.e. the bundle root) doesn't allow much flexibility in workflows; e.g. if I want to create multiple bundle files, use names other than `bundle`, or just run a command from the project root instead of from the location of the bundle file. Instead, all bundle commands should accept the specific bundle file as an argument.