databricks / cli

databricks bundle commands should take the bundle file as an argument #582

Closed ericfeunekes closed 2 months ago

ericfeunekes commented 1 year ago

Only being able to run bundle commands from a folder that holds a bundle.yml (i.e. the bundle root) doesn't allow much flexibility in workflows; e.g. if I want to create multiple bundle files, use names other than bundle.yml, or just run a command from the project root instead of from the directory that holds the bundle file.

Instead, all bundle commands should use the specific bundle file as an argument.

sburns commented 1 year ago

Are you wanting to deploy these bundles independently?

ericfeunekes commented 1 year ago

Are you wanting to deploy these bundles independently?

Yes. Currently we have close to 100 jobs/pipelines (and growing fast). Deploying them all as an atomic unit slows down both the development and deployment process.

My plan is to create a bundle file for each atomic deployable unit, then create a release bundle that includes the units to deploy. Then I'll deploy the release bundle. Each release bundle will have a filename matching the release number, which is why I need to be able to specify the filename.
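A rough sketch of the layout being described, with made-up file names -- one bundle file per deployable unit, plus release bundles named after the release they deploy:

bundles/
  ingest_orders.yml      # one bundle file per atomic deployable unit (hypothetical name)
  score_customers.yml
releases/
  2024.03.1.yml          # release bundle referencing the units to deploy
  2024.03.2.yml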

sburns commented 1 year ago

In the short term, until this is fixed, you could make that release number a directory and write the corresponding bundle.yml underneath that. I'm not saying it's pretty by any means, but it might get you on your way for now. I agree there should be a -f, --file argument that defaults to bundle.yml.
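For illustration, a hypothetical version of that workaround, with one directory per release number (names are made up):

releases/
  2024.03.1/
    bundle.yml           # run bundle commands from inside this directory
  2024.03.2/
    bundle.yml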

ericfeunekes commented 1 year ago

you could make that release number a directory and write the corresponding bundle.yml underneath that.

Yeah, I thought about doing it that way, but we currently have a CLI that, with dbx, has the functionality we need, so it's not needed right now. The requested feature is just something that would allow me to migrate, and it's likely a better approach generally.

As a side note, we have rollback functionality that allows you to roll back a specific pipeline by looking for the pipeline in a prior release and redeploying from that tag. This would be hard to build when you need to be in the bundle root, particularly as a temporary solution.

pietern commented 1 year ago

Sort of related -- you can specify the BUNDLE_ROOT environment variable to point the CLI at the bundle root if it cannot be derived from your working directory. It still needs a databricks.yml / bundle.yml file though, so it doesn't solve your problem entirely.

One possible solution is to use the version number not as a file name but as a directory name, and then place the config there.

ericfeunekes commented 1 year ago

One possible solution is to use the version number not as a file name but as a directory name, and then place the config there.

Thanks. I understand that you can do it this way, but it would add a lot of unnecessary folder structure, and then the include paths would become more complicated. And if you want to do it in CI/CD, you need to add environment variables to reach the bundle file based on your release number.

Is there a strong reason to require a single bundle.yml file with no other name, as opposed to using that as a default and allowing the user to specify the bundle file they want to use (regardless of its name or the presence of other yml files)?

It seems unnecessarily restrictive in how DABs can be used. I've been working with my team for a while now to refine our release process, and there are a lot of really great features in DABs that I think will help us even more. But this restriction offsets those other benefits because it forces you to set up the process in an inefficient way for what looks to be no real benefit (compared with keeping the same defaults but allowing the user to specify a file).

I'm going to wait to migrate from dbx until this is implemented. I would work on implementing it myself, but I'm not much of a Go developer.

pietern commented 1 year ago

Thanks for raising this and for the additional context. I'm contemplating the implications of doing this w.r.t. relative paths for both the bundle root as well as the includes. Today, we anchor paths on the bundle root. If the entry point can be specified, we have to deal with 1) implications for paths specified under this bundle root (ambiguity over whether they are relative to the bundle root or to the entry point file), and 2) possible ambiguity if there are multiple bundles defined at the same path. All in all, it adds complexity, which is why I'm hesitant. I'll have a chat with the team, let this ruminate, and come back to this.

ericfeunekes commented 1 year ago

Yeah that definitely makes sense regarding the additional complexity. I can provide some additional comments to help the discussion:

Hope that helps!

ericfeunekes commented 1 year ago

Thanks for raising this and for the additional context. I'm contemplating the implications of doing this w.r.t. relative paths for both the bundle root as well as the includes. Today, we anchor paths on the bundle root. If the entry point can be specified, we have to deal with 1) implications for paths specified under this bundle root (ambiguity over whether they are relative to the bundle root or to the entry point file), and 2) possible ambiguity if there are multiple bundles defined at the same path. All in all, it adds complexity, which is why I'm hesitant. I'll have a chat with the team, let this ruminate, and come back to this.

As I work with this a bit more, I'm trying to understand what the benefit is of anchoring paths to the bundle root. Doing this means you have to put the bundle at the repo root anyway, since otherwise it would not have access to your app code. If that's true, then why not just anchor relative paths to the repo root? It would be a lot more intuitive and would allow you to maintain separate folders for bundle files. Using the repo root to anchor paths is also the more common approach in most libraries, including dbx.

pietern commented 1 year ago

I'm trying to understand what the benefit is of anchoring paths to the bundle root. Doing this means you have to put the bundle at the repo root anyway, since otherwise it would not have access to your app code.

That depends on the layout of the repository. If you ship your common code as a wheel file, then a repo can contain multiple independent bundles and still reuse that code. No files outside those independent bundle directories will then be visible to bundles in adjacent directories. This can be beneficial for huge monorepos where either 1) synchronizing more than you need, or 2) having to deal with (deeply) nested paths in your deployed bundle, is not optimal.

You're trying to achieve the inverse of this, where the intention is to share all files in a repo between multiple bundles and have these bundles be entry points into that code (e.g. bundle with ETL for domain A, bundle with ETL for domain B, where perhaps they only differ in a single parameter, and all their code is shared). Then they continue to be independently deployable, even though they share all code.
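To make the two layouts concrete, here is a hypothetical sketch (all names are illustrative). The first repo ships shared code as a wheel and contains independent bundles; the second shares all files and uses several thin entry-point bundles:

# Layout 1: independent bundles, shared code packaged as a wheel
shared_lib/            # built and published as a wheel
bundle_a/
  databricks.yml
bundle_b/
  databricks.yml

# Layout 2: all code shared, bundles as entry points
common/
etl_domain_a/
  databricks.yml       # differs from domain B in only a parameter or two
etl_domain_b/
  databricks.yml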

The crux of the problem is the (current) implied equivalence between the bundle root path and the path where databricks.yml is specified. If we enable decoupling these, you'd be able to use your repository root as the bundle root for all your bundles and have a different root configuration file (databricks.yml-equivalent) for each of them.

We discussed this and are planning to address it by adding the option to set DATABRICKS_BUNDLE_FILE to override the file to look for. Then you can set that to ./foo/bar/databricks.yml and it would pick up that file as the root configuration file, while retaining . as the bundle root path. Includes specified in that file need to be relative to the directory of that file. Because this decoupling challenges existing assumptions, we have to be careful in implementing it.
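A sketch of what that would look like on disk, relative to the repository root (which remains the bundle root); names other than foo/bar are illustrative, and note that the feature ultimately shipped in a different form (see the final comments):

common/                # hypothetical shared code, visible to all bundles
foo/
  bar/
    databricks.yml     # selected via DATABRICKS_BUNDLE_FILE=./foo/bar/databricks.yml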

ericfeunekes commented 1 year ago

Awesome! Sounds like exactly what I'd need. I'm looking forward to it.

ericfeunekes commented 1 year ago

We discussed this and are planning to address it by adding the option to set DATABRICKS_BUNDLE_FILE to override the file to look for. Then you can set that to ./foo/bar/databricks.yml and it would pick up that file as the root configuration file, while retaining . as the bundle root path. Includes specified in that file need to be relative to the directory of that file. Because this decoupling challenges existing assumptions, we have to be careful in implementing it.

@pietern any update on the timeline for this feature? I don't need anything precise, but it would be good to know whether it will be days, weeks, or months, to allow for planning. I need to plan potential migration timelines and how much to invest in supporting our current dbx-based tools.

ericfeunekes commented 11 months ago

@pietern is this on the current roadmap? You mentioned that it's being worked on, just wondering what the timeline will be.

pietern commented 11 months ago

@ericfeunekes It is, though the specifics are tbd.

In the comments above we were talking about the bundle root and the path where databricks.yml is specified. One consideration we didn't take into account yet is that the bundle root is also where we place a cache directory for ephemeral state for a (bundle, target) pair. If we let multiple bundles use the same root, we need to disambiguate that as well. An alternative worth considering is to define the "bundle root" to be the path to the databricks.yml file, and a "sync root" to be the path that we end up synchronizing to the workspace. That way, we can keep the cache directory relative to the "bundle root" and have it guaranteed to be associated with the directory where databricks.yml is located.

The net result is the same, with the caveat that it requires one directory per bundle, i.e. it is not possible to have multiple bundle root files in a single directory. The configuration in databricks.yml could then refer to the sync root along these lines:

sync:
  root: ../..  # Repository root; defaults to "."
  include:
    - ./common
    - ./something
    - [...]
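To make this concrete, here is a hypothetical on-disk layout under that proposal (directory names are illustrative): the bundle root is the directory containing databricks.yml and its cache directory, while sync.root points two levels up at the repository root, which is what gets synchronized.

repo/                  # sync root (sync.root: ../..)
  common/
  something/
  bundles/
    my_etl_bundle/     # bundle root: databricks.yml and the cache directory live here
      databricks.yml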

ericfeunekes commented 11 months ago

@pietern I think I understand what you're saying, but I don't think there's a need to reinvent anything; other tools have this same issue, and the typical approach is to define a cache folder in the root location and store ephemeral files there.

For example, when you build a Python library you get the dist folder in your root; granted, in that case the root is defined by the location of pyproject.toml, but if you're able to define a different repo root, then that cache folder could go in the same place. The other example is dbt, which compiles models into a target directory that maintains the folder structure of the model definitions.

What you're proposing is better than what we have now, which is no ability to manage multiple bundles in a single release, but if you could define a separate cache folder at the defined repo root, that would allow even more flexibility in how you work with the bundle files.

ericfeunekes commented 8 months ago

@pietern have you landed on an approach to this? I've been testing out bundles and your last suggestion makes sense. My main problem right now is figuring out how to import helper functions from a sister folder of the bundle root, and I think the solution you proposed would address that.

pietern commented 2 months ago

Update: we have implemented this (albeit slightly differently) in #1694.

To include directories outside the bundle root directory, you can add a paths section to the sync block like so:

sync:
  paths:
    - "../common"
    - "."

This configuration will sync both the common directory located one level above the bundle root and the bundle root directory itself. If the bundle root directory is named my_bundle, then the tree in the Databricks workspace will look like this:

common/
  common_file.txt
my_bundle/
  databricks.yml
  src/
    ...

The PR has been merged and this is slated for release in v0.227.0.

pietern commented 2 months ago

This has been released as part of v0.227.0.