databricks / cli

Databricks CLI

Symlinks ignored when uploading asset bundles #1706

Open michaelschuett-tomtom opened 3 weeks ago

michaelschuett-tomtom commented 3 weeks ago

Describe the issue

I have a symlink in my directory and it is silently ignored when uploading to the files folder. The databricks.yml file and the resources that it loads are symlinks as well; however, the CLI is able to read them. I can't find any docs about this, but I likely missed something. Is there any reason that symlinks are not supported?

Configuration

Create a symlink in the bundle directory, run databricks bundle deploy, and see that the symlinked file is missing from the uploaded files.

Steps to reproduce the behavior

Please list the steps required to reproduce the issue, for example:

  1. Run databricks bundle deploy ...
  2. Run databricks bundle run ...
  3. See error

Expected Behavior

A warning is output or better yet it just uploads the file.

Actual Behavior

It is silently ignored.

OS and CLI version

macOS, all versions

Is this a regression?

no

michaelschuett-tomtom commented 3 weeks ago

This is resolved with https://github.com/databricks/cli/pull/1708

michaelschuett-tomtom commented 2 weeks ago

Would be awesome to get this upstreamed. I have started to build some bazel rules around this so you can start doing stuff like this.

databricks(
    name = "deploy_dev",
    outs = [],
    args = [
        "bundle", 
        "deploy",
        "--target", "dev"
    ],
    required_vars = [
        "docker_client_id",
        "docker_client_secret",
    ],
    srcs = [
        ":databricks_files",
        ":some_notebook",
        ":internal_wheel",
    ],
)

However it feels a little strange to open source this and have it pointing to my internal builds of the databricks CLI.

andrewnester commented 2 weeks ago

@michaelschuett-tomtom what is the reason you use symlinks in your bundle? I assume it is to link content from outside the bundle root? If so, the latest CLI (0.227.0) has new functionality, sync.paths, which allows you to sync files outside of the bundle root.

https://github.com/databricks/cli/pull/1694

michaelschuett-tomtom commented 2 weeks ago

The goal here is mainly to make databricks play nicely with the bazel build system. The current problem is that when bazel builds its sandbox for commands to run in, it creates an ugly path under /private/var/tmp/_bazel_username/${commit hash}/... and symlinks in the dependent files, which may be outputs from other bazel rules or just static files in your repo.

Here is an example of what the directory might look like.

ls -lah
total 0
drwxr-xr-x@ 10 schuettm  wheel   320B Aug 27 13:25 .
drwxr-xr-x@  4 schuettm  wheel   128B Aug 27 13:25 ..
lrwxr-xr-x@  1 schuettm  wheel   168B Aug 22 11:38 create_tasks -> /private/var/tmp/_bazel_schuettm/7f7b24c2ab40dffefa912dc1d5931ddc/execroot/_main/bazel-out/darwin_arm64-fastbuild/bin/workflows/databricks/somepath/create_tasks
lrwxr-xr-x@  1 schuettm  wheel    87B Aug 22 11:38 databricks.yml -> /Users/schuettm/Code/repo/workflows/databricks/somepath/databricks.yml
lrwxr-xr-x@  1 schuettm  wheel   169B Aug 22 11:38 deploy_dev.sh -> /private/var/tmp/_bazel_schuettm/7f7b24c2ab40dffefa912dc1d5931ddc/execroot/_main/bazel-out/darwin_arm64-fastbuild/bin/workflows/databricks/somepath/deploy_dev.sh
drwxr-xr-x@  3 schuettm  wheel    96B Aug 22 11:38 fixtures
drwxr-xr-x@  3 schuettm  wheel    96B Aug 27 09:14 resources
lrwxr-xr-x@  1 schuettm  wheel   167B Aug 22 11:38 select -> /private/var/tmp/_bazel_schuettm/7f7b24c2ab40dffefa912dc1d5931ddc/execroot/_main/bazel-out/darwin_arm64-fastbuild/bin/workflows/databricks/somepath/select
drwxr-xr-x@  5 schuettm  wheel   160B Aug 22 21:51 src
drwxr-xr-x@  3 schuettm  wheel    96B Aug 27 10:41 whls

The title of this could likely be "I want to make databricks asset bundles work with bazel". The sync.paths you mentioned does sound like it has some, but not complete, overlap with what I am trying to achieve inside bazel. The initial reason for porting the databricks command to a bazel rule was so that wheel files the bundle depends on could be built and inserted into the asset bundle with one command. That would greatly improve our current workflow of building and publishing a new package and then updating the notebook to the newly created version.

michaelschuett-tomtom commented 1 week ago

Just a bump on this to keep it from becoming stale, since I currently have the time to work on or modify the linked PR, provided upstream is willing to accept it.

pietern commented 6 days ago

@michaelschuett-tomtom Thanks for posting the issue and including the rationale.

It's great to hear you're looking to make the CLI work well with Bazel.

There are a couple of reasons why we ignore symlinks:

  1. We need bundle deployments to be reproducible, and symlinks may point anywhere. This means that if we both clone a repository that happens to have a symlink that points outside the repository, we'll produce different results.
  2. Symlinks can be recursive, and fixing this means ambiguity on how you treat symlinks (e.g., the first time you follow it, but the second time you skip it).
  3. We read gitignore files to figure out which files to synchronize and which files to skip. As Git does not chase symlinks, doing so introduces an asymmetry.

This doesn't help in building working Bazel rules, of course. But if you really only care about locally unrolling the symlink tree that Bazel builds, an alternative could be to run rsync with -L (or --copy-links) to create a symlink-free copy of the tree in a temporary directory, and then run the CLI from that copy. Would that work?
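
A minimal sketch of that rsync approach, under a few assumptions: the helper name unroll_symlinks, the BUNDLE_DIR variable, and the "dev" target are placeholders for illustration and not part of the CLI, and the cp -RL fallback is an addition for machines without rsync.

```shell
#!/bin/sh
set -eu

# Copy a directory tree into dst, replacing every symlink with the file
# or directory it points to. The function name is hypothetical.
unroll_symlinks() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    if command -v rsync >/dev/null 2>&1; then
        # -a: archive mode; -L (--copy-links): copy the referent of each symlink.
        rsync -aL "$src"/ "$dst"/
    else
        # Fallback when rsync is not installed: cp -L also follows symlinks.
        cp -RL "$src"/. "$dst"/
    fi
}

# Hypothetical usage; BUNDLE_DIR and the "dev" target are assumptions:
#   staging=$(mktemp -d)
#   unroll_symlinks "$BUNDLE_DIR" "$staging"
#   (cd "$staging" && databricks bundle deploy --target dev)
```

Because the staging directory contains only regular files and directories, the CLI never sees a symlink. The copy is a snapshot, so it has to be recreated whenever the Bazel outputs change.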