databricks / cli


Feature Request: `databricks sync` to target on DBFS (similar to `databricks fs cp`) #1619

Open sgerloff opened 3 months ago

sgerloff commented 3 months ago

Describe the issue

The command databricks sync is limited to target directories in the Workspace. That may not be the desirable target location for every sync. I propose handling targets on DBFS the same way databricks fs cp does, extending the range of applications for the sync command.

Steps to reproduce the behavior

  1. Run databricks sync <local_dir> dbfs:<dbfs dir>
  2. The command errors with "not starting with /"

Expected Behavior

Sync files and directories to the specified location on DBFS.

Actual Behavior

The command complains about the path because it uses the Workspace filer.

OS and CLI version

All

Is this a regression?

No

Debug Logs

Honestly, this is more like a feature request...

shreyas-goenka commented 3 months ago

@sgerloff Thanks for opening the issue. Could you inform us a bit about the use case behind the feature request?

sgerloff commented 3 months ago

At my company we use a tool implemented in Python that syncs our local code to Databricks for testing and development: https://github.com/getyourguide/db-rocket. This feature seems to be the only blocker to migrating to a databricks-cli native solution for syncing local files with DBFS. In addition, I guess any use case that uses databricks fs cp -r could potentially benefit from a sync operation.
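For context, a rough sketch of what the fs cp based workaround looks like today (the paths are placeholders):

# One-shot recursive copy of a local module to DBFS; there is no change
# detection, so every run re-uploads everything.
databricks fs cp -r ./python_module dbfs:/tmp/dev/python_module

A sync equivalent would only transfer changed files and could keep watching for changes.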

sgerloff commented 3 months ago

I wouldn't mind contributing this feature myself. It seems like a very straightforward implementation, as all the pieces are already in place.

andrewnester commented 3 months ago

@sgerloff why did you choose DBFS rather than the workspace file system or UC Volumes?

sgerloff commented 3 months ago

@andrewnester The main reason is the use case of Python modules. To enable some amount of local development, we write code in our local IDE. Then we copy the Python code to DBFS and perform an import with

%pip install -e <path to files on DBFS>

Followed by

 %load_ext autoreload
 %autoreload 2

Note the "-e" flag for pip install. It makes it possible to resync the code without reinstalling it, which keeps iterations between local edits and the notebook quick. The issue with the Workspace is that pip install -e is not possible there, as the path seems to be just a symlink, afaiu. Not sure about UC Volumes: how would I write to one, and is it expected to behave differently when importing the code in the notebook?

shreyas-goenka commented 3 months ago

@sgerloff Have you considered using DABs? The databricks bundle commands support building Python wheel files natively, which will likely give you a (mostly) equivalent local development experience without having to manage the sync tooling yourself.

docs: https://docs.databricks.com/en/dev-tools/bundles/python-wheel.html

sgerloff commented 3 months ago

@shreyas-goenka I would need to investigate this more. It does have the sync and watch options we would need, but at first glance it seems to be intended for something else (pipelines and jobs) rather than just pushing code to a volume that can be easily accessed from a Databricks notebook. In fact, it seems like a lot of overhead compared to either our in-house solution db-rocket or a simple bash script using databricks sync --watch.

Is there any concern with allowing databricks sync to upload files to DBFS? Is this a discouraged pattern, or are you just checking whether existing alternatives already cover it?

shreyas-goenka commented 3 months ago

  Is there any concern with allowing databricks sync to upload files to DBFS?

@sgerloff Yes, the concern I have is that it is another file system we will have to maintain for the databricks sync command, and there are significant semantic differences in how notebook metadata is stored in the two that would have to be reconciled. I'd rather avoid adding support if existing products already solve the use case.

Note: Databricks Asset Bundles (available under the databricks bundle command) are a very opinionated way for us to support local development of Databricks projects. They are already GA and widely adopted by customers. I was wondering if they fit your use case, and if not, what the gaps are. Documentation: https://docs.databricks.com/en/dev-tools/bundles/index.html

sgerloff commented 3 months ago

@shreyas-goenka reading up on bundles (which, by the way, seem very cool), it seems they suffer from the same issue:

With the databricks.yml:

bundle:
  name: tripitem-demand

workspace:
  root_path: dbfs:/desired/path/on/dbfs

artifacts:
  dev:
    type: whl
    path: .

And calling databricks bundle sync --watch I get the error message: Error: Path (dbfs:/desired/path/on/dbfs) doesn't start with '/'.

It suffers from the same issue: in the sync module (https://github.com/databricks/cli/blob/37b9df96e6606d0655f1e574d4eebb72eedd9cde/libs/sync/sync.go#L102) the only filer used is the workspace filer. As a result, it is not possible to upload files to DBFS, which is required for the code to be visible from a notebook directly, without a symlink. I might be missing something here. Is there some volume I can write to that is directly visible from a notebook with DABs?
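For illustration, the change might boil down to picking a filer based on the destination scheme at that point in libs/sync/sync.go. This is only a sketch; the constructor names are taken from libs/filer, but I have not verified their exact signatures:

package sync

import (
    "strings"

    "github.com/databricks/cli/libs/filer"
    "github.com/databricks/databricks-sdk-go"
)

// newRemoteFiler is a sketch of scheme-based filer selection: dbfs:/ targets
// would go through the DBFS filer, everything else keeps today's workspace
// filer. Constructor signatures are assumed, not verified.
func newRemoteFiler(w *databricks.WorkspaceClient, remotePath string) (filer.Filer, error) {
    if strings.HasPrefix(remotePath, "dbfs:/") {
        return filer.NewDbfsClient(w, strings.TrimPrefix(remotePath, "dbfs:"))
    }
    return filer.NewWorkspaceFilesClient(w, remotePath)
}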

(Note: I am pretty sure that building the wheel and syncing that would not solve my issue, as it would again not allow for pip install -e, which is necessary to update code used in the notebook without reinstalling the dependency (which resets many Python variables).)

shreyas-goenka commented 3 months ago

@sgerloff With DABs, we do not support writing files / wheel files to DBFS. We recently introduced support for uploading wheel files to UC Volumes, which should work: https://github.com/databricks/cli/pull/1591

Note: In general we recommend using UC volumes instead of DBFS because of better security and governance.
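If the volume route fits, the databricks.yml above would presumably only need its artifact upload path pointed at a volume. A sketch, assuming workspace.artifact_path accepts a /Volumes path as described in that PR (the catalog/schema/volume names are placeholders):

bundle:
  name: tripitem-demand

workspace:
  # upload built artifacts (e.g. the wheel) into a UC volume instead of DBFS
  artifact_path: /Volumes/my_catalog/my_schema/my_volume/artifacts

artifacts:
  dev:
    type: whl
    path: .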

  (Note: I am pretty sure that building the wheel and syncing that would not solve my issue, as it would again not allow for pip install -e, which is necessary to update code used in the notebook without reinstalling the dependency (which resets many Python variables).)

@andrewnester How do we recommend users work around this problem today? Is it by incrementing the wheel version number?

sgerloff commented 3 months ago

@andrewnester the requirements for my use case are:

  1. sync python files to some volume accessible to databricks notebooks (with the watch feature)
  2. change python code on the local file system -> triggers sync updating the files on the databricks volume
  3. reexecute some python command using the synced module that uses the updated python code and get the new behavior without reinstalling or calling dbutils.library.restartPython()

andrewnester commented 3 months ago

I think using DABs for just syncing files might be overkill. Generally we recommend using UC Volumes or the workspace file system (WSFS) for uploading your files, because they offer better permission control compared to DBFS.

That being said, it is unlikely that we will add DBFS support to sync; I suggest either using the fs cp command or UC Volumes / WSFS.

  sync python files to some volume accessible to databricks notebooks (with the watch feature)

You can sync it to WSFS with databricks sync

  change python code on the local file system -> triggers sync updating the files on the databricks volume

This can be done with databricks sync and the --watch flag

  reexecute some python command using the synced module that uses the updated python code and get the new behavior without reinstalling or calling dbutils.library.restartPython()

You can use the %pip magic command to install libraries as notebook-scoped, so you can achieve this behaviour: https://docs.databricks.com/en/libraries/notebooks-python-libraries.html. WSFS is accessible from notebooks, so you should be able to reference it in pip install.
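For concreteness, that flow would look roughly like the following, with placeholder paths. On the local machine,

# mirror the local directory into WSFS and keep watching for changes
databricks sync --watch ./python_module /Users/me@example.com/dev/python_module

keeps the directory synced, and a notebook cell along the lines of

%pip install /Workspace/Users/me@example.com/dev/python_module

installs it notebook-scoped from the synced path (a plain, non-editable install).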

sgerloff commented 3 months ago

@andrewnester WSFS is indeed accessible, but it is incompatible with the pip install -e that ensures the module's code can be updated.

It looks like the feature request does not fit your design goals. Would it be feasible to add a new command to the databricks fs group instead? databricks fs cp does work, but it misses the convenience of the sync functionality, so it would be really nice to be able to use databricks fs sync --watch <local path> <remote path including dbfs>. It is somewhat a matter of semantics, but since databricks fs cp is allowed to interact with DBFS, this does seem to be in scope for that command group.

andrewnester commented 3 months ago

  it is incompatible with the pip install -e

Could you please clarify this? What error do you get?

As for the feature itself, we'll have an internal discussion on it and will come back to you with any updates.

sgerloff commented 3 months ago

@andrewnester The error is somewhat obfuscated, but it's easy to reproduce:

  1. Upload any python module to your workspace /Workspace/your/path/to/python_module
  2. Create a notebook and create the first cell:
    %pip install --no-deps -e /Workspace/your/path/to/python_module
    dbutils.library.restartPython()
  3. Then try to import the module: from python_module import <whatever>

You should get a ModuleNotFoundError. The reason seems to be connected to symlink issues with pip install's editable mode.

Two ways to fix this:

  1. If you remove the editable flag from pip install, it works just fine. (But this lacks the required ability to update the underlying code.)
  2. Copy the module to DBFS with !cp -r /Workspace/your/path/to/python_module /dbfs/temp/path/to/python_module and change the install command to pip install -e /dbfs/temp/path/to/python_module, and now it works fine.
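For completeness, the workaround from step 2 laid out as two notebook cells, with the same placeholder paths:

# copy the synced module out of the workspace filesystem onto DBFS
!cp -r /Workspace/your/path/to/python_module /dbfs/temp/path/to/python_module

and, in the next cell,

%pip install -e /dbfs/temp/path/to/python_module

after which the editable install resolves correctly.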

I hope this helps pinpoint the problem.

andrewnester commented 3 months ago

Thanks for the detailed description. I will pass it on to the team owning this functionality and will update this ticket with any new information. In the meantime, relying on fs cp is the only viable option while we explore DBFS support for sync.

sgerloff commented 1 month ago

@andrewnester I suspect that your stance on sync for DBFS has not changed. Does it extend to Unity Catalog volumes? Currently, Databricks seems to advise copying Python code to /Volumes, which is also not supported by the sync command.

andrewnester commented 4 weeks ago

I believe supporting Volumes should be done for the sync command. @pietern what do you think?