sgerloff opened this issue 3 months ago
@sgerloff Thanks for opening the issue. Could you tell us a bit about the use case behind the feature request?
At my company we use a tool implemented in Python that syncs our local code to Databricks for testing and development: https://github.com/getyourguide/db-rocket . This feature seems to be the only blocker to migrating to a databricks-cli native solution for syncing local files with DBFS. In addition, I guess any use case that uses databricks fs cp -r could potentially benefit from a sync operation.
I wouldn't mind contributing this feature myself. It seems like a very straightforward implementation, as all the pieces are already in place.
@sgerloff why did you choose to use DBFS and not the workspace file system or UC volumes?
@andrewnester The main reason is the use case for Python modules. To enable some amount of local development we write code in our local IDE, then copy the Python code to DBFS and import it with

    %pip install -e <path to files on DBFS>

followed by

    %load_ext autoreload
    %autoreload 2

Note the -e flag for pip install. It allows the code to be re-synced without reinstalling it, making iterations between the local IDE and the notebook quicker. The issue with the workspace is that it is not possible to do pip install -e there, as the path seems to be just a symlink, afaiu. I am not sure about UC volumes: how would I write to one, and is it expected to behave differently when importing the code in the notebook?
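To make the workflow concrete, the notebook side of this loop looks roughly like the sketch below (the DBFS path and the imported name are just placeholders):

```
# Cell 1: editable install of the synced module from DBFS
# (placeholder path; wherever the sync tool copied the files)
%pip install -e /dbfs/tmp/dev-sync/python_module

# Cell 2: restart Python once so the fresh editable install is importable
dbutils.library.restartPython()

# Cell 3: enable autoreload so re-synced code is picked up on the
# next cell execution, without reinstalling or restarting again
%load_ext autoreload
%autoreload 2

# Cell 4: any later cell uses the latest synced code
from python_module import some_function  # placeholder import
some_function()
```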
@sgerloff Have you considered using DABs? The databricks bundle commands support building Python wheel files natively, which will likely give you a (mostly) equivalent local development experience without having to manage the sync tooling yourself.
docs: https://docs.databricks.com/en/dev-tools/bundles/python-wheel.html
@shreyas-goenka I would need to investigate this more. It does have the sync and watch options that we would need, but at first glance it seems to be intended for something else (pipelines and jobs) rather than just pushing code to a volume that can easily be accessed from a Databricks notebook. In fact, it seems like a lot of overhead compared to either our in-house solution db-rocket or a simple bash script using databricks sync --watch.

Is there any concern with allowing databricks sync to upload files to DBFS? Is this a discouraged pattern, or are you just trying to make me aware of alternatives?
> Is there any concern with allowing databricks sync to upload files to DBFS?
@sgerloff Yes, the concern I have is that it's another file system we'll have to maintain for the databricks sync command, and there are significant semantic differences between the way notebook metadata is stored in the two that will have to be reconciled. I'd rather avoid adding support if existing products already solve the use case.
Note: Databricks Asset Bundles (available under the databricks bundle command) are a very opinionated way for us to support local development of Databricks projects. They are already GA and widely adopted by customers. I was wondering if they fit your use case, and if not, whether there are any gaps?
documentation: https://docs.databricks.com/en/dev-tools/bundles/index.html
@shreyas-goenka reading up on bundles (which, btw, seem very cool), it does seem to suffer from the same issue. With the following databricks.yml:
    bundle:
      name: tripitem-demand
    workspace:
      root_path: dbfs:/desired/path/on/dbfs
    artifacts:
      dev:
        type: whl
        path: .
And calling databricks bundle sync --watch, I get the error message: Error: Path (dbfs:/desired/path/on/dbfs) doesn't start with '/'.
So it does suffer from the same issue: in the sync module (https://github.com/databricks/cli/blob/37b9df96e6606d0655f1e574d4eebb72eedd9cde/libs/sync/sync.go#L102) the only filer used is the workspace filer. As a result, it is not possible to upload files to DBFS, which is required for them to be visible from a notebook directly, without a symlink. I might be missing something here. Is there some volume I can write to with DABs that is directly visible from a notebook?
(Note: I am pretty sure that building the wheel and syncing that would not solve my issue, as it would again not allow for pip install -e, which is necessary to update the code used in the notebook without reinstalling the dependency, which resets many Python variables.)
@sgerloff With DABs, we do not support writing files / wheel files to DBFS. We recently introduced support for uploading wheel files to UC volumes, which should work. https://github.com/databricks/cli/pull/1591
Note: In general we recommend using UC volumes instead of DBFS because of better security and governance.
> (Note: I am pretty sure that building the wheel and syncing that would not solve my issue, as it would again not allow for pip install -e, which is necessary to update the code used in the notebook without reinstalling the dependency, which resets many Python variables.)
@andrewnester How do we recommend users work around this problem today? Is it by incrementing the wheel version number?
@andrewnester the requirements for my use case are: pip install -e /path/to/synced/module, plus setting up %load_ext autoreload as well as %autoreload 2, so that code changes are picked up without calling dbutils.library.restartPython() (this is why the pip install -e is required, as far as I understand).

I think using DABs for just syncing files might be overkill. Generally we recommend using UC Volumes or the Workspace file system (WSFS) for uploading your files, because they offer better permission control compared to DBFS.
That being said, I anticipate that we are unlikely to add DBFS support to sync, and suggest either using the fs cp command or UC Volumes / WSFS.
> sync python files to some volume accessible to databricks notebooks (with the watch feature)

You can sync them to WSFS with databricks sync.

> change python code on local file system -> triggers sync updating the files on the databricks volume

This can be done with databricks sync and the --watch flag.

> reexecute some python command using the synced module that uses the updated python code and get the new behavior without reinstalling or calling dbutils.library.restartPython()

You can use the %pip magic command to install libraries as notebook-scoped, so you can achieve this behaviour: https://docs.databricks.com/en/libraries/notebooks-python-libraries.html

WSFS is accessible from notebooks, so you should be able to reference it in pip install.
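For example, something along these lines (the workspace path and the imported name are just placeholders):

```
# Cell 1: notebook-scoped install of the synced module from the workspace file system
# (a regular, non-editable install; the /Workspace path is a placeholder)
%pip install /Workspace/Users/someone@example.com/python_module

# Cell 2: restart Python so the freshly installed package is importable
dbutils.library.restartPython()

# Cell 3: use the module
from python_module import some_function  # placeholder import
```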
@andrewnester The WSFS is indeed accessible, but it is incompatible with the pip install -e that ensures the module's code can be updated.
It looks like the feature request does not fit your design goals. Would it be feasible to add a new command to the databricks fs group instead? databricks fs cp does work, but it is missing the convenience of the sync functionality, so it would be really nice to use databricks fs sync --watch <local path> <remote path including dbfs>. It's kind of a matter of semantics, but since databricks fs cp is allowed to interact with DBFS, it does seem to be in scope for that command group.
> it is incompatible with the pip install -e

Could you please clarify this? What error do you get?
As for the feature itself, we'll have an internal discussion on it and will come back to you with any updates.
@andrewnester The error is somewhat obfuscated, but it's easy to reproduce. Sync a Python module to /Workspace/your/path/to/python_module, then run in a notebook:

    %pip install --no-deps -e /Workspace/your/path/to/python_module
    dbutils.library.restartPython()
    from python_module import <whatever>

You should get a ModuleNotFoundError. The reason seems to be connected to symlink issues with pip's editable install mode. Two ways to fix this: copy the module onto DBFS with

    !cp -r /Workspace/your/path/to/python_module /dbfs/temp/path/to/python_module

and update the pip install command to pip install -e /dbfs/temp/path/to/python_module; now it should work fine. I hope this helps pinpoint the problem.
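Put together, the working workaround looks roughly like this (all paths and the imported name are placeholders):

```
# Cell 1: copy the synced module from the workspace file system onto DBFS,
# since an editable install directly from a /Workspace path fails for me
!cp -r /Workspace/your/path/to/python_module /dbfs/temp/path/to/python_module

# Cell 2: editable install from the DBFS copy
%pip install --no-deps -e /dbfs/temp/path/to/python_module

# Cell 3: restart Python once so the editable install is picked up
dbutils.library.restartPython()

# Cell 4: the import now succeeds
from python_module import some_function  # placeholder import
```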
Thanks for the detailed description. I will pass it on to the team owning this functionality and will update this ticket with any new information. In the meantime, relying on fs cp is the only viable option while we're exploring DBFS support for sync.
@andrewnester I suspect that your stance on sync for DBFS has not changed. Does it extend to Unity Catalog volumes? Currently, Databricks seems to advise copying Python code to /Volumes, which is also not supported by the sync command.
I believe support for Volumes should be added to the sync command. @pietern, what do you think?
Describe the issue
The command databricks sync is limited to target directories in the Workspace. This may not be the desirable target location for all syncs. I propose handling targets on DBFS similar to how databricks fs cp does it, extending the range of applications for the sync command.

Steps to reproduce the behavior
databricks sync <local_dir> dbfs:<dbfs dir>
Expected Behavior
Sync files and directories to the specified location on DBFS.
Actual Behavior
The CLI complains about the path, because the sync command only uses the Workspace filer.
OS and CLI version
All
Is this a regression?
No
Debug Logs
Honestly, this is more like a feature request...