databrickslabs / dbx

🧱 Databricks CLI eXtensions - aka dbx is a CLI tool for development and advanced Databricks workflows management.
https://dbx.readthedocs.io

sync workflow with git, remote notebook and local IDE #654

Open prise6 opened 1 year ago

prise6 commented 1 year ago

Hello,

As documented in the "mixed-mode dev loop for python project" section of the official docs, my workflow is the following:

A simplified tree of the project:

.
├── src/           # <-- package dir
├── notebooks/     # <-- notebook dir
└── ...
  1. I use my local IDE to write my functions/classes as a Python package.
  2. I use dbx sync to update this package and make it available in the remote Repo (see the command sketch after this list).
  3. I use notebooks in this Repo to call my package and test my code interactively.
     3b. I write new code in the notebooks to explore data and so on.
  4. I use a local pre-commit hook to check the quality of my code before pushing to the remote git repo.
  5. I stop dbx sync.
  6. I pull the final changes from the remote git repo into the Databricks Repo using the git interface.
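
For context, step 2 roughly corresponds to running something like this from the project root. This is only a sketch: the repo name is just an example, and the exact flags are described in the dbx sync reference:

dbx sync repo -d my-databricks-repo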

At this final 6th step, here is my issue: I get conflicts because dbx sync has already updated the target files (which is expected).

My workaround is to discard the changes before pulling. BUT I want to keep the notebooks I updated in the Databricks workspace, so I go file by file and discard everything except the notebooks.

My questions:

One solution is to commit the notebooks first and then discard the remaining changes, but is there a better approach?

alexeyegorov commented 1 year ago

Hi @prise6, I think I use the same strategy as you, with one minor difference: I try to keep writing the notebook code in my IDE as well. Hitting save is enough for the code to be updated automatically on the remote, and I then run it and check it interactively. From time to time I do find myself making quick changes in the notebook directly in Databricks; in that case the easiest thing for me is to copy those changes back to my local copy.

When I am done developing the feature, I use a job to execute it instead of pulling the changes back into the same folder (I still do this manually at the moment, but hope to automate it in the future). That said, I think discarding the changes in the Databricks Repo, as you describe, could be the way to go.
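
For what it's worth, "use a job to execute it" on my side roughly means the usual deploy/launch pair. The workflow name below is only a placeholder, and the exact invocation depends on your deployment file and dbx version:

dbx deploy my-workflow
dbx launch my-workflow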

matthayes commented 1 year ago

I follow the same development process as @alexeyegorov, and it has worked pretty well for me. In addition, I typically use a repo in Databricks that is not linked to a remote git repository, since I am syncing from my local copy anyway. You can set this up by unchecking "Create repo by cloning a Git repository" in the "Add Repo" dialog. You can then keep a second repo that is linked to the git repository, into which you pull all the changes for testing.

If you do want to keep developing with only a single repo but without the conflict troubles, there is another option you could consider, though it takes a little more setup. The dbx sync reference has some details about using dbx sync dbfs. You can add a cell like the one below near the top of your notebook, before you import other modules. This causes your code to be imported from the DBFS path instead of the repo, because the DBFS path appears first in the sys.path list. The Python modules you edit then sync only to DBFS and not to the repo, so you won't run into conflicts. This was the dev process I used when I first developed the sync command, but these days I find it easiest to just sync to the unlinked repo with dbx sync repo as described above.

import sys

# Put the DBFS sync target ahead of the repo on sys.path so imports
# resolve to the modules synced by dbx sync dbfs rather than the repo copy.
if "/dbfs/tmp/users/first.last/myrepo" not in sys.path:
    sys.path.insert(0, "/dbfs/tmp/users/first.last/myrepo")
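
The DBFS path in that cell is the kind of destination that dbx sync dbfs writes to. As a rough sketch, running something like the following from the project root keeps that path up to date (the default destination under /dbfs/tmp/users/<your-user> and the exact options are described in the dbx sync reference):

dbx sync dbfs --source .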