This repository contains notebooks & instructions for setting up a demo of a development workflow & CI/CD (on Azure DevOps) using Databricks notebooks and the Repos feature. Testing of notebooks is done using the Nutter library developed by Microsoft.
Two approaches are demonstrated:
- `%run` (doc) - the "main" code is in the notebooks `Code1.py` and `Code2.py`, and the testing code is in `unit-tests/test_with_percent_run.py`.
- Arbitrary files - the "main" code is in the `my_package/code1.py` and `my_package/code2.py` files, and the test is in `unit-tests/test_with_arbitrary_files.py`.

This demo shows how you can use Repos to work on your own copy of notebooks, test them after commit in the "staging" environment, and promote to "production" on successful testing of the `releases` branch.
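To make the testing pattern concrete, here is a minimal sketch of how a Nutter test is structured: Nutter pairs each `run_<case>` method with an `assertion_<case>` method on a fixture class. The `NutterFixture` class below is a tiny local stand-in so the sketch is self-contained; in a real Databricks notebook you would instead `from runtime.nutterfixture import NutterFixture` and report results via `result.exit(dbutils)`.

```python
# Minimal stand-in for Nutter's fixture runner (illustrative only - the
# real one lives in the nutter package as runtime.nutterfixture).
class NutterFixture:
    def execute_tests(self):
        results = {}
        for name in sorted(dir(self)):
            if name.startswith("run_"):
                case = name[len("run_"):]
                getattr(self, name)()  # execute the code under test
                try:
                    getattr(self, "assertion_" + case)()
                    results[case] = "PASSED"
                except AssertionError:
                    results[case] = "FAILED"
        return results

class SampleTestFixture(NutterFixture):
    def run_generate_data(self):
        # in this demo, this step would call into Code1.py / my_package.code1
        self.data = [i * 2 for i in range(10)]

    def assertion_generate_data(self):
        assert len(self.data) == 10

print(SampleTestFixture().execute_tests())  # {'generate_data': 'PASSED'}
```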
There is a possibility of automated setup of this demo using Terraform. Look into the `terraform` folder for existing implementations.
The development workflow is organized as shown on the following image:

(the release branch is called `releases` in this setup - this would allow running different sets of tests when we're preparing the release)

Your Databricks workspace needs to have the Repos functionality enabled. If it's enabled, you should see the "Repos" icon in the navigation panel:
The Azure DevOps setup consists of several steps, described in the next sections. It's assumed that a project in Azure DevOps already exists.
We need to create a personal access token (PAT) that will be used for the execution of tests & updating the repository. This token will be used to authenticate to the Databricks workspace, which will then fetch the configured token to authenticate to the Git provider. We also need to connect the Databricks workspace to the Git provider - usually this is done by using provider-specific access tokens - see the documentation for details of setting up the integration with a specific Git provider (note that when the repository is on Azure DevOps, you still need to generate an Azure DevOps token to make the API work, and also provide the user name in the Git settings!).
:warning: the previous instructions on using Repos + Azure DevOps with service principals weren't correct, so they were removed!
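For reference, registering a Git provider token in the workspace can also be done programmatically via the Git Credentials REST API (`POST /api/2.0/git-credentials`). Below is a minimal sketch that only builds the request; the host, tokens, and user name are placeholders, and `build_git_credential_request` is a hypothetical helper:

```python
import json
import urllib.request

def build_git_credential_request(host, databricks_token, git_provider,
                                 git_username, git_token):
    """Build the POST request that registers a Git provider token."""
    payload = {
        "git_provider": git_provider,       # e.g. "azureDevOpsServices"
        "git_username": git_username,
        "personal_access_token": git_token,
    }
    return urllib.request.Request(
        url=f"{host}/api/2.0/git-credentials",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {databricks_token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_git_credential_request(
    "https://adb-1568830229861029.9.azuredatabricks.net",  # placeholder host
    "dapiXXXXXXXX",                                        # placeholder Databricks PAT
    "azureDevOpsServices",
    "user@example.com",                                    # placeholder user name
    "ado-pat-XXXX")                                        # placeholder ADO token
# urllib.request.urlopen(req) would perform the actual call
print(req.get_method(), req.full_url)
```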
Because we have several pipelines, it makes sense to define a variable group to store the data necessary for the execution of tests & deployment of the code. We need the following configuration properties for the execution of our pipelines:

- `databricks_host` - the URL of your workspace where tests will be executed (host name with `https://`, without `?o=`, and without a trailing slash character. For example: `https://adb-1568830229861029.9.azuredatabricks.net`).
- `databricks_token` - personal access token for executing commands against the workspace. Mark this variable as private!
- `cluster_id` - the ID of the cluster where tests will be executed. DBR 9.1+ should be used to support arbitrary files.
- `staging_directory` - the directory of the staging checkout that we created above. For example, `/Repos/Staging/databricks-nutter-repos-demo`.

The name of the variable group is used in `azure-pipelines.yml`. By default its name is "Nutter Testing". Change `azure-pipelines.yml` if you use another name for the variable group.
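As an illustration, the variable group might be consumed in a pipeline definition roughly like this (the test step below is a hypothetical sketch, not the repo's actual `azure-pipelines.yml`; consult that file for the real steps):

```yaml
variables:
- group: 'Nutter Testing'

steps:
- script: |
    pip install nutter
    nutter run "$(staging_directory)/unit-tests/" $(cluster_id) --recursive
  env:
    DATABRICKS_HOST: $(databricks_host)
    # secret variables are not mapped automatically - pass the token explicitly
    DATABRICKS_TOKEN: $(databricks_token)
  displayName: 'Execute Nutter tests'
```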
Azure DevOps can work with GitHub repositories as well - see documentation for more details on how to link DevOps with GitHub.
```shell
python -m pip install --upgrade databricks-cli
databricks repos update --path /Repos/Production/databricks-nutter-repos-demo --branch releases
```
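Under the hood, `databricks repos update` calls the Repos REST API (`PATCH /api/2.0/repos/{repo_id}`) with the target branch. A minimal sketch of building that request (host, token, and repo ID are placeholders, and `build_repos_update_request` is a hypothetical helper):

```python
import json
import urllib.request

def build_repos_update_request(host, token, repo_id, branch):
    """Build the PATCH request that switches a Databricks Repo to a branch."""
    return urllib.request.Request(
        url=f"{host}/api/2.0/repos/{repo_id}",
        data=json.dumps({"branch": branch}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )

req = build_repos_update_request(
    "https://adb-1568830229861029.9.azuredatabricks.net",  # placeholder host
    "dapiXXXXXXXX",                                        # placeholder PAT
    123456789,                                             # placeholder repo ID
    "releases",
)
# urllib.request.urlopen(req) would perform the actual update
print(req.get_method(), req.full_url)
```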
- a variable `DATABRICKS_TOKEN` with value `$(DATABRICKS_TOKEN)` - this will pull it from the variable group into the script's execution context
- the `releases` branch
After all of this is done, the release pipeline will be automatically executed on every successful build on the `releases` branch.
We need to create a personal access token (PAT) that will be used for the execution of tests & updating the repository. This token will be used to authenticate to the Databricks workspace, which will then fetch the configured token to authenticate to the Git provider. We also need to connect the Databricks workspace to the Git provider - usually this is done by using provider-specific access tokens - see the documentation for details of setting up the integration with a specific Git provider.
Create dev, stage and prod environments in the GitHub settings. With environments it is easy to use the same variable names and secret names across different environments.
Create the following properties within each environment:

- `databricks_host` - the URL of your workspace where tests will be executed (host name with `https://`, without `?o=`, and without a trailing slash character. For example: `https://adb-1568830229861029.9.azuredatabricks.net`).
- `databricks_token` - personal access token for executing commands against the workspace. Create this as a secret variable!
- `cluster_id` - the ID of the cluster where tests will be executed. DBR 9.1+ should be used to support arbitrary files.
- `repo_directory` - the directory of the checkout for the specific environment. For example, `/Repos/Staging/databricks-nutter-repos-demo`.
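As an illustration, a job in a GitHub Actions workflow might bind to one of these environments roughly like this (job name and steps are hypothetical - see the repository's actual workflow files for the real definition; whether the non-secret properties live in the `vars` or `secrets` context depends on how you created them):

```yaml
jobs:
  nutter-tests:
    runs-on: ubuntu-latest
    environment: stage    # picks up that environment's variables & secrets
    steps:
      - run: |
          pip install nutter
          nutter run "${{ vars.repo_directory }}/unit-tests/" ${{ vars.cluster_id }} --recursive
        env:
          DATABRICKS_HOST: ${{ vars.databricks_host }}
          DATABRICKS_TOKEN: ${{ secrets.databricks_token }}
```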
The workflow is the same as above and the pipeline looks as follows:
This often happens when you're trying to use `databricks repos update` on a workspace that has IP Access Lists enabled. The error message is misleading, and will be fixed by this pull request.
This usually happens when you're trying to run a CI/CD pipeline against a Databricks workspace with IP Access Lists enabled, and the CI/CD server is not in the allow list.
To perform operations on Repos (update, etc.) we need to associate a Git token with an identity that performs that operation. Please see the following documentation: