databricks / run-notebook

Apache License 2.0
49 stars 19 forks source link

databricks/run-notebook v0

Overview

Given a Databricks notebook and cluster specification, this Action runs the notebook as a one-time Databricks Job run (docs: AWS | Azure | GCP) and awaits its completion:

You can use this Action to trigger code execution on Databricks for CI (e.g. on pull requests) or CD (e.g. on pushes to master).

Prerequisites

To use this Action, you need a Databricks REST API token to trigger notebook execution and await completion. The API token must be associated with a principal with the following permissions:

We recommend that you store the Databricks REST API token in GitHub Actions secrets to pass it into your GitHub Workflow. The following section lists recommended approaches for token creation by cloud.

Note: we recommend that you do not run this Action against workspaces with IP restrictions. GitHub-hosted action runners have a wide range of IP addresses, making it difficult to whitelist.

AWS

For security reasons, we recommend creating and using a Databricks service principal API token. You can create a service principal, grant the Service Principal token usage permissions, and generate an API token on its behalf.

Azure

For security reasons, we recommend using a Databricks service principal AAD token.

Create an Azure Service Principal

Here are two ways that you can create an Azure Service Principal.

The first way is via the Azure Portal UI. See the Azure Databricks documentation. Record the Application (client) Id, Directory (tenant) Id, and client secret values generated by the steps.

The second way is via the Azure CLI. You can follow the instructions below:

From the resulting JSON output, record the following values:

After you create an Azure Service Principal, you should add it to your Azure Databricks workspace using the SCIM API. Use the client or application Id of your service principal as the applicationId of the service principal in the add-service-principal payload.

Use the Service Principal in your GitHub Workflow

GCP

For security reasons, we recommend inviting a service user to your Databricks workspace and using their API token. You can invite a service user to your workspace, log into the workspace as the service user, and create a personal access token to pass into your GitHub Workflow.

Usage

See action.yml for the latest interface and docs.

(Recommended) Run notebook within a temporary checkout of the current Repo

The workflow below runs a notebook as a one-time job within a temporary repo checkout, enabled by specifying the git-commit, git-branch, or git-tag parameter. You can use this to run notebooks that depend on other notebooks or files (e.g. Python modules in .py files) within the same repo.

name: Run a notebook within its repo on PRs

on:
  pull_request

env:
  DATABRICKS_HOST: https://adb-XXXX.XX.azuredatabricks.net

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checks out the repo
        uses: actions/checkout@v2
      # The step below does the following:
      # 1. Sends a POST request to generate an Azure Active Directory token for an Azure service principal
      # 2. Parses the token from the request response and then saves that in as DATABRICKS_TOKEN in the
      # GitHub enviornment.
      # Note: if the API request fails, the request response json will not have an "access_token" field and
      # the DATABRICKS_TOKEN env variable will be empty.
      - name: Generate and save AAD Token
        run: |
          echo "DATABRICKS_TOKEN=$(curl -X POST -H 'Content-Type: application/x-www-form-urlencoded' \
            https://login.microsoftonline.com/${{ secrets.AZURE_SP_TENANT_ID }}/oauth2/v2.0/token \
            -d 'client_id=${{ secrets.AZURE_SP_APPLICATION_ID }}' \
            -d 'grant_type=client_credentials' \
            -d 'scope=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d%2F.default' \
            -d 'client_secret=${{ secrets.AZURE_SP_CLIENT_SECRET }}' |  jq -r  '.access_token')" >> $GITHUB_ENV
      - name: Trigger model training notebook from PR branch
        uses: databricks/run-notebook@v0
        with:
          local-notebook-path: notebooks/deployments/MainNotebook
          # If the current workflow is triggered from a PR,
          # run notebook code from the PR's head commit, otherwise use github.sha.
          git-commit: ${{ github.event.pull_request.head.sha || github.sha }}
          # The cluster JSON below is for Azure Databricks. On AWS and GCP, set
          # node_type_id to an appropriate node type, e.g. "i3.xlarge" for
          # AWS or "n1-highmem-4" for GCP
          new-cluster-json: >
            {
              "num_workers": 1,
              "spark_version": "11.3.x-scala2.12",
              "node_type_id": "Standard_D3_v2"
            }
          # Grant all users view permission on the notebook results
          access-control-list-json: >
            [
              {
                "group_name": "users",
                "permission_level": "CAN_VIEW"
              }
            ]

Run a self-contained notebook

The workflow below runs a self-contained notebook as a one-time job.

Python library dependencies are declared in the notebook itself using notebook-scoped libraries (AWS | Azure | GCP)

name: Run a notebook in the current repo on PRs

on:
  pull_request

env:
  DATABRICKS_HOST: https://adb-XXXX.XX.azuredatabricks.net

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repo
        uses: actions/checkout@v2
      # The step below does the following:
      # 1. Sends a POST request to generate an Azure Active Directory token for an Azure service principal
      # 2. Parses the token from the request response and then saves that in as DATABRICKS_TOKEN in the
      # GitHub enviornment.
      # Note: if the API request fails, the request response json will not have an "access_token" field and
      # the DATABRICKS_TOKEN env variable will be empty.
      - name: Generate and save AAD Token
        run: |
          echo "DATABRICKS_TOKEN=$(curl -X POST -H 'Content-Type: application/x-www-form-urlencoded' \
            https://login.microsoftonline.com/${{ secrets.AZURE_SP_TENANT_ID }}/oauth2/v2.0/token \
            -d 'client_id=${{ secrets.AZURE_SP_APPLICATION_ID }}' \
            -d 'grant_type=client_credentials' \
            -d 'scope=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d%2F.default' \
            -d 'client_secret=${{ secrets.AZURE_SP_CLIENT_SECRET }}' |  jq -r  '.access_token')" >> $GITHUB_ENV
      - name: Trigger notebook from PR branch
        uses: databricks/run-notebook@v0
        with:
          local-notebook-path: notebooks/MainNotebook.py
          # Alternatively, specify an existing-cluster-id to run against an existing cluster.
          # The cluster JSON below is for Azure Databricks. On AWS and GCP, set
          # node_type_id to an appropriate node type, e.g. "i3.xlarge" for
          # AWS or "n1-highmem-4" for GCP
          new-cluster-json: >
            {
              "num_workers": 1,
              "spark_version": "11.3.x-scala2.12",
              "node_type_id": "Standard_D3_v2"
            }
          # Grant all users view permission on the notebook results, so that they can
          # see the result of our CI notebook 
          access-control-list-json: >
            [
              {
                "group_name": "users",
                "permission_level": "CAN_VIEW"
              }
            ]

Run a notebook using library dependencies in the current repo and on PyPI

In the workflow below, we build Python code in the current repo into a wheel, use upload-dbfs-temp to upload it to a tempfile in DBFS, then run a notebook that depends on the wheel, in addition to other libraries publicly available on PyPI.

Databricks supports a range of library types, including Maven and CRAN. See the docs (Azure | AWS | GCP) for more information.

name: Run a single notebook on PRs

on:
  pull_request

env:
  DATABRICKS_HOST: https://adb-XXXX.XX.azuredatabricks.net
jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checks out the repo
        uses: actions/checkout@v2
      # The step below does the following:
      # 1. Sends a POST request to generate an Azure Active Directory token for an Azure service principal
      # 2. Parses the token from the request response and then saves that in as DATABRICKS_TOKEN in the
      # GitHub enviornment.
      # Note: if the API request fails, the request response json will not have an "access_token" field and
      # the DATABRICKS_TOKEN env variable will be empty.
      - name: Generate and save AAD Token
        run: |
          echo "DATABRICKS_TOKEN=$(curl -X POST -H 'Content-Type: application/x-www-form-urlencoded' \
            https://login.microsoftonline.com/${{ secrets.AZURE_SP_TENANT_ID }}/oauth2/v2.0/token \
            -d 'client_id=${{ secrets.AZURE_SP_APPLICATION_ID }}' \
            -d 'grant_type=client_credentials' \
            -d 'scope=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d%2F.default' \
            -d 'client_secret=${{ secrets.AZURE_SP_CLIENT_SECRET }}' |  jq -r  '.access_token')" >> $GITHUB_ENV
      - name: Setup python
        uses: actions/setup-python@v2
      - name: Build wheel
        run: |
          python setup.py bdist_wheel
      # Uploads local file (Python wheel) to temporary Databricks DBFS
      # path and returns path. See https://github.com/databricks/upload-dbfs-tempfile
      # for details.
      - name: Upload Wheel
        uses: databricks/upload-dbfs-temp@v0
        with:
          local-path: dist/my-project.whl
        id: upload_wheel
      - name: Trigger model training notebook from PR branch
        uses: databricks/run-notebook@v0
        with:
          local-notebook-path: notebooks/deployments/MainNotebook
          # Install the wheel built in the previous step as a library
          # on the cluster used to run our notebook
          libraries-json: >
            [
              { "whl": "${{ steps.upload_wheel.outputs.dbfs-file-path }}" },
              { "pypi": "mlflow" }
            ]
          # The cluster JSON below is for Azure Databricks. On AWS and GCP, set
          # node_type_id to an appropriate node type, e.g. "i3.xlarge" for
          # AWS or "n1-highmem-4" for GCP
          new-cluster-json: >
            {
              "num_workers": 1,
              "spark_version": "11.3.x-scala2.12",
              "node_type_id": "Standard_D3_v2"
            }
          # Grant all users view permission on the notebook results
          access-control-list-json: >
            [
              {
                "group_name": "users",
                "permission_level": "CAN_VIEW"
              }
            ]

Run notebooks in different Databricks Workspaces

In this example, we supply the databricks-host and databricks-token inputs to each databricks/run-notebook step to trigger notebook execution against different workspaces. The tokens are read from the GitHub repository secrets, DATABRICKS_DEV_TOKEN and DATABRICKS_STAGING_TOKEN and DATABRICKS_PROD_TOKEN.

Note that for Azure workspaces, you simply need to generate an AAD token once and use it across all workspaces.

name: Run a notebook in the current repo on pushes to main

on:
  push
    branches:
      - main

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repo
        uses: actions/checkout@v2
      - name: Trigger notebook in staging
        uses: databricks/run-notebook@v0
        with:
          databricks-host: https://xxx-staging.cloud.databricks.com
          databricks-token: ${{ secrets.DATABRICKS_STAGING_TOKEN }}
          local-notebook-path: notebooks/MainNotebook.py
          # The cluster JSON below is for AWS workspaces. On Azure and GCP, set
          # node_type_id to an appropriate node type, e.g. "Standard_D3_v2" for
          # Azure or "n1-highmem-4" for GCP
          new-cluster-json: >
            {
              "num_workers": 1,
              "spark_version": "11.3.x-scala2.12",
              "node_type_id": "i3.xlarge"
            }
          # Grant users in the "devops" group view permission on the
          # notebook results
          access-control-list-json: >
            [
              {
                "group_name": "devops",
                "permission_level": "CAN_VIEW"
              }
            ]
      - name: Trigger notebook in prod
        uses: databricks/run-notebook@v0
        with:
          databricks-host: https://xxx-prod.cloud.databricks.com
          databricks-token: ${{ secrets.DATABRICKS_PROD_TOKEN }}
          local-notebook-path: notebooks/MainNotebook.py
          # The cluster JSON below is for AWS workspaces. On Azure and GCP, set
          # node_type_id to an appropriate node type, e.g. "Standard_D3_v2" for
          # Azure or "n1-highmem-4" for GCP
          new-cluster-json: >
            {
              "num_workers": 1,
              "spark_version": "11.3.x-scala2.12",
              "node_type_id": "i3.xlarge"
            }
          # Grant users in the "devops" group view permission on the
          # notebook results
          access-control-list-json: >
            [
              {
                "group_name": "devops",
                "permission_level": "CAN_VIEW"
              }
            ]

Troubleshooting

To enable debug logging for Databricks REST API requests (e.g. to inspect the payload of a bad /api/2.0/jobs/runs/submit Databricks REST API request), you can set the ACTIONS_STEP_DEBUG action secret to true. See Step Debug Logs for further details.

License

The scripts and documentation in this project are released under the Apache License, Version 2.0.