databricks / mlops-stacks

This repo provides a customizable stack for starting new ML projects on Databricks that follow production best-practices out of the box.
https://docs.databricks.com/en/dev-tools/bundles/mlops-stacks.html
Apache License 2.0

Databricks MLOps Stacks

NOTE: This feature is in public preview.


Using Databricks MLOps Stacks, data scientists can quickly start iterating on ML code for new projects while ops engineers set up CI/CD and ML resource management, with an easy transition to production. You can also use MLOps Stacks as a building block in automation for creating new data science projects with production-grade CI/CD pre-configured. More information can be found at https://docs.databricks.com/en/dev-tools/bundles/mlops-stacks.html.

The default stack in this repo includes three modular components:

| Component | Description | Why it's useful |
| --- | --- | --- |
| ML Code | Example ML project structure (training, batch inference, etc.), with unit-tested Python modules and notebooks | Quickly iterate on ML problems without worrying about refactoring your code into tested modules for productionization later on. |
| ML Resources as Code | ML pipeline resources (training and batch inference jobs, etc.) defined through Databricks CLI bundles | Govern, audit, and deploy changes to your ML resources (e.g. "use a larger instance type for automated model retraining") through pull requests, rather than ad hoc changes made via the UI. |
| CI/CD (GitHub Actions or Azure DevOps) | GitHub Actions or Azure DevOps workflows to test and deploy ML code and resources | Ship ML code faster and with confidence: ensure all production changes are performed through automation and that only tested code is deployed to prod. |

See the FAQ for questions on common use cases.

ML pipeline structure and development loops

An ML solution comprises data, code, and models. These resources need to be developed, validated (staging), and deployed (production). In this repository, we use the notion of dev, staging, and prod to represent the execution environments of each stage.

An instantiated project from MLOps Stacks contains an ML pipeline with CI/CD workflows to test and deploy automated model training and batch inference jobs across your dev, staging, and prod Databricks workspaces.

Data scientists can iterate on ML code and file pull requests (PRs). This will trigger unit tests and integration tests in an isolated staging Databricks workspace. Model training and batch inference jobs in staging will immediately update to run the latest code when a PR is merged into main. After merging a PR into main, you can cut a new release branch as part of your regularly scheduled release process to promote ML code changes to production.

Develop ML pipelines

https://github.com/databricks/mlops-stacks/assets/87999496/00eed790-70f4-4428-9f18-71771051f92a

Create a PR and CI

https://github.com/databricks/mlops-stacks/assets/87999496/f5b3c82d-77a5-4ee5-85f5-8f00b026ae05

Merge the PR and deploy to Staging

https://github.com/databricks/mlops-stacks/assets/87999496/7239e4d0-2327-4d30-91cc-5e7f8328ef73

https://github.com/databricks/mlops-stacks/assets/87999496/013c0d32-c283-494b-8c3f-2a9a60366207

Deploy to Prod

https://github.com/databricks/mlops-stacks/assets/87999496/0d220d55-465e-4a69-bd83-1e66ad2e8464

See this page for detailed description and diagrams of the ML pipeline structure defined in the default stack.

Using MLOps Stacks

Prerequisites

The Databricks CLI includes Databricks asset bundle templates, which are used to create new projects.

Please follow the instructions to install and set up the Databricks CLI. Releases of the Databricks CLI can be found in the releases section of the databricks/cli repository.

Databricks asset bundles and Databricks asset bundle templates are in public preview.

Start a new project

To create a new project, run:

databricks bundle init mlops-stacks

This will prompt for initialization parameters. Some of these parameters are required to get started, others must be correctly specified for CI/CD to work, and the rest are used for project initialization.

See the generated README.md for next steps!

Customize MLOps Stacks

Your organization can use the default stack as is or customize it as needed, e.g. to add/remove components or adapt individual components to fit your organization's best practices. See the stack customization guide for more details.

FAQ

Do I need separate dev/staging/prod workspaces to use MLOps Stacks?

We recommend using separate dev/staging/prod Databricks workspaces for stronger isolation between environments. For example, Databricks REST API rate limits are applied per-workspace, so if using Databricks Model Serving, using separate workspaces can help prevent high load in staging from DOSing your production model serving endpoints.

However, you can create a single-workspace stack by supplying the same workspace URL for input_databricks_staging_workspace_host and input_databricks_prod_workspace_host. If you go this route, we recommend using different service principals to manage staging and prod resources, so that CI workloads run in staging cannot interfere with production resources.
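
As a rough illustration of the single-workspace setup, the sketch below writes a partial JSON config in which both host parameters point at the same workspace. The workspace URL and the file name are hypothetical placeholders, and the other initialization parameters are deliberately left out, so this is only a starting point and not a complete config.

```python
import json
from pathlib import Path

# Hypothetical workspace URL; replace it with your own workspace host.
WORKSPACE_HOST = "https://my-workspace.cloud.databricks.com"

# Only the two parameters named above are set; every other parameter
# is omitted from this sketch and still has to be supplied.
config = {
    "input_databricks_staging_workspace_host": WORKSPACE_HOST,
    "input_databricks_prod_workspace_host": WORKSPACE_HOST,
}

# Hypothetical file name for the generated config.
config_path = Path("single-workspace-config.json")
config_path.write_text(json.dumps(config, indent=2) + "\n")
```

A file like this could then be passed to databricks bundle init through the --config-file option, in the same way as the example project configs used for previewing changes.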

I have an existing ML project. Can I productionize it using MLOps Stacks?

Yes. Currently, you can instantiate a new project and copy relevant components into your existing project to productionize it. MLOps Stacks is modularized, so you can e.g. copy just the GitHub Actions workflows under .github or ML resource configs under {{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/resources and {{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/databricks.yml into your existing project.

Can I adopt individual components of MLOps Stacks?

For this use case, we recommend instantiating via Databricks asset bundle templates and copying the relevant subdirectories. For example, all ML resource configs are defined under {{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/resources and {{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/databricks.yml, while CI/CD is defined e.g. under .github if using GitHub Actions, or under .azure if using Azure DevOps.
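
The copy step described in the two answers above could be scripted along the following lines. This is a minimal sketch, not part of MLOps Stacks: the function name and its arguments are hypothetical, and it assumes the generated project follows the layout described above (CI/CD directory at the root, resources and databricks.yml under the project directory).

```python
import shutil
from pathlib import Path


def copy_stack_components(generated_root: Path, project_dir: str,
                          target_root: Path, cicd_dir: str = ".github") -> list[Path]:
    """Copy CI/CD workflows and ML resource configs into an existing project.

    generated_root: root directory of a freshly instantiated project
    project_dir:    name of the project subdirectory inside that root
    target_root:    root directory of the existing project
    cicd_dir:       ".github" for GitHub Actions, ".azure" for Azure DevOps
    """
    candidates = [
        (generated_root / cicd_dir, target_root / cicd_dir),
        (generated_root / project_dir / "resources",
         target_root / project_dir / "resources"),
        (generated_root / project_dir / "databricks.yml",
         target_root / project_dir / "databricks.yml"),
    ]
    copied = []
    for src, dst in candidates:
        if not src.exists():
            continue  # component was not generated, e.g. a different CI/CD platform
        dst.parent.mkdir(parents=True, exist_ok=True)
        if src.is_dir():
            # Merge into an existing directory instead of failing if it exists.
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

Files with the same name in the existing project are overwritten, so it is worth running this on a clean branch and reviewing the resulting diff before committing.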

Can I customize my MLOps Stack?

Yes. We provide the default stack in this repo as a production-friendly starting point for MLOps. However, in many cases you may need to customize the stack to match your organization's best practices. See the stack customization guide for details on how to do this.

Does MLOps Stacks cover data (ETL) pipelines?

Since MLOps Stacks is based on Databricks CLI bundles, it is not limited to ML workflows and resources; it works for resources across the Databricks Lakehouse. For instance, while the existing ML code samples contain feature engineering, training, model validation, deployment, and batch inference workflows, you can use it for Delta Live Tables pipelines as well.

How can I provide feedback?

Please provide feedback (bug reports, feature requests, etc) via GitHub issues.

Contributing

We welcome community contributions. For substantial changes, we ask that you first file a GitHub issue to facilitate discussion, before opening a pull request.

MLOps Stacks is implemented as a Databricks asset bundle template that generates new projects given user-supplied parameters. Parametrized project code can be found under the {{.input_root_dir}} directory.

Installing development requirements

To run tests, install actionlint, databricks CLI, npm, and act, then install the Python dependencies listed in dev-requirements.txt:

pip install -r dev-requirements.txt

Running the tests

NOTE: This section is for open-source developers contributing to the default stack in this repo. If you are working on an ML project using the stack (e.g. if you ran databricks bundle init to start a new project), see the README.md within your generated project directory for detailed instructions on how to make and test changes.

Run unit tests:

pytest tests

Run all tests (unit and slower integration tests):

pytest tests --large

Run integration tests only:

pytest tests --large-only
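
For contributors wondering how flags such as --large and --large-only can be wired into pytest, here is one possible conftest.py sketch. It is an assumption for illustration and may differ from this repo's actual implementation; in particular, the `large` marker name and the `should_run` helper are hypothetical.

```python
# conftest.py (illustrative sketch; the marker name "large" is an assumption)


def should_run(is_large: bool, large: bool, large_only: bool) -> bool:
    """Return True if a test should be kept under the given flags."""
    if large_only:
        return is_large      # integration tests only
    if large:
        return True          # unit and integration tests
    return not is_large      # default: unit tests only


def pytest_addoption(parser):
    parser.addoption("--large", action="store_true", default=False,
                     help="also run slower integration tests")
    parser.addoption("--large-only", action="store_true", default=False,
                     help="run only the slower integration tests")


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "large: slower integration test")


def pytest_collection_modifyitems(config, items):
    large = config.getoption("--large")
    large_only = config.getoption("--large-only")
    selected, deselected = [], []
    for item in items:
        is_large = item.get_closest_marker("large") is not None
        bucket = selected if should_run(is_large, large, large_only) else deselected
        bucket.append(item)
    if deselected:
        # Report the dropped tests as deselected and keep the rest.
        config.hook.pytest_deselected(items=deselected)
        items[:] = selected
```

Under this sketch, a test opts into the slower group by being decorated with the `large` marker, and the three commands above map onto the three branches of `should_run`.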

Previewing changes

When making changes to MLOps Stacks, it can be convenient to see how those changes affect a generated new ML project. To do this, you can create an example project from your local checkout of the repo, and inspect its contents/run tests within the project.

We provide example project configs for Azure (using both GitHub and Azure DevOps), AWS (using GitHub), and GCP (using GitHub) under tests/example-project-configs. To create an example Azure project, using Azure DevOps as the CI/CD platform, run the following from the desired parent directory of the example project:

# Note: update MLOPS_STACKS_PATH to the path to your local checkout of the MLOps Stacks repo
MLOPS_STACKS_PATH=~/mlops-stacks
databricks bundle init "$MLOPS_STACKS_PATH" --config-file "$MLOPS_STACKS_PATH/tests/example-project-configs/azure/azure-devops.json"

To create an example AWS project, using GitHub Actions for CI/CD, run:

# Note: update MLOPS_STACKS_PATH to the path to your local checkout of the MLOps Stacks repo
MLOPS_STACKS_PATH=~/mlops-stacks
databricks bundle init "$MLOPS_STACKS_PATH" --config-file "$MLOPS_STACKS_PATH/tests/example-project-configs/aws/aws-github.json"

To create an example GCP project, using GitHub Actions for CI/CD, run:

# Note: update MLOPS_STACKS_PATH to the path to your local checkout of the MLOps Stacks repo
MLOPS_STACKS_PATH=~/mlops-stacks
databricks bundle init "$MLOPS_STACKS_PATH" --config-file "$MLOPS_STACKS_PATH/tests/example-project-configs/gcp/gcp-github.json"
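
If you regenerate example projects often, the three commands above can be assembled from one small helper. The sketch below only builds and prints the commands; the helper name and the (cloud, CI/CD) keys are hypothetical, while the config paths are the ones used in the commands above.

```python
import shlex
from pathlib import Path

# Config files from the commands above, relative to tests/example-project-configs.
EXAMPLE_CONFIGS = {
    ("azure", "azure-devops"): "azure/azure-devops.json",
    ("aws", "github"): "aws/aws-github.json",
    ("gcp", "github"): "gcp/gcp-github.json",
}


def bundle_init_command(stacks_path: Path, cloud: str, cicd: str) -> list[str]:
    """Build the `databricks bundle init` argument list for one example config."""
    config = (stacks_path / "tests" / "example-project-configs"
              / EXAMPLE_CONFIGS[(cloud, cicd)])
    return ["databricks", "bundle", "init", str(stacks_path),
            "--config-file", str(config)]


if __name__ == "__main__":
    # Update this path to your local checkout of the MLOps Stacks repo.
    stacks = Path.home() / "mlops-stacks"
    for cloud, cicd in EXAMPLE_CONFIGS:
        print(shlex.join(bundle_init_command(stacks, cloud, cicd)))
```

Each printed line can be run from the desired parent directory of the example project, or the argument list can be handed to subprocess.run with a suitable working directory.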