Azure-Samples / azure-databricks-mlops-mlflow

Azure Databricks MLOps sample for Python-based source code, using MLflow without MLflow Projects.
MIT License

Azure Databricks MLOps using MLflow

This is a template for MLOps with Python-based source code in Azure Databricks, using MLflow without MLflow Projects.

This template provides the following features:

Problem Summary

Products/Technologies/Languages Used

Architecture

Model Training

Batch Scoring

Individual Components

Getting Started

Prerequisites

Development

  1. git clone https://github.com/Azure-Samples/azure-databricks-mlops-mlflow.git
  2. cd azure-databricks-mlops-mlflow
  3. Open the cloned repository in a Visual Studio Code Remote Container
  4. Open a terminal in the Remote Container from Visual Studio Code
  5. make install to install the sample packages (taxi_fares and taxi_fares_mlops) locally
  6. make test to run the unit tests locally

Package

  1. make dist to build the ML and MLOps wheel packages (taxi_fares and taxi_fares_mlops) locally

Deployment

  1. make databricks-deploy-code to deploy the Databricks Orchestrator Notebooks and the ML and MLOps Python wheel packages. Run this whenever the code changes.
  2. make databricks-deploy-jobs to deploy the Databricks Jobs. Run this whenever the job specs change.

Run training and batch scoring

  1. To trigger training, execute make run-taxi-fares-model-training
  2. To trigger batch scoring, execute make run-taxi-fares-batch-scoring

NOTE: The Databricks environment must exist before deploying and running. To create a demo environment, follow the Demo chapter.
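The Development, Package, Deployment and Run steps above can be chained from a small script. The following is a minimal sketch, not part of the repository: the helper name run_workflow and its dry_run flag are hypothetical, and the only things taken from this README are the make target names, in the order the sections list them. It assumes make is on the PATH and that the script is started from the repository root.

```python
import shlex
import subprocess
from typing import List, Sequence

# Make targets named in this README, in the order the sections list them.
WORKFLOW: Sequence[str] = (
    "install",
    "test",
    "dist",
    "databricks-deploy-code",
    "databricks-deploy-jobs",
    "run-taxi-fares-model-training",
    "run-taxi-fares-batch-scoring",
)


def run_workflow(targets: Sequence[str] = WORKFLOW, dry_run: bool = True) -> List[str]:
    """Run each make target in order, stopping at the first failure.

    With dry_run=True (hypothetical flag) the commands are only collected
    and returned, nothing is executed.
    """
    commands = []
    for target in targets:
        command = ["make", target]
        commands.append(shlex.join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later targets never run after a failed one.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Print the plan only; pass dry_run=False to actually invoke make.
    for line in run_workflow(dry_run=True):
        print(line)
```

Stopping at the first failure keeps a failed unit test or build from being followed by a deployment. The deployment and run targets still need an existing Databricks environment.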

Observability

Check logs, create alerts, and so on in Application Insights. The following are a few sample Kusto queries for checking logs, traces, exceptions, etc.

To correlate dependencies, exceptions and traces, operation_Id can be used as a filter in the Kusto queries above.
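As an illustration of filtering by operation_Id, the sketch below builds one Kusto query that unions the standard Application Insights tables dependencies, exceptions and traces and keeps only the rows of a single operation. The function name build_correlation_query is hypothetical, and the query is an assumption about what such a filter could look like, not one of the sample queries from this repository.

```python
import re
from typing import Sequence

# Standard Application Insights tables that carry an operation_Id column.
DEFAULT_TABLES = ("dependencies", "exceptions", "traces")


def build_correlation_query(
    operation_id: str, tables: Sequence[str] = DEFAULT_TABLES
) -> str:
    """Return a Kusto query selecting all rows that share one operation_Id."""
    # Reject anything that could break out of the quoted string literal.
    if not re.fullmatch(r"[A-Za-z0-9\-|._]+", operation_id):
        raise ValueError(f"unexpected characters in operation id: {operation_id!r}")
    return (
        f"union {', '.join(tables)}\n"
        f"| where operation_Id == '{operation_id}'\n"
        "| order by timestamp asc"
    )


if __name__ == "__main__":
    # "abc123" is a placeholder; use an operation_Id taken from a real log row.
    print(build_correlation_query("abc123"))
```

The printed text can be pasted into the Logs view of the Application Insights resource. Ordering by timestamp puts the trace, dependency and exception rows of one run in sequence.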

Demo

  1. Create Databricks workspace, a storage account (Azure Data Lake Storage Gen2) and Application Insights
    1. Create an Azure Account
    2. Deploy resources from custom ARM template
  2. Initialize Databricks (create cluster, base workspace, mlflow experiment, secret scope)
    1. Get Databricks CLI Host and Token
    2. Authenticate the Databricks CLI: make databricks-authenticate
    3. Execute make databricks-init
  3. Create an Azure Data Lake Storage Gen2 container and upload data
    1. Create an Azure Data Lake Storage Gen2 container named taxifares
    2. Upload the taxi-fares data files as blobs into the taxifares container
  4. Store the secrets needed to mount ADLS Gen2 storage using the shared access key
    1. Get the name of the Azure Data Lake Storage Gen2 account created in step 1
    2. Get the shared key for the Azure Data Lake Storage Gen2 account
    3. Execute make databricks-secrets-put to put the secret in the Databricks secret scope
  5. Put the Application Insights key as a secret in the Databricks secret scope (optional)
    1. Get the key of the Application Insights resource created in step 1
    2. Execute make databricks-add-app-insights-key to put the secret in the Databricks secret scope
  6. Package and deploy into Databricks (Databricks Jobs, Orchestrator Notebooks, ML and MLOps Python wheel packages)
    1. Execute make deploy
  7. Run Databricks Jobs
    1. To trigger training, execute make run-taxifares-model-training
    2. To trigger batch scoring, execute make run-taxifares-batch-scoring
  8. Expected results
    1. Azure resources
    2. Databricks jobs
    3. Databricks mlflow experiment
    4. Databricks mlflow model registry
    5. Output of batch scoring

Additional Details

  1. Continuous Integration (CI) & Continuous Deployment (CD)
  2. Registered Models Stages and Transitioning

Related resources

  1. Azure Databricks
  2. MLflow
  3. MLflow Project
  4. Run MLflow Projects on Azure Databricks
  5. Databricks Widgets
  6. Databricks Notebook-scoped Python libraries
  7. Databricks CLI
  8. Azure Data Lake Storage Gen2
  9. Application Insights
  10. Kusto Query Language

Glossaries

  1. Application developer: a role that works mainly on operationalizing machine learning.
  2. Data scientist: a role that performs the data science parts of the project.

Contributors