Kedro plugin to develop Kedro pipelines for Databricks. This plugin strives to provide the ultimate developer experience when using Kedro on Databricks. The plugin provides three main features:

1. **Initialization**: `kedro databricks init` sets up a Databricks Asset Bundle configuration for your Kedro project.
2. **Generation**: `kedro databricks bundle` generates Asset Bundle resources definitions from your Kedro pipelines.
3. **Deployment**: `kedro databricks deploy` deploys your Kedro project to Databricks.
To install the plugin, simply run:
```bash
pip install kedro-databricks
```
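To check that the plugin was picked up, you can ask the Kedro CLI for the plugin's help text from within a Kedro project:

```bash
kedro databricks --help
```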
Now you can use the plugin to develop Kedro pipelines for Databricks.
Before you begin, ensure that the Databricks CLI is installed and configured. For more information on installation and configuration, please refer to the Databricks CLI documentation.
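For example, you can confirm the CLI is available and set up authentication as follows (the exact authentication flow depends on your CLI version):

```bash
databricks --version   # confirm the CLI is installed
databricks configure   # interactively configure workspace host and token
```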
To create a project based on the `databricks-iris` starter, ensure you have installed Kedro into a virtual environment. Then use the following command:

```bash
pip install kedro
```
Soon you will be able to initialize the `databricks-iris` starter with the following command:

```bash
kedro new --starter="databricks-iris"
```
After the project is created, navigate to the newly created project directory:
```bash
cd <my-project-name> # change directory
```
Install the required dependencies:
```bash
pip install -r requirements.txt
pip install kedro-databricks
```
Now you can initialize the Databricks Asset Bundle:

```bash
kedro databricks init
```
Next, generate the Asset Bundle resources definition:
```bash
kedro databricks bundle
```
Finally, deploy the Kedro project to Databricks:
```bash
kedro databricks deploy
```
That's it! Your pipelines have now been deployed as a workflow to Databricks as `[dev <user>] <project_name>`. Try running the workflow to see the results.
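For example, you can trigger the deployed workflow straight from the CLI with the bundle runner (the resource key below is illustrative; check the generated `resources/` files for the actual job name):

```bash
databricks bundle run <project_name>
```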
## `kedro databricks init`
To initialize a Kedro project for Databricks, run:
```bash
kedro databricks init
```
This command will create the following files:
```
├── databricks.yml          # Databricks Asset Bundle configuration
├── conf/
│   └── base/
│       └── databricks.yml  # Workflow overrides
```
The `databricks.yml` file is the main configuration file for the Databricks Asset Bundle. The `conf/base/databricks.yml` file is used to override the Kedro workflow configuration for Databricks.

Override the Kedro workflow configuration for Databricks in the `conf/base/databricks.yml` file:
```yaml
# conf/base/databricks.yml
default: # will be applied to all workflows
  job_clusters:
    - job_cluster_key: default
      new_cluster:
        spark_version: 7.3.x-scala2.12
        node_type_id: Standard_DS3_v2
        num_workers: 2
        spark_env_vars:
          KEDRO_LOGGING_CONFIG: /dbfs/FileStore/<package-name>/conf/logging.yml
  tasks: # will be applied to all tasks in each workflow
    - task_key: default
      job_cluster_key: default

<workflow-name>: # will only be applied to the workflow with the specified name
  job_clusters:
    - job_cluster_key: high-concurrency
      new_cluster:
        spark_version: 7.3.x-scala2.12
        node_type_id: Standard_DS3_v2
        num_workers: 2
        spark_env_vars:
          KEDRO_LOGGING_CONFIG: /dbfs/FileStore/<package-name>/conf/logging.yml
  tasks:
    - task_key: default # will be applied to all tasks in the specified workflow
      job_cluster_key: high-concurrency
    - task_key: <my-task> # will only be applied to the specified task in the specified workflow
      job_cluster_key: high-concurrency
```
The plugin loads all configuration files matching `conf/databricks*` or `conf/databricks/*`.
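For example, with the default `base` environment, either of the following layouts would be picked up (file names other than `databricks.yml` are illustrative):

```
conf/
└── base/
    ├── databricks.yml    # matches conf/databricks*
    └── databricks/
        └── clusters.yml  # matches conf/databricks/*
```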
## `kedro databricks bundle`
To generate the Asset Bundle resources definition, run:

```bash
kedro databricks bundle
```
This command will generate the following files:
```
├── resources/
│   ├── <project>.yml          # Asset Bundle resources definition, corresponds to `kedro run`
│   └── <project-pipeline>.yml # Asset Bundle resources definition for each pipeline, corresponds to `kedro run --pipeline <pipeline-name>`
```
The generated resources definition files are used to define the resources required to run the Kedro pipeline on Databricks.
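As a rough sketch, the generated files follow the standard Databricks Asset Bundle jobs schema, typically with one Databricks task per Kedro node (the job and task names below are illustrative; the actual content depends on your pipelines and plugin version):

```yaml
# resources/<project>.yml (illustrative shape)
resources:
  jobs:
    <project>:
      name: <project>
      tasks:
        - task_key: <node-name>
          job_cluster_key: default
```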
## `kedro databricks deploy`
To deploy a Kedro project to Databricks, run:
```bash
kedro databricks deploy
```
This command will deploy the Kedro project to Databricks. The deployment process includes the following steps:

- Upload the project's `/conf` files to Databricks
- Upload `/data/raw/*` and ensure the other `/data` directories are created
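As a quick sanity check after deploying, you can validate the bundle and look for the new workflow with the Databricks CLI (output formats vary by CLI version):

```bash
databricks bundle validate   # run from the project root to check the bundle definition
databricks jobs list         # look for "[dev <user>] <project_name>" in the output
```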