dbt-labs / spark-utils

Utility functions for dbt projects running on Spark
https://hub.getdbt.com/fishtown-analytics/spark_utils/latest/
Apache License 2.0

This dbt package contains macros that can be (re)used across dbt projects running on Spark.

Installation Instructions

Check dbt Hub for the latest installation instructions, or read the docs for more information on installing packages.
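For example, you can install the package by adding it to your project's packages.yml. The version below is illustrative; check dbt Hub for the current release:

packages:
  - package: dbt-labs/spark_utils
    version: 0.3.0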


Compatibility

This package provides "shims" for dbt_utils: Spark-specific reimplementations of dbt_utils macros whose default implementations are not compatible with Apache Spark.

In order to use these "shims," you should set a dispatch config in your root project's dbt_project.yml (on dbt v0.20.0 and newer). For example, with this project setting, dbt will first search for macro implementations inside the spark_utils package when resolving macros from the dbt_utils namespace:

dispatch:
  - macro_namespace: dbt_utils
    search_order: ['spark_utils', 'dbt_utils']
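With this search order, dbt uses the Spark-specific implementation from spark_utils when one exists, and falls back to the default implementation in dbt_utils otherwise.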

Note to maintainers of other packages

The spark-utils package may be able to provide compatibility for your package, especially if your package leverages dbt-utils macros for cross-database compatibility. This package does not need to be specified as a dependency of your package in packages.yml. Instead, you should encourage anyone using your package on Apache Spark / Databricks to install spark_utils alongside your package and to set the dispatch config described above in their root project.


Useful macros: maintenance

Caveat: These macros are not tested in CI, nor guaranteed to work on all platforms.

Each of these macros accepts a regex pattern, finds tables whose names match that pattern, and loops over the matching tables to perform a maintenance operation such as optimizing, vacuuming, or analyzing them.
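You can invoke these macros ad hoc with dbt run-operation. A minimal sketch, assuming a hypothetical macro named spark_optimize_delta_tables that accepts a table_pattern argument (check the package source for the actual macro names and signatures):

dbt run-operation spark_optimize_delta_tables --args '{table_pattern: "fct_.*"}'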


Contributing

We welcome contributions to this repo! To contribute a new feature or a fix, please open a Pull Request with 1) your changes and 2) updated documentation in the README.md file.

Testing

The macros are tested with pytest and pytest-dbt-core. For example, the get_tables macro is tested in four steps (a complete test is sketched below):

  1. Create a test table (test setup):
    spark_session.sql(f"CREATE TABLE {table_name} (id int) USING parquet")
  2. Call the macro generator:
    tables = macro_generator()
  3. Assert the test condition:
    assert table_name in tables
  4. Delete the test table (test cleanup):
    spark_session.sql(f"DROP TABLE IF EXISTS {table_name}")

A macro is fetched using the macro_generator fixture, providing the macro name through indirect parameterization:

@pytest.mark.parametrize(
    "macro_generator", ["macro.spark_utils.get_tables"], indirect=True
)
def test_create_table(macro_generator: MacroGenerator) -> None:
    ...  # test body: the four steps shown above
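Putting the pieces together, a complete test might look like the sketch below. It assumes pytest-dbt-core is configured against the spark_utils project, that MacroGenerator is imported from dbt.clients.jinja, and that a spark_session fixture (for example, from the pytest-spark plugin) provides a SparkSession:

import pytest
from dbt.clients.jinja import MacroGenerator
from pyspark.sql import SparkSession


@pytest.mark.parametrize(
    "macro_generator", ["macro.spark_utils.get_tables"], indirect=True
)
def test_create_table(
    macro_generator: MacroGenerator, spark_session: SparkSession
) -> None:
    table_name = "default.example"
    # Test setup: create a table for the macro to find
    spark_session.sql(f"CREATE TABLE {table_name} (id int) USING parquet")
    # Run the get_tables macro via the fixture
    tables = macro_generator()
    # The newly created table should be among the results
    assert table_name in tables
    # Test cleanup: drop the table
    spark_session.sql(f"DROP TABLE IF EXISTS {table_name}")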

Getting started with dbt + Spark

Code of Conduct

Everyone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the PyPA Code of Conduct.