data-mie / dbt-profiler

Macros for generating dbt model data profiles
Apache License 2.0

dbt-profiler

dbt-profiler implements dbt macros for profiling database relations and creating doc blocks and table schemas (schema.yml) containing said profiles. A calculated profile contains the following measures for each column in a relation:

\* numeric, date and time columns only
\*\* numeric columns only
^ can be excluded from the profile using the exclude_measures argument

Purpose

dbt-profiler aims to provide the following:

  1. get_profile macro for generating profiling SQL queries that can be used as dbt models or ad-hoc queries
  2. print_profile macro for ad-hoc model profiling to support data exploration
  3. A mechanism for including model profiles in dbt docs

For the third point there are at least two options:

  1. meta properties, and
  2. doc blocks.

An example of the first is implemented in the print_profile_schema macro. The second can be achieved with the following pattern:

  1. Add a "docs" folder explicitly to dbt_project.yml via model-paths
    model-paths: ["models", "docs"]
  2. Use print_profile_docs macro to generate the profile as a Markdown table wrapped in a Jinja docs macro
  3. Copy the output to a docs/dbt_profiler/<model>.md file
    
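    # docs/dbt_profiler/customer.md
    {% docs dbt_profiler__customer %}
    | column_name             | data_type | not_null_proportion | distinct_proportion | distinct_count | is_unique | min        | max        |                 avg | median |  std_dev_population |      std_dev_sample | profiled_at                   |
    | ----------------------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------- | ---------- | ------------------- | ------ | ------------------- | ------------------- | ----------------------------- |
    | customer_id             | int64     |                1.00 |                1.00 |            100 |         1 | 1          | 100        | 50.5000000000000000 |     50 | 28.8660700477221200 | 29.0114919758820200 | 2022-01-13 10:14:48.300040+00 |
    | first_order             | date      |                0.62 |                0.46 |             46 |         0 | 2018-01-01 | 2018-04-07 |                     |        |                     |                     | 2022-01-13 10:14:48.300040+00 |
    | most_recent_order       | date      |                0.62 |                0.52 |             52 |         0 | 2018-01-09 | 2018-04-09 |                     |        |                     |                     | 2022-01-13 10:14:48.300040+00 |
    | number_of_orders        | int64     |                0.62 |                0.04 |              4 |         0 | 1          | 5          |  1.5967741935483863 |      1 |  0.7716692718648833 |  0.7779687173818426 | 2022-01-13 10:14:48.300040+00 |
    | customer_lifetime_value | float64   |                0.62 |                0.35 |             35 |         0 | 1          | 99         | 26.9677419354838830 |     22 | 18.6599171435558730 | 18.8122455252636630 | 2022-01-13 10:14:48.300040+00 |
    {% enddocs %}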

  4. Include the profile in a model description using the `doc` macro
```yml
version: 2

models:
  - name: customer
    description: |
      Represents a customer.

      `dbt-profiler` results:

      {{ doc("dbt_profiler__customer") }}
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
```

Continuous integration (CI)

One of the advantages of the doc approach over the meta approach is that it doesn't require changes to the schema.yml except for the doc macro call. Once the macro call has been embedded in the schema the actual profiles can be maintained in a dedicated dbt_profiler/ directory as Markdown files. The profile files can then be automatically updated by a CI process that runs once a week or month as follows:

  1. List the models you want to profile (e.g., using dbt list --output name -m ${node_selection})

  2. For each model run dbt run-operation print_profile_docs --args '{"relation_name": "'${relation_name}'", "schema": "'${schema}'"}' and store the result in dbt_profiler/${relation_name}.md

    • Note that you need to capture the dbt run-operation print_profile_docs output (e.g., in a variable) before writing it to the target file. Redirecting the output directly to a file (e.g., dbt run-operation print_profile_docs > ${relation_name}.md) empties the target file before dbt compiles the project, and compilation fails if the project already refers to the doc block that the operation has not yet regenerated. See the example update-relation-profile.sh script.
  3. Create a Pull Request for the updated profiles (e.g., using create-pull-request GitHub Action)
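
Steps 1 and 2 above can be sketched as a small shell script. This is a minimal sketch under stated assumptions, not the project's update-relation-profile.sh: the function names update_profile and update_all_profiles, the PROFILE_DIR variable and its docs/dbt_profiler default are hypothetical, and the sed filter assumes that stdout may contain dbt log lines around the doc block.

```shell
#!/usr/bin/env bash
# Sketch only: update_profile, update_all_profiles and PROFILE_DIR are
# hypothetical names, not part of dbt-profiler.

# Regenerate the doc block for one relation and write it to
# ${PROFILE_DIR}/<relation_name>.md.
update_profile() {
  local relation_name="$1"
  local schema="$2"
  local profile_dir="${PROFILE_DIR:-docs/dbt_profiler}"
  local output doc_block

  # Capture stdout in a variable first, so the existing file stays intact
  # while dbt compiles the project and runs the operation.
  output="$(dbt run-operation print_profile_docs \
    --args "{\"relation_name\": \"${relation_name}\", \"schema\": \"${schema}\"}")" || return 1

  # Assumption: stdout may contain log lines, so keep only the doc block.
  doc_block="$(printf '%s\n' "${output}" | sed -n '/{% docs /,/{% enddocs %}/p')"
  if [ -z "${doc_block}" ]; then
    echo "No doc block found for ${relation_name}" >&2
    return 1
  fi

  # Only now overwrite the target file.
  mkdir -p "${profile_dir}"
  printf '%s\n' "${doc_block}" > "${profile_dir}/${relation_name}.md"
}

# Profile every model matched by a node selection (steps 1 and 2).
update_all_profiles() {
  local node_selection="$1"
  local schema="$2"

  dbt list --output name -m "${node_selection}" |
    while IFS= read -r relation_name; do
      # Detach stdin so the operation cannot consume the remaining model names.
      update_profile "${relation_name}" "${schema}" < /dev/null || exit 1
    done
}
```

Calling update_all_profiles "${node_selection}" "${schema}" from the project root would refresh one file per selected model. If dbt list prints log lines in your setup, they would need to be filtered out as well.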

Installation

dbt-profiler requires dbt version >=1.1.0. Check dbt Hub for the latest installation instructions.

Supported adapters

dbt-profiler may work with unsupported adapters but they haven't been tested yet. If you've used dbt-profiler with any of the unsupported adapters I'd love to hear your feedback (e.g., create an issue, open a PR or send me a DM on dbt Slack) 😊

✅ AWS Athena

✅ BigQuery

✅ Databricks

✅ PostgreSQL

✅ Redshift

✅ Snowflake

✅ SQL Server

❌ Apache Spark

❌ Presto

Macros

get_profile (source)

This macro returns a relation profile as a SQL query that can be used in a dbt model. This is handy for previewing relation profiles in dbt Cloud.

Arguments

Usage

Use this macro in a dbt model, using a ref():

{{ dbt_profiler.get_profile(relation=ref("customers"), where_clause="is_active = true") }}

Use this macro in a dbt model, using a source():

{{ dbt_profiler.get_profile(relation=source("jaffle_shop","customers"), exclude_measures=["std_dev_population", "std_dev_sample"]) }}

To configure the macro to be called only when dbt is in execute mode:

-- depends_on: {{ ref("customers") }}
{% if execute %}
    {{ dbt_profiler.get_profile(relation=ref("customers")) }}
{% endif %}

get_profile_table (source)

This macro returns a relation profile as an agate.Table. The macro does not print anything to stdout and therefore is not meant to be used as a standalone operation.

Arguments

Usage

Call this macro from another macro or dbt model:

{% set table = dbt_profiler.get_profile_table(relation_name="customers") %}

print_profile (source)

This macro does not work in dbt Cloud. The profile doesn't display in the cloud console log because the underlying print_table() method is disabled.

This macro prints a relation profile as a Markdown table to stdout.

Arguments

Usage

Call the macro as an operation:

dbt run-operation print_profile --args '{"relation_name": "customers"}'

An alternative for dbt Cloud that prints the profile in the console log, though not in Markdown format. It uses get_profile_table because iterating over rows requires an agate.Table rather than a SQL query:

{% set profile = dbt_profiler.get_profile_table(relation_name="customers") %}
{% for row in profile.rows %}
  {% do log(row.values(), info=True) %}
{% endfor %}

Example output

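| column_name             | data_type | not_null_proportion | distinct_proportion | distinct_count | is_unique | min        | max        |                 avg |  std_dev_population |      std_dev_sample | profiled_at                   |
| ----------------------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------- | ---------- | ------------------- | ------------------- | ------------------- | ----------------------------- |
| customer_id             | int64     |                1.00 |                1.00 |            100 |         1 | 1          | 100        | 50.5000000000000000 | 28.8660700477221200 | 29.0114919758820200 | 2022-01-13 10:14:48.300040+00 |
| first_order             | date      |                0.62 |                0.46 |             46 |         0 | 2018-01-01 | 2018-04-07 |                     |                     |                     | 2022-01-13 10:14:48.300040+00 |
| most_recent_order       | date      |                0.62 |                0.52 |             52 |         0 | 2018-01-09 | 2018-04-09 |                     |                     |                     | 2022-01-13 10:14:48.300040+00 |
| number_of_orders        | int64     |                0.62 |                0.04 |              4 |         0 | 1          | 5          |  1.5967741935483863 |  0.7716692718648833 |  0.7779687173818426 | 2022-01-13 10:14:48.300040+00 |
| customer_lifetime_value | float64   |                0.62 |                0.35 |             35 |         0 | 1          | 99         | 26.9677419354838830 | 18.6599171435558730 | 18.8122455252636630 | 2022-01-13 10:14:48.300040+00 |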

print_profile_schema (source)

This macro prints a relation schema YAML to stdout containing all columns and their profiles.

Arguments

Usage

Call the macro as an operation:

dbt run-operation print_profile_schema --args '{"relation_name": "customers"}'
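
To keep the generated schema in the project, the output can be captured and written to a file. The following is a minimal sketch, not part of dbt-profiler: write_profile_schema is a hypothetical helper, the target path is whatever you choose, and the sed filter assumes the YAML starts at the "version: 2" line and that any dbt log lines come before it.

```shell
#!/usr/bin/env bash
# Sketch only: write_profile_schema is a hypothetical helper name.

write_profile_schema() {
  local relation_name="$1"
  local target_file="$2"
  local output schema_yaml

  # Capture stdout before touching the target file.
  output="$(dbt run-operation print_profile_schema \
    --args "{\"relation_name\": \"${relation_name}\"}")" || return 1

  # Assumption: everything from the "version: 2" line onwards is the schema
  # YAML; log lines printed after it would not be removed by this filter.
  schema_yaml="$(printf '%s\n' "${output}" | sed -n '/^version: 2/,$p')"
  if [ -z "${schema_yaml}" ]; then
    echo "No schema YAML found for ${relation_name}" >&2
    return 1
  fi

  printf '%s\n' "${schema_yaml}" > "${target_file}"
}
```

For example, write_profile_schema customers models/customers_profile.yml would write the profile to a path of your choosing; the file name here is only an illustration.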

Example output

version: 2
models:
- name: customers
  description: ''
  columns:
  - name: number_of_orders
    description: ''
    meta:
      data_type: int64
      row_count: 100.0
      not_null_proportion: 0.62
      distinct_proportion: 0.04
      distinct_count: 4.0
      is_unique: 0.0
      min: '1'
      max: '5'
      avg: 1.5967741935483863
      std_dev_population: 0.7716692718648833
      std_dev_sample: 0.7779687173818426
      profiled_at: '2022-01-13 10:08:18.446822+00'
  - name: customer_lifetime_value
    description: ''
    meta:
      data_type: float64
      row_count: 100.0
      not_null_proportion: 0.62
      distinct_proportion: 0.35
      distinct_count: 35.0
      is_unique: 0.0
      min: '1'
      max: '99'
      avg: 26.967741935483883
      std_dev_population: 18.659917143555873
      std_dev_sample: 18.812245525263663
      profiled_at: '2022-01-13 10:08:18.446822+00'
  - name: customer_id
    description: ''
    meta:
      data_type: int64
      row_count: 100.0
      not_null_proportion: 1.0
      distinct_proportion: 1.0
      distinct_count: 100.0
      is_unique: 1.0
      min: '1'
      max: '100'
      avg: 50.5
      std_dev_population: 28.86607004772212
      std_dev_sample: 29.01149197588202
      profiled_at: '2022-01-13 10:08:18.446822+00'
  - name: first_order
    description: ''
    meta:
      data_type: date
      row_count: 100.0
      not_null_proportion: 0.62
      distinct_proportion: 0.46
      distinct_count: 46.0
      is_unique: 0.0
      min: '2018-01-01'
      max: '2018-04-07'
      avg: null
      std_dev_population: null
      std_dev_sample: null
      profiled_at: '2022-01-13 10:08:18.446822+00'
  - name: most_recent_order
    description: ''
    meta:
      data_type: date
      row_count: 100.0
      not_null_proportion: 0.62
      distinct_proportion: 0.52
      distinct_count: 52.0
      is_unique: 0.0
      min: '2018-01-09'
      max: '2018-04-09'
      avg: null
      std_dev_population: null
      std_dev_sample: null
      profiled_at: '2022-01-13 10:08:18.446822+00'

This is what the profile looks like on the dbt docs site:

dbt docs example

print_profile_docs (source)

This macro does not work in dbt Cloud. The profile doesn't display in the cloud console log because the underlying print_table() method is disabled.

This macro prints a relation profile as a Markdown table wrapped in a Jinja docs macro to stdout.

Arguments

Usage

Call the macro as an operation:

dbt run-operation print_profile_docs --args '{"relation_name": "customers"}'

Example output

{% docs dbt_profiler__customers  %}
| column_name             | data_type | not_null_proportion | distinct_proportion | distinct_count | is_unique | min        | max        |                 avg |  std_dev_population |      std_dev_sample | profiled_at                   |
| ----------------------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------- | ---------- | ------------------- | ------------------- | ------------------- | ----------------------------- |
| customer_id             | int64     |                1.00 |                1.00 |            100 |         1 | 1          | 100        | 50.5000000000000000 | 28.8660700477221200 | 29.0114919758820200 | 2022-01-13 10:14:48.300040+00 |
| first_order             | date      |                0.62 |                0.46 |             46 |         0 | 2018-01-01 | 2018-04-07 |                     |                     |                     | 2022-01-13 10:14:48.300040+00 |
| most_recent_order       | date      |                0.62 |                0.52 |             52 |         0 | 2018-01-09 | 2018-04-09 |                     |                     |                     | 2022-01-13 10:14:48.300040+00 |
| number_of_orders        | int64     |                0.62 |                0.04 |              4 |         0 | 1          | 5          |  1.5967741935483863 |  0.7716692718648833 |  0.7779687173818426 | 2022-01-13 10:14:48.300040+00 |
| customer_lifetime_value | float64   |                0.62 |                0.35 |             35 |         0 | 1          | 99         | 26.9677419354838830 | 18.6599171435558730 | 18.8122455252636630 | 2022-01-13 10:14:48.300040+00 |
{% enddocs %}

Contributions

mdutoo: Added date type to tests, fix #37 Error when profiling integer after date after string columns

Profiling a table whose columns are integer, date and string, in this order, raises the following error:

    ERROR: UNION types text and numeric cannot be matched
    LINE 60: avg("int_after_date_after_string") as avg,

Casting the null default value appropriately solves it.