dbt-profiler
implements dbt macros for profiling database relations and creating doc
blocks and table schemas (schema.yml
) containing said profiles. A calculated profile contains the following measures for each column in a relation:
column_name
: Name of the columndata_type
: Data type of the columnnot_null_proportion
^: Proportion of column values that are not NULL
(e.g., 0.62
means that 62% of the values are populated while 38% are NULL
)distinct_proportion
^: Proportion of unique column values (e.g., 1
means that 100% of the values are unique)distinct_count
^: Count of unique column valuesis_unique
^: True if all column values are uniquemin
*^: Minimum column valuemax
*^: Maximum column valueavg
**^: Average column valuemedian
**^: Median column valuestd_dev_population
**^: Population standard deviationstd_dev_sample
**^: Sample standard deviationprofiled_at
: Profile calculation date and time*
numeric, date and time columns only
**
numeric columns only
^
can be excluded from the profile using exclude_measures
argument
dbt-profiler
aims to provide the following:
For the third point there are at least two options:
meta
properties, and doc
blocks. An example of the first is implemented in the print_profile_schema macro. The second can be achieved with the following pattern:
"docs"
folder explicitly to dbt_project.yml
via model-paths
model-paths: ["models", "docs"]
docs
macrodocs/dbt_profiler/<model>.md
file
# docs/dbt_profiler/customer.md
{% docs dbt_profiler__customer %}
column_name | data_type | not_null_proportion | distinct_proportion | distinct_count | is_unique | min | max | avg | median | std_dev_population | std_dev_sample | profiled_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|
customer_id | int64 | 1.00 | 1.00 | 100 | 1 | 1 | 100 | 50.5000000000000000 | 50 | 28.8660700477221200 | 29.0114919758820200 | 2022-01-13 10:14:48.300040+00 |
first_order | date | 0.62 | 0.46 | 46 | 0 | 2018-01-01 | 2018-04-07 | 2022-01-13 10:14:48.300040+00 | ||||
most_recent_order | date | 0.62 | 0.52 | 52 | 0 | 2018-01-09 | 2018-04-09 | 2022-01-13 10:14:48.300040+00 | ||||
number_of_orders | int64 | 0.62 | 0.04 | 4 | 0 | 1 | 5 | 1.5967741935483863 | 1 | 0.7716692718648833 | 0.7779687173818426 | 2022-01-13 10:14:48.300040+00 |
customer_lifetime_value | float64 | 0.62 | 0.35 | 35 | 0 | 1 | 99 | 26.9677419354838830 | 22 | 18.6599171435558730 | 18.8122455252636630 | 2022-01-13 10:14:48.300040+00 |
{% enddocs %}
3. Include the profile in a model description using the `doc` macro
```yml
version: 2
models:
- name: customer
description: |
Represents a customer.
`dbt-profiler` results:
{{ doc("dbt_profiler__customer") }}
columns:
- name: customer_id
tests:
- not_null
- unique
One of the advantages of the doc
approach over the meta
approach is that it doesn't require changes to the schema.yml except for the doc
macro call. Once the macro call has been embedded in the schema the actual profiles can be maintained in a dedicated dbt_profiler/
directory as Markdown files. The profile files can then be automatically updated by a CI process that runs once a week or month as follows:
List the models you want to profile (e.g., using dbt list --output name -m ${node_selection}
)
For each model run dbt run-operation print_profile_docs --args '{"relation_name": "'${relation_name}'", "schema": "'${schema}'"}'
and store the result in dbt_profiler/${relation_name}.md
dbt run-operation print_profile_docs
output in e.g. a variable before piping it to the target file. Piping the output directly to a file (e.g., dbt run-operation print_profile_docs > ${relation_name}.md
) will result in a situation where the target file is emptied before dbt run-operation
compiles the dbt project which will throw an error if you're already referring to the doc
block that the operation has not yet generated. See example update-relation-profile.sh script.Create a Pull Request for the updated profiles (e.g., using create-pull-request GitHub Action)
dbt-profiler
requires dbt version >=1.1.0
. Check dbt Hub for the latest installation instructions.
dbt-profiler
may work with unsupported adapters but they haven't been tested yet. If you've used dbt-profiler
with any of the unsupported adapters I'd love to hear your feedback (e.g., create an issue, PR or hit me with with a DM on dbt Slack) 😊
✅ AWS Athena
✅ BigQuery
✅ Databricks
✅ PostgreSQL
✅ Redshift
✅ Snowflake
✅ SQL Server
❌ Apache Spark
❌ Presto
This macro returns a relation profile as a SQL query that can be used in a dbt model. This is handy for previewing relation profiles in dbt Cloud.
relation
(required): Relation objectexclude_measures
(optional): List of measures to exclude from the profile (default: []
)include_columns
(optional): List of columns to include in the profile (default: []
i.e., all). Only one of include_columns
and exclude_columns
can be specified at a time.exclude_columns
(optional): List of columns to exclude from the profile (default: []
). Only one of include_columns
and exclude_columns
can be specified at a time.where_clause
(optional): SQL WHERE
clause to allow exclustion of records from profiler.group_by
(optional): SQL group_by
to aggregate data from profiler (default: []
)Use this macro in a dbt model, using a ref():
{{ dbt_profiler.get_profile(relation=ref("customers"), where_clause="is_active = true") }}
Use this macro in a dbt model, using a source():
{{ dbt_profiler.get_profile(relation=source("jaffle_shop","customers"), exclude_measures=["std_dev_population", "std_dev_sample"]) }}
To configure the macro to be called only when dbt is in execute mode:
-- depends_on: {{ ref("customers") }}
{% if execute %}
{{ dbt_profiler.get_profile(relation=ref("customers")) }}
{% endif %}
This macro returns a relation profile as an agate.Table. The macro does not print anything to stdout
and therefore is not meant to be used as a standalone operation.
relation
(either relation
or relation_name
is required): Relation objectrelation_name
(either relation
or relation_name
is required): Relation nameschema
(optional): Schema where relation_name
exists (default: none
i.e., target schema)database
(optional): Database where relation_name
exists (default: none
i.e., target database)exclude_measures
(optional): List of measures to exclude from the profile (default: []
)include_columns
(optional): List of columns to include in the profile (default: []
i.e., all). Only one of include_columns
and exclude_columns
can be specified at a time.exclude_columns
(optional): List of columns to exclude from the profile (default: []
). Only one of include_columns
and exclude_columns
can be specified at a time.where_clause
(optional): SQL where clause to allow exclustion of records from profiler. This is done after the WHERE
keyword.Call this macro from another macro or dbt model:
{% set table = dbt_profiler.get_profile_table(relation_name="customers") %}
❗ This macro does not work in dbt Cloud. The profile doesn't display in the cloud console log because the underlying print_table() method is disabled.
This macro prints a relation profile as a Markdown table to stdout
.
relation
(either relation
or relation_name
is required): Relation objectrelation_name
(either relation
or relation_name
is required): Relation nameschema
(optional): Schema where relation_name
exists (default: none
i.e., target schema)database
(optional): Database where relation_name
exists (default: none
i.e., target database)exclude_measures
(optional): List of measures to exclude from the profile (default: []
)include_columns
(optional): List of columns to include in the profile (default: []
i.e., all). Only one of include_columns
and exclude_columns
can be specified at a time.exclude_columns
(optional): List of columns to exclude from the profile (default: []
). Only one of include_columns
and exclude_columns
can be specified at a time.max_rows
(optional): The maximum number of rows to display before truncating the data (default: none
i.e., not truncated)max_columns
(optional): The maximum number of columns to display before truncating the data (default: 7
)max_column_width
(optional): Truncate all columns to at most this width (default: 30
)max_precision
(optional): Puts a limit on the maximum precision displayed for number types (default: none
i.e., not limited)where_clause
(optional): SQL where clause to allow exclustion of records from profiler. This is done after the WHERE
keyword.Call the macro as an operation:
dbt run-operation print_profile --args '{"relation_name": "customers"}'
An alternative for dbt Cloud that prints the profile in the console log but not in a Markdown format:
{% set profile = dbt_profiler.get_profile(relation=ref("customers")) %}
{% for row in profile.rows %}
{% do log(row.values(), info=True) %}
{% endfor %}
column_name | data_type | not_null_proportion | distinct_proportion | distinct_count | is_unique | min | max | avg | std_dev_population | std_dev_sample | profiled_at |
---|---|---|---|---|---|---|---|---|---|---|---|
customer_id | int64 | 1.00 | 1.00 | 100 | 1 | 1 | 100 | 50.5000000000000000 | 28.8660700477221200 | 29.0114919758820200 | 2022-01-13 10:14:48.300040+00 |
first_order | date | 0.62 | 0.46 | 46 | 0 | 2018-01-01 | 2018-04-07 | 2022-01-13 10:14:48.300040+00 | |||
most_recent_order | date | 0.62 | 0.52 | 52 | 0 | 2018-01-09 | 2018-04-09 | 2022-01-13 10:14:48.300040+00 | |||
number_of_orders | int64 | 0.62 | 0.04 | 4 | 0 | 1 | 5 | 1.5967741935483863 | 0.7716692718648833 | 0.7779687173818426 | 2022-01-13 10:14:48.300040+00 |
customer_lifetime_value | float64 | 0.62 | 0.35 | 35 | 0 | 1 | 99 | 26.9677419354838830 | 18.6599171435558730 | 18.8122455252636630 | 2022-01-13 10:14:48.300040+00 |
This macro prints a relation schema YAML to stdout
containing all columns and their profiles.
relation
(either relation
or relation_name
is required): Relation objectrelation_name
(either relation
or relation_name
is required): Relation nameschema
(optional): Schema where relation_name
exists (default: none
i.e., target schema)database
(optional): Database where relation_name
exists (default: none
i.e., target database)exclude_measures
(optional): List of measures to exclude from the profile (default: []
)include_columns
(optional): List of columns to include in the profile (default: []
i.e., all). Only one of include_columns
and exclude_columns
can be specified at a time.exclude_columns
(optional): List of columns to exclude from the profile (default: []
). Only one of include_columns
and exclude_columns
can be specified at a time.model_description
(optional): Model description included in the schema (default: ""
)column_description
(optional): Column descriptions included in the schema (default: ""
)where_clause
(optional): SQL where clause to allow exclustion of records from profiler. This is done after the WHERE
keyword.Call the macro as an operation:
dbt run-operation print_profile_schema --args '{"relation_name": "customers"}'
version: 2
models:
- name: customers
description: ''
columns:
- name: number_of_orders
description: ''
meta:
data_type: int64
row_count: 100.0
not_null_proportion: 0.62
distinct_proportion: 0.04
distinct_count: 4.0
is_unique: 0.0
min: '1'
max: '5'
avg: 1.5967741935483863
std_dev_population: 0.7716692718648833
std_dev_sample: 0.7779687173818426
profiled_at: '2022-01-13 10:08:18.446822+00'
- name: customer_lifetime_value
description: ''
meta:
data_type: float64
row_count: 100.0
not_null_proportion: 0.62
distinct_proportion: 0.35
distinct_count: 35.0
is_unique: 0.0
min: '1'
max: '99'
avg: 26.967741935483883
std_dev_population: 18.659917143555873
std_dev_sample: 18.812245525263663
profiled_at: '2022-01-13 10:08:18.446822+00'
- name: customer_id
description: ''
meta:
data_type: int64
row_count: 100.0
not_null_proportion: 1.0
distinct_proportion: 1.0
distinct_count: 100.0
is_unique: 1.0
min: '1'
max: '100'
avg: 50.5
std_dev_population: 28.86607004772212
std_dev_sample: 29.01149197588202
profiled_at: '2022-01-13 10:08:18.446822+00'
- name: first_order
description: ''
meta:
data_type: date
row_count: 100.0
not_null_proportion: 0.62
distinct_proportion: 0.46
distinct_count: 46.0
is_unique: 0.0
min: '2018-01-01'
max: '2018-04-07'
avg: null
std_dev_population: null
std_dev_sample: null
profiled_at: '2022-01-13 10:08:18.446822+00'
- name: most_recent_order
description: ''
meta:
data_type: date
row_count: 100.0
not_null_proportion: 0.62
distinct_proportion: 0.52
distinct_count: 52.0
is_unique: 0.0
min: '2018-01-09'
max: '2018-04-09'
avg: null
std_dev_population: null
std_dev_sample: null
profiled_at: '2022-01-13 10:08:18.446822+00'
This what the profile looks like on the dbt docs site:
❗ This macro does not work in dbt Cloud. The profile doesn't display in the cloud console log because the underlying print_table() method is disabled.
This macro prints a relation profile as a Markdown table wrapped in a Jinja docs
macro to stdout
.
relation
(either relation
or relation_name
is required): Relation objectrelation_name
(either relation
or relation_name
is required): Relation nameschema
(optional): Schema where relation_name
exists (default: none
i.e., target schema)database
(optional): Database where relation_name
exists (default: none
i.e., target database)exclude_measures
(optional): List of measures to exclude from the profile (default: []
)include_columns
(optional): List of columns to include in the profile (default: []
i.e., all). Only one of include_columns
and exclude_columns
can be specified at a time.exclude_columns
(optional): List of columns to exclude from the profile (default: []
). Only one of include_columns
and exclude_columns
can be specified at a time.docs_name
(optional): docs
macro name (default: dbt_profiler__{{ relation_name }}
)max_rows
(optional): The maximum number of rows to display before truncating the data (default: none
i.e., not truncated)max_columns
(optional): The maximum number of columns to display before truncating the data (default: 7
)max_column_width
(optional): Truncate all columns to at most this width (default: 30
)max_precision
(optional): Puts a limit on the maximum precision displayed for number types (default: none
i.e., not limited)where_clause
(optional): SQL where clause to allow exclustion of records from profiler. This is done after the WHERE
keyword.Call the macro as an operation:
dbt run-operation print_profile_docs --args '{"relation_name": "customers"}'
{% docs dbt_profiler__customers %}
| column_name | data_type | not_null_proportion | distinct_proportion | distinct_count | is_unique | min | max | avg | std_dev_population | std_dev_sample | profiled_at |
| ----------------------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------- | ---------- | ------------------- | ------------------- | ------------------- | ----------------------------- |
| customer_id | int64 | 1.00 | 1.00 | 100 | 1 | 1 | 100 | 50.5000000000000000 | 28.8660700477221200 | 29.0114919758820200 | 2022-01-13 10:14:48.300040+00 |
| first_order | date | 0.62 | 0.46 | 46 | 0 | 2018-01-01 | 2018-04-07 | | | | 2022-01-13 10:14:48.300040+00 |
| most_recent_order | date | 0.62 | 0.52 | 52 | 0 | 2018-01-09 | 2018-04-09 | | | | 2022-01-13 10:14:48.300040+00 |
| number_of_orders | int64 | 0.62 | 0.04 | 4 | 0 | 1 | 5 | 1.5967741935483863 | 0.7716692718648833 | 0.7779687173818426 | 2022-01-13 10:14:48.300040+00 |
| customer_lifetime_value | float64 | 0.62 | 0.35 | 35 | 0 | 1 | 99 | 26.9677419354838830 | 18.6599171435558730 | 18.8122455252636630 | 2022-01-13 10:14:48.300040+00 |
{% enddocs %}
Profiling a table whose column are integer, date, string in this order raises the following error : ERROR: UNION types text and numeric cannot be matched LINE 60: avg("int_after_date_after_string") as avg, Appropriately casting the null default value solves it.