dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0
314 stars 123 forks source link

External sources in dbt

dbt v0.15.0 added support for an external property within sources that can include information about location, partitions, and other database-specific properties.

This package provides:

Supported databases

sample docs

Installation

Follow the instructions at hub.getdbt.com on how to modify your packages.yml and run dbt deps.

Syntax

The stage_external_sources macro is the primary point of entry when using this package. It has two operational modes: standard and "full refresh."

# iterate through all source nodes, create if missing, refresh metadata
$ dbt run-operation stage_external_sources

# iterate through all source nodes, create or replace (+ refresh if necessary)
$ dbt run-operation stage_external_sources --vars "ext_full_refresh: true"

The stage_external_sources macro accepts a limited node selection syntax similar to snapshotting source freshness:

# stage all Snowplow and Logs external sources:
$ dbt run-operation stage_external_sources --args "select: snowplow logs"

# stage a particular external source table:
$ dbt run-operation stage_external_sources --args "select: snowplow.event"

Setup

The macros assume that you:

  1. Have already created your database's required scaffolding for external resources:
    • an external stage (Snowflake)
    • an external schema + S3 bucket (Redshift Spectrum)
    • an external data source and file format (Synapse)
    • an external data source and databse-scoped credential (Azure SQL)
    • a Google Cloud Storage bucket (BigQuery)
    • an accessible set of files (Spark)
  2. Have the appropriate permissions on to create tables using that scaffolding
  3. Have already created the database/project and/or schema/dataset in which dbt will create external tables (or snowpiped tables)

Spec

version: 2

sources:
  - name: snowplow
    tables:
      - name: event
        description: >
            This source table is actually a set of files in external storage.
            The dbt-external-tables package provides handy macros for getting
            those files queryable, just in time for modeling.

        external:
          location:         # required: S3 file path, GCS file path, Snowflake stage, Synapse data source

          ...               # database-specific properties of external table

          partitions:       # optional
            - name: collector_date
              data_type: date
              ...           # database-specific properties

        # Specify ALL column names + datatypes.
        # Column order must match for CSVs, column names must match for other formats.
        # Some databases support schema inference.

        columns:
          - name: app_id
            data_type: varchar(255)
            description: "Application ID"
          - name: platform
            data_type: varchar(255)
            description: "Platform"
          ...

The stage_external_sources macro will use this YAML config to compile and execute the appropriate create, refresh, and/or drop commands:

19:01:48 + 1 of 1 START external source spectrum.my_partitioned_tbl
19:01:48 + 1 of 1 (1) drop table if exists "db"."spectrum"."my_partitioned_tbl"
19:01:48 + 1 of 1 (1) DROP TABLE
19:01:48 + 1 of 1 (2) create external table "db"."spectrum"."my_partitioned_tbl"...
19:01:48 + 1 of 1 (2) CREATE EXTERNAL TABLE
19:01:48 + 1 of 1 (3) alter table "db"."spectrum"."my_partitioned_tbl"...
19:01:49 + 1 of 1 (3) ALTER EXTERNAL TABLE

Resources

If you encounter issues using this package or have questions, please check the open issues, as there's a chance it's a known limitation or work in progress. If not, you can:

Additional contributions to this package are very welcome! Please create issues or open PRs against master. Check out this post on the best workflow for contributing to a package.