alittlesliceoftom / insert_by_timeperiod

Dbt package for advanced materialisation "insert by timeperiod" which builds on insert by period implementations. Currently tested in Azure Synaspse only.
Apache License 2.0
2 stars 1 forks source link

[Synapse]: Initial Release #1

Closed alittlesliceoftom closed 1 month ago

alittlesliceoftom commented 2 months ago

This PR sets up a working setup for a Synapse instance.

This was tested by importing the package from branch and running an ibtp model (from M-KOPA's instance). The model ran fine.

CI Is not yet established for this repo.

Versions used: dbt-core 1.4.9 dbt-sqlserver 1.4.3 dbt-synapse 1.4.0

This PR sets up a working set of files that have been tested via import into M-KOPA's instance. The main fix on the PR is to reference the insert_by_period.function() instead of just function() !

alittlesliceoftom commented 1 month ago

Features of release summarised by current topline of README:

NOTE: Switching from insert_by_period to insert_by_timeperiod is a breaking change for models, IBTP is only known to work on Microsoft Synapse (other dbs to come, contributitions and tests welcome!) and IBTP had a separate developement path from latest IBP materialisation. Compared to "insert_by_period" accessible in dbt-labs-experimental-features , we have retained slightly different naming to make this distinction clear.

insert_by_timeperiod (IBTP) allows dbt to insert records into a table one period (i.e. day, week, month or year) at a time.

This materialisation is supported only for synapse. For other adapters, utilise the insert_by_period macros.

This materialization is appropriate for event data that can be processed in discrete periods. It is similar in concept to the built-in incremental materialization, but has the added benefit of building the model in chunks even during a full-refresh so is particularly useful for models where the initial run can be problematic.

Should a run of a model using this materialization be interrupted, a subsequent run will continue building the target table from where it was interrupted (granted the --full-refresh flag is omitted).

Progress is logged in the command line for easy monitoring.

The synapse insert_by_timeperiod materialisation includes a couple of differences from the other implementations:

  1. CLI supported backfill mode with insertable start and end dates.

    • Enables you to backfill a range of dates
    • Backfill is done by DELETING whole date, then inserting replacement records (this is performant and simple)
    • Make backfill take priority over full refresh, this is a key change as it means you can run a combined backfill (IBP model) and full refresh (incremental models) at the same time. This enables you to run complex graphs of IBTP and incremental models, backfilling a date range, and they just work.
    • Backfill deletion is done by the timestamp_field
  2. Customisable FROM and TO dates inside the materialisation loop enable you to include WINDOW functions in your code.

    • This requires sligthly more effort to use than the insert_by_period approach, but is more explicit and configurable, please see usage section below.