fivetran / dbt_fivetran_log

Data models for Fivetran's internal log connector built using dbt.
https://fivetran.github.io/dbt_fivetran_log/
Apache License 2.0

feature/databricks-delta-incremental-support #130

Closed fivetran-joemarkiewicz closed 2 months ago

fivetran-joemarkiewicz commented 3 months ago

PR Overview

This PR will address the following Issue/Feature: Internal tickets and Issue #128

This PR will result in the following new package version: v1.8.0

When I tested this locally on Databricks, running without a full refresh raised no error; however, the table format did not change. Therefore, this should be treated as a breaking change to ensure a full refresh is run and the delta table format is applied.

Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:

🚨 Breaking Changes 🚨

⚠️ Since the following changes result in the table format changing, we recommend running a --full-refresh after upgrading to this version to avoid possible incremental failures.

  • For Databricks All Purpose clusters, the fivetran_platform__audit_table model will now be materialized using the delta table format (previously parquet). See the config sketch below this list.
  • Delta tables are generally more performant than parquet and are also more widely available to Databricks users. Previously, the parquet file format was causing compilation issues on customers' managed tables.
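For illustration, a minimal sketch of the kind of model config this change implies, using the dbt-databricks `file_format` and `incremental_strategy` options; this is not the package's literal config:

```jinja
{# Hypothetical config sketch for fivetran_platform__audit_table on a Databricks All Purpose cluster. #}
{# Previously: file_format='parquet' #}
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        file_format='delta'
    )
}}
```

After upgrading, running `dbt run --full-refresh` rebuilds the table so the new delta format takes effect.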

Documentation Updates

  • Updated the sync_start and sync_end field descriptions for fivetran_platform__audit_table to explicitly state that these fields represent only the start and end times of the window in which the connector wrote new records or modified existing records in the specified table.

Under the Hood

  • The is_databricks_sql_warehouse macro has been renamed to is_databricks_all_purpose and has been modified to return true if the Databricks runtime in use is an All Purpose cluster (previously this macro checked whether a SQL Warehouse runtime was used). A sketch of the renamed macro follows this list.
    • This update was applied because other Databricks runtimes have since been discovered (i.e. endpoint and external runtimes) which do not support the insert_overwrite incremental strategy used in the fivetran_platform__audit_table model.
  • Additionally, for Databricks users the fivetran_platform__audit_table model will now use an incremental strategy only if the runtime is an All Purpose cluster; all other Databricks runtimes will fall back to the table materialization.
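For reference, a hedged sketch of what the renamed macro could look like; the detection details (regex, HTTP path shapes) are assumptions, not the package's actual implementation:

```jinja
{# Hypothetical sketch of is_databricks_all_purpose; the package's actual logic may differ. #}
{# Assumption: All Purpose cluster HTTP paths contain "sql/protocolv1/", #}
{# while SQL Warehouse paths contain "/warehouses/". #}
{% macro is_databricks_all_purpose() %}
    {% if target.type != 'databricks' %}
        {{ return(False) }}
    {% endif %}
    {% set match = modules.re.search("sql/protocolv1/", target.http_path) %}
    {{ return(match is not none) }}
{% endmacro %}
```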

PR Checklist

Basic Validation

Please acknowledge that you have successfully performed the following commands locally:

Before marking this PR as "ready for review" the following have been applied:

Detailed Validation

Please share any and all of your validation steps:

To validate these changes, validation tests were included, and you can see they were successful for the following destinations:

Additionally, I validated that the All Purpose cluster appropriately runs an incremental strategy and that a non-All Purpose runtime (a SQL Warehouse in this case) does not.

Finally, I confirmed that the delta format runs as expected and without issue on the Databricks All Purpose cluster on incremental runs. (screenshot attached)

If you had to summarize this PR in an emoji, which would it be?

🌳
fivetran-joemarkiewicz commented 3 months ago

The SQL Server Buildkite test is currently failing, but that is due to a permission issue which should hopefully be resolved soon. I will kick off the integration tests again once it's resolved. I don't imagine SQL Server is failing because of any of these changes, so this should be good to review even with the failing Buildkite test.

fivetran-catfritz commented 3 months ago

@fivetran-avinash really really good catch. I ran this only in Databricks so definitely didn't catch that then!

@fivetran-joemarkiewicz Tagging on to Avinash's comments, a more future-proof way to handle the logic update might be:

  1. Put the macro back to just checking whether it's a Databricks runtime.
  2. Set the config like:

```jinja
{# is_databricks_runtime: or whatever the old macro name was #}
{{
    config(
        ...
        materialized='table' if target.type == 'databricks'
            and not is_databricks_runtime() else 'incremental',
        ...
    )
}}
```

That way we don't have to list out the other warehouses. What do you think?

fivetran-joemarkiewicz commented 3 months ago

@fivetran-catfritz I like that idea, but the benefit of listing out each of the warehouses is that we explicitly use the incremental strategy only when we know the destination is supported. If it is not in our supported list, then we use the table materialization. This likely gives the greatest chance of success if, for some reason, an unsupported destination is used to run this model.
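To make the trade-off concrete, an allowlist-style config might look something like the following; the destination list here is illustrative, not the package's actual list:

```jinja
{# Hypothetical allowlist: use the incremental strategy only on destinations known to support it. #}
{{
    config(
        materialized='incremental'
            if target.type in ['bigquery', 'snowflake', 'redshift', 'postgres']
            or (target.type == 'databricks' and is_databricks_all_purpose())
            else 'table'
    )
}}
```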

fivetran-catfritz commented 3 months ago

@fivetran-joemarkiewicz Makes sense--in that case approved on my end!