dbt-msft / tsql-utils

dbt-utils for the dbt-msft family of packages
MIT License

Convert surrogate keys from hash to int for Power BI #33

Open infused-kim opened 3 years ago

infused-kim commented 3 years ago

Hey Guys,

yes... ANOTHER surrogate key issue from me... But I think you guys are the right people to nerd around with for this type of thing :)

Problem Description

I am visualizing the models I create with DBT in Power BI and I imagine many of you are doing the same.

One of the best practices in Power BI is to keep columns small and reduce their uniqueness. Unfortunately hash surrogate keys don't fit that best practice at all.

You can't use the binary column type for relationships and instead have to use the string type. So all your keys are 32 characters long and highly unique.
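For context, dbt_utils.surrogate_key on SQL Server boils down to an MD5 hash rendered as a 32-character hex string. The exact SQL the adapter shim emits may differ, but the shape is roughly this (using the test_id / src_test names from the example further down, dbo schema assumed for illustration):

-- Rough shape of what dbt_utils.surrogate_key(["test_id"]) compiles to on
-- SQL Server: an md5 hash rendered as a 32-character hex string.
-- Illustrative only; the adapter shim may emit slightly different SQL.
select
    convert(
        varchar(32),
        hashbytes('md5', coalesce(cast(test_id as varchar(8000)), '')),
        2
    ) as test_key
from dbo.src_test;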

I analysed my model with DAX Studio and found that a very large portion of the model size is due to surrogate keys.

So one negative impact that is definitely happening is a bigger model size and RAM consumption of the Power BI datasets.

Additionally, there are some claims that it also affects the performance if you create relationships on large text columns.

Here is an article explaining the performance impact, and here is a Reddit discussion.

To be honest, I am skeptical about that. Since relationships are pre-defined, it would be silly if Power BI didn't optimize away the strings... (but I don't know for sure).

Is this even a problem that needs solving?

So my first question is: How do you handle the problem? Or do you think it's not important to handle at all?

Does the simplicity of DBT's hash keys and the productivity increase of DBT outweigh the downsides when using Power BI for you?

Do you have any idea on how to benchmark whether hash keys have an impact on query performance for imported Power BI datasets?

Possible solution

I have also come up with a prototype to solve this issue. I have not implemented it in production yet (and I am still undecided whether I actually should), but it seems to work well enough to get your opinion here.

Dimension table example

The model below has a test_int_key column that is set to null when the model is built.

It also has a post_hook that updates that column with an int value corresponding to the hash key.

{{ 
    config(
        materialized='table', 
        as_columnstore = false,
        post_hook=["
            {{ gen_int_surrogate_keys(this, 'test_key', 'test_int_key')}}
        "]
    )
}}

with src_test as (

    select * from {{ref('src_test')}}

),

dim_test as (
    select
        {{ dbt_utils.surrogate_key(["test_id"]) }} as test_key,
        cast(null as int) as test_int_key, -- Will be replaced by final int key in macro
        status,
        foo,
        bar
    from src_test
)
select * from dim_test

Macro that maintains a table with int keys for each hash key

This macro creates a table for the dimension's primary key, with an identity column for the int_key and another column for the hash_key.

When the post_hook runs this macro, it merges all new hash keys into the table, which generates new rows with a new unique int_key for them (due to the identity column property).

Then the macro updates the int_key column of the dimension table with int keys that correspond to the hash keys.

{%- macro gen_int_surrogate_keys(this,
                                 hash_key_col_name,
                                 int_key_col_name) -%}
    {% 
        set int_table = this.schema + '.int_keys_' + this.table
    %}

    -- Create table to store int keys for hash keys if it doesn't exist
    if object_id ('{{ int_table }}', 'U') is null
    begin
        create table {{ int_table }} (
            int_key int identity(1,1) not null,
            hash_key varbinary(8000)
        )
        {% set idx_name = 'int_keys_' + this.table + '__index_on_hash_key' %}
        create nonclustered index {{ idx_name }}
            on {{ int_table }} (hash_key)
    end;

    -- Merge new hash keys that are not in int_table yet
    with hash_key_data as (
        select
            {{ hash_key_col_name }} as hash_key
        from {{ this }}
    )
    merge {{ int_table }} target_tbl
    using hash_key_data src_tbl
        on target_tbl.hash_key = src_tbl.hash_key
    when not matched by target
    then insert (hash_key) values (src_tbl.hash_key);

    -- Update orig table's int_key column with int keys
    update 
        {{ this }}
    set 
        {{ this }}.{{ int_key_col_name }} = int_key.int_key 
    from {{ int_table }} int_key
    where 
        {{ this }}.{{ hash_key_col_name }} = int_key.hash_key
{%- endmacro -%}
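As a sanity check after a run, every hash key in the dimension should map to exactly one int key, and re-running the model must not reassign previously issued int keys. A hypothetical query (the table and column names follow the macro's int_keys_ naming convention and the dim_test example above, dbo schema assumed):

-- Hypothetical sanity check: count hash keys in the dimension that did not
-- receive an int key. The result should be zero after the post_hook runs.
select count(*) as missing_int_keys
from dbo.dim_test d
left join dbo.int_keys_dim_test k
    on k.hash_key = d.test_key
where k.int_key is null;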

Fact Table View

I generate table models for my facts and dimensions, and then I create views with renamed columns that I actually import into Power BI.

The underlying dimension tables contain the int_key columns. The fact tables are not regenerated to include the int key columns.

Instead I join to the dimensions in the report views on the hash key, and then retrieve the int key from the dimension:

with fact_test as (

    select * from {{ref('fact_test')}}

),

dim_test as (

    select * from {{ref('dim_test')}}

),

final as (
    select 
        -- Use the int key as the key instead of the hash key
        dim_test.test_int_key as TestKey,
        spend as Spend,
        impressions as Impressions,
        clicks as Clicks,
        conversions as Conversions
    from fact_test

    -- Join to the dimension table using the hash key
    -- This way we don't have to recreate the actual 
    -- fact tables with the new int keys.
    -- We do it dynamically in the report views.
    left join dim_test
    on fact_test.test_key = dim_test.test_key
)

select * from final

What do you think?

This increases the model generation time and probably makes the Power BI refresh a bit longer due to the join in the report view.

But the upside is that the model size and RAM consumption would be lower. It also has the potential to improve query performance.

Do you think this is an issue worth bothering with? Or is the upside too small?

And what do you think of my idea to solve it? Do you perhaps have improvement suggestions or other ideas of how this could be solved better?

And do you think there is a big enough need for this that we should integrate the solution into tsql-utils?

fritz3000g commented 2 years ago

I'm very interested in this issue. Thank you for sharing your thoughts!

infused-kim commented 2 years ago

> I'm very interested in this issue. Thank you for sharing your thoughts!

yay, finally someone who cares about this too! :)

What are your thoughts @fritz3000g?

fritz3000g commented 2 years ago

It looks like you're basically maintaining a key map at the same grain as the fact table. So if you truncate the fact table you can keep the key map, unless you're changing the grain. My basic question is performance, for example using a 1,000,000-row fact table and a 100,000-row dimension with GUID keys.

What would be your load time on:

What would be the size and performance of the data model in:

And all of this compared for the same workload using the GUID keys directly vs. the integer keys. Maintaining a key map isn't that unusual of a thing to do in my mind, though it's only slightly less annoying than maintaining surrogate keys in the fact table. I'd be interested to see it at work though, and know how it performs.
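For what it's worth, a synthetic setup along these lines could answer that. This is purely an illustrative sketch (table and column names are made up, SQL Server assumed): build a 100,000-row dimension carrying both a GUID key and an int key, fan it out into a 1,000,000-row fact with both key variants, then import each variant into Power BI and compare refresh time and VertiPaq size in DAX Studio.

-- Illustrative benchmark setup (made-up names): a 100,000-row dimension with
-- both key variants, and a 1,000,000-row fact that carries both, so the same
-- workload can be imported twice and compared in Power BI / DAX Studio.
with numbers as (
    select top (100000)
        row_number() over (order by (select null)) as n
    from sys.all_objects a
    cross join sys.all_objects b
)
select
    n                       as dim_int_key,
    newid()                 as dim_guid_key,
    concat('attr_', n % 50) as some_attribute
into dbo.bench_dim
from numbers;

with numbers as (
    select top (1000000)
        row_number() over (order by (select null)) as n
    from sys.all_objects a
    cross join sys.all_objects b
)
select
    numbers.n       as fact_id,
    d.dim_int_key,
    d.dim_guid_key,
    cast(rand(checksum(newid())) * 100 as decimal(10, 2)) as spend
into dbo.bench_fact
from numbers
join dbo.bench_dim d
    on d.dim_int_key = (numbers.n % 100000) + 1;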

David-Berumen commented 1 year ago

Hey @infused-kim, did you continue testing more about this issue? I don't want to go the hash route because that will just increase the size of all the things I am importing to Power BI, but there does not seem to be a better alternative (I am using Snowflake) 🤔

infused-kim commented 1 year ago

> Hey @infused-kim, did you continue testing more about this issue? I don't want to go the hash route because that will just increase the size of all the things I am importing to Power BI, but there does not seem to be a better alternative (I am using Snowflake) 🤔

I have been using hash keys for the time being, but I think my solution above should work.

David-Berumen commented 1 year ago

Yeah, it should work. We are making our own attempt at int surrogate keys following these steps:

We intentionally kept surrogate key generation outside of dbt to avoid issues with multiple environments in different databases and schemas having different surrogate keys.