bcodell / dbt-activity-schema

A dbt package with a POC implementation of an interface to query activity streams that adhere to the Activity Schema 2.0 spec.
Apache License 2.0
12 stars 2 forks

bug - build_activity macro needs explicitly set null_columns when using anonymous_customer_id #47

Closed awoehrl closed 10 months ago

awoehrl commented 10 months ago

I forgot about this because I had added the null_columns attribute when doing my first tests with dbt-aql. If I don't explicitly set null_columns, my anonymous customer column is nulled. I'm not sure whether this is a bug or just missing from the documentation, though.

dbt_project

vars:
  dbt_aql:
    column_configuration:
      included_optional_columns:
        - link
        - revenue_impact
      column_aliases:
        feature_json: feature_json
    streams:
      customer_stream:
        customer_id_alias: customer_id
        anonymous_customer_id_alias: anonymous_customer_id
        model_prefix: customer__
        skip_stream: true
      test_stream:
        customer_id_alias: entity_id
        anonymous_customer_id_alias: anonymous_entity_id
        model_prefix: test__
      third_stream:
        customer_id_alias: entity_id
        model_prefix: third__

Example activity

{{
    config(
        materialized='table',
        stream='test_stream',
        data_types={
            'article_id': type_string(),
        }
    )
}}

with base as (
  select distinct
    event_id || start_tstamp || app_id as page_view_start_tstamp_id,
    user_id                            as {{ dbt_aql.customer_column('test_stream') }},
    domain_userid                      as {{ dbt_aql.anonymous_customer_column('test_stream') }},
    start_tstamp                       as {{ dbt_aql.schema_columns().ts }},

    --features_json
    content.article_id                 as article_id

  from {{ ref('page_views_extended') }}
  where date(start_tstamp) >= current_date() - 28
)

{{ dbt_aql.build_activity(
    cte='base',
    unique_id_column='page_view_start_tstamp_id'
) }}

Output. Here anonymous_entity_id is nulled (cast(null as STRING) as anonymous_entity_id):

    create or replace table `dwh-project`.`dev_awoehrl`.`test__viewed_page2`

    OPTIONS(
      expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 168 hour),

      description=""""""
    )
    as (

with base as (
  select
    distinct
    event_id || start_tstamp || app_id as page_view_start_tstamp_id,
    user_id                    as entity_id,
    domain_userid              as anonymous_entity_id,
    start_tstamp               as ts,

    --features_json
    content.article_id         as article_id

  from `dwh-project`.`dev_awoehrl`.`page_views_extended`
  where
    date(start_tstamp) >= current_date() - 28

)

select
    cast(cast(to_hex(md5(cast(coalesce(cast(entity_id as STRING), '_dbt_utils_surrogate_key_null_') || '-' || coalesce(cast(ts as STRING), '_dbt_utils_surrogate_key_null_') || '-' || coalesce(cast('viewed page2' as STRING), '_dbt_utils_surrogate_key_null_') || '-' || coalesce(cast(page_view_start_tstamp_id as STRING), '_dbt_utils_surrogate_key_null_') as STRING))) as STRING) as STRING) as activity_id
    , cast(entity_id as STRING) as entity_id
    , cast(null as STRING) as anonymous_entity_id
    , cast('viewed page2' as STRING) as activity
    , cast(ts as TIMESTAMP) as ts
    , cast(null as INT64) as revenue_impact
    , cast(null as STRING) as link
    , to_json(struct(
        article_id as article_id
        )) as feature_json
    , row_number() over (
        partition by entity_id
        order by ts, cast(to_hex(md5(cast(coalesce(cast(entity_id as STRING), '_dbt_utils_surrogate_key_null_') || '-' || coalesce(cast(ts as STRING), '_dbt_utils_surrogate_key_null_') || '-' || coalesce(cast('viewed page2' as STRING), '_dbt_utils_surrogate_key_null_') || '-' || coalesce(cast(page_view_start_tstamp_id as STRING), '_dbt_utils_surrogate_key_null_') as STRING))) as STRING)
    ) as activity_occurrence
    , lead(cast(ts as TIMESTAMP)) over (
        partition by entity_id
        order by ts, cast(to_hex(md5(cast(coalesce(cast(entity_id as STRING), '_dbt_utils_surrogate_key_null_') || '-' || coalesce(cast(ts as STRING), '_dbt_utils_surrogate_key_null_') || '-' || coalesce(cast('viewed page2' as STRING), '_dbt_utils_surrogate_key_null_') || '-' || coalesce(cast(page_view_start_tstamp_id as STRING), '_dbt_utils_surrogate_key_null_') as STRING))) as STRING)
    ) as activity_repeated_at
from base

    );  

Explicitly setting null_columns works:

{{
    config(
        materialized='incremental',
        stream='test_stream',
        data_types={
            'article_id': type_string(),
        }
    )
}}

with base as (
  select distinct
    event_id || start_tstamp || app_id as page_view_start_tstamp_id,
    user_id                    as {{ dbt_aql.customer_column('test_stream') }},
    domain_userid              as {{ dbt_aql.anonymous_customer_column('test_stream') }},
    start_tstamp               as {{ dbt_aql.schema_columns().ts }},

    --features_json
    content.article_id         as article_id

  from {{ ref('page_views_extended') }}
  where
    date(start_tstamp) >= current_date() - 28
    {% if is_incremental() %}
      and date(start_tstamp)
      > coalesce((select max(date(start_tstamp)) from {{ this }}), '1900-01-01')
    {% endif %}
)

{{ dbt_aql.build_activity(
    cte='base',
    unique_id_column='page_view_start_tstamp_id',
    null_columns=['link','revenue_impact']
) }}
bcodell commented 10 months ago

Thanks for the flag @awoehrl - I could use your input here if you're open to it. My rationale is that more often than not, the columns designated as optional in the activity schema spec will be null for the majority of activities for a given activity schema implementation, which is why I have the null_columns set up with the defaults they currently have. I'm almost certain that my assumption is true for link and revenue_impact, but I haven't used anonymous_customer_id in any activity schema setups for myself yet, so I'm not sure how often that column is left null in practice. For the pipeline you're building, what proportion of activities have a non-null anonymous_customer_id and what proportion have a null anonymous_customer_id?

awoehrl commented 10 months ago

Makes sense @bcodell! Our case is probably almost the opposite, as we are working with lots of web analytics data. We have a very high number of anonymous customer ids and, compared to that, almost no customer ids. Also, link will be filled in many cases with the URL where the event happened. In numbers, our biggest activity, customer__viewed_article, has around 10,000 unique customer ids associated per day and around 1.3 million anonymous customer ids. For most events we have an anonymous id, since these are generated via web analytics (cookie ids). A second stream would be around produced articles; there we have a customer id and no anonymous id, but this only concerns around 1,000 events per day.

bcodell commented 10 months ago

Ah interesting. Based on your perspective, maybe my intuition is off and a better assumption to make is that if a developer specifies that they want to include any of the optional columns (link, revenue_impact, anonymous_customer_id), then the build_activity macro should assume by default that the developer will explicitly specify each of those optional columns. In that case, I'd change the default null_columns argument in the build_activity macro to be none, derive which columns to look for based on the stream config, and for specific activities that don't use optional columns, the user can specify the columns that aren't used in the null_columns argument. Does that seem like a reasonable change?
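To make the first approach concrete, here is a minimal Python sketch of the proposed defaulting logic (function and config key names are assumptions for illustration, not the actual dbt-aql implementation):

```python
# Hypothetical sketch of the first proposal, NOT the real dbt-aql code:
# null_columns defaults to None, and the set of optional columns to null
# is derived from the stream config plus any per-activity opt-out.

OPTIONAL_COLUMNS = ("anonymous_customer_id", "link", "revenue_impact")

def columns_to_null(stream_config, included_optional_columns, null_columns=None):
    """Decide which optional columns build_activity should cast to null.

    stream_config: the per-stream vars block (assumed shape); the presence
    of 'anonymous_customer_id_alias' means the stream uses
    anonymous_customer_id.
    """
    # Optional columns the stream actually declares.
    declared = set(included_optional_columns)
    if "anonymous_customer_id_alias" in stream_config:
        declared.add("anonymous_customer_id")

    if null_columns is None:
        # Proposed default: assume the model populates every declared
        # optional column, so nothing is nulled implicitly.
        nulled = set()
    else:
        # Per-activity opt-out for columns the model doesn't use.
        nulled = set(null_columns)

    # Columns outside the stream's declared set are always nulled.
    return nulled | (set(OPTIONAL_COLUMNS) - declared)
```

Under this sketch, the test_stream model from the issue would populate anonymous_entity_id without any null_columns argument, while third_stream (which has no anonymous_customer_id_alias) would still null that column automatically.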

If not, an alternative is to keep the null_columns argument in the build_activity macro, and allow users to override the defaults - likely as a config option in the stream config in dbt_project.yml.

I'm leaning towards the first approach I described - it's probably best for deviations from the standard configuration to be explicit. But let me know what you think!

bcodell commented 10 months ago

Notes from conversation: