fivetran / dbt_shopify_holistic_reporting

Fivetran's Shopify Holistic Reporting dbt package
https://fivetran.github.io/dbt_shopify_holistic_reporting
Apache License 2.0
6 stars 7 forks source link

[Bug] Duplicate Rows in shopify_holistic_reporting__daily_customer_metrics #14

Closed jmussitsch closed 1 year ago

jmussitsch commented 1 year ago

Is there an existing issue for this?

Describe the issue

In certain scenarios the join/coalesce logic in the combine_histories CTE is resulting in duplicated data. Consider the following (I have removed aggregate columns for brevity):

The result set from the klaviyo_daily CTE contains the following rows:

DATE_DAY   EMAIL                  LAST_TOUCH_CAMPAIGN_ID LAST_TOUCH_FLOW_ID CAMPAIGN_NAME FLOW_NAME               VARIATION_ID CAMPAIGN_SUBJECT_LINE CAMPAIGN_TYPE SOURCE_RELATION SUM_REVENUE_PLACED_ORDER COUNT_PLACED_ORDER COUNT_OPENED_EMAIL
---------- ---------------------- ---------------------- ------------------ ------------- ----------------------- ------------ --------------------- ------------- --------------- ------------------------ ------------------ ------------------
2023-04-12 joe.smith@gmail.com    (null)                 xyz123             (null)        Abandoned Checkout      (null)       (null)                (null)                        58                       1                  3
2023-04-12 joe.smith@gmail.com    (null)                 abc789             (null)        YEB - New Customer Flow (null)       (null)                (null)                        0                        0                  1     

The result set from the shopify_daily CTE contains:

DATE_DAY   EMAIL                  LAST_TOUCH_CAMPAIGN_ID LAST_TOUCH_FLOW_ID LAST_TOUCH_CAMPAIGN_NAME LAST_TOUCH_FLOW_NAME TOTAL_ORDERS TOTAL_PRICE
---------- ---------------------- ---------------------- ------------------ ------------------------ -------------------- ------------ -----------
2023-04-12 joe.smith@gmail.com    (null)                 xyz123             (null)                   Abandoned Checkout   1            57.86

Because the shopify row has a flow id and the coalesce gives precedence to shopify we end up with two rows both with the shopify order related flow even though the other flow from klaviyo has no order:

DATE_DAY   EMAIL                  CAMPAIGN_ID FLOW_ID CAMPAIGN_NAME FLOW_NAME          shopify_TOTAL_ORDERS shopify_TOTAL_PRICE klaviyo_SUM_REVENUE_PLACED_ORDER klaviyo_COUNT_PLACED_ORDER klaviyo_COUNT_OPENED_EMAIL
---------- ---------------------- ----------- ------- ------------- ------------------ -------------------- ------------------- -------------------------------- -------------------------- --------------------------
2023-04-12 joe.smith@gmail.com    (null)      xyz123  (null)        Abandoned Checkout 1                    57.86               58                               1                          3
2023-04-12 joe.smith@gmail.com    (null)      xyz123  (null)        Abandoned Checkout 1                    57.86               0                                0                          1

Relevant error log or model output

No response

Expected behavior

I would expect 2 rows but having different flows:

DATE_DAY   EMAIL                  CAMPAIGN_ID FLOW_ID CAMPAIGN_NAME FLOW_NAME                shopify_TOTAL_ORDERS shopify_TOTAL_PRICE klaviyo_SUM_REVENUE_PLACED_ORDER klaviyo_COUNT_PLACED_ORDER klaviyo_COUNT_OPENED_EMAIL
---------- ---------------------- ----------- ------- ------------- ------------------       -------------------- ------------------- -------------------------------- -------------------------- --------------------------
2023-04-12 joe.smith@gmail.com    (null)      xyz123  (null)        Abandoned Checkout       1                    57.86               58                               1                          3
2023-04-12 joe.smith@gmail.com    (null)      abc789  (null)        YEB - New Customer Flow  0                    0                   0                                0                          1

dbt Project configurations

name: 'chabi'
version: '1.0.0'
config-version: 2
profile: "{{ env_var('DBT_PROFILE', '') }}"
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"  
clean-targets:        
  - "target"
  - "dbt_packages"
models:
  klaviyo:
    +schema:
    intermediate:
      +schema: stg
  klaviyo_source:
    +schema: stg
  shopify:
    +schema:
  shopify_source:
    +schema: stg
  shopify_holistic_reporting:
    +schema:
    intermediate:
      +schema: stg

vars:
  klaviyo_schema: klaviyo
  shopify_schema: shopify

Package versions

packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
  - package: dbt-labs/codegen
    version: 0.9.0
  - package: fivetran/klaviyo
    version: 0.5.0
  - package: fivetran/shopify
    version: 0.7.0
  - package: fivetran/shopify_holistic_reporting
    version: 0.3.0

What database are you using dbt with?

snowflake

dbt Version

Core:
  - installed: 1.4.1
  - latest:    1.4.5 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - snowflake: 1.4.0 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Additional Context

No response

Are you willing to open a PR to help address this issue?

fivetran-jamie commented 1 year ago

hey @jmussitsch 👋 thanks for explaining the issue so thoroughly! it's clear that we need to add some logic to de-dupe these instances

in your example, would you expect aggregate columns for flow xyz123 on this day to be summed up together across the two platforms? or would you elect to choose shopify vs klaviyo aggregates specifically?

i'm thinking we need to add a hefty and dynamic group by after joining the shopify and klaviyo daily CTEs and sum up the aggregate metrics

jmussitsch commented 1 year ago

@fivetran-jamie thanks for looking into this!

I apologize - I just noticed I had mixed up the shopify_daily and klaviyo_daily CTE names above in my original comment. I just fixed it. I also added a few aggregate and aggregate-related columns to the example tables above. Let me know if this makes clear what my expectation would be and answers your question.

fivetran-jamie commented 1 year ago

thanks for updating, that does make things clearer! still trying to recreate the issue in our environment and test this out, but i'm wondering if we just need to add the flows/campaigns/variations to the full outer join between the 2 CTEs, since they are part of the grain of this model

so like

    from shopify_daily
    full outer join klaviyo_daily
        on lower(shopify_daily.email) = lower(klaviyo_daily.email)
        and shopify_daily.date_day = klaviyo_daily.date_day
        and shopify_daily.last_touch_campaign_id = klaviyo_daily.last_touch_campaign_id
        and shopify_daily.last_touch_flow_id = klaviyo_daily.last_touch_flow_id
        and shopify_daily.last_touch_variation_id = klaviyo_daily.variation_id

@jmussitsch if you have time would you mind trying out the following branch? it includes the above join logic

# packages.yml
packages:
  - git: https://github.com/fivetran/dbt_shopify_holistic_reporting.git
    revision: scope/issue-14

i have other ideas if this doesn't work... 🤔

jmussitsch commented 1 year ago

@fivetran-jamie I tried your fix but I think the issue is that wont treat columns where both values are null as equal for the full outer join... since in most databases NULL = NULL is NULL not true. Because of this I get extra rows.

I am on Snowflake and don't author packages for use across databases so not sure how you guys typically handle null safe equality. For quick testing purposes I did this and got the results I would expect:

  and coalesce(shopify_daily.last_touch_campaign_id, '') = coalesce(klaviyo_daily.last_touch_campaign_id, '')
  and coalesce(shopify_daily.last_touch_flow_id, '') = coalesce(klaviyo_daily.last_touch_flow_id, '')
  and coalesce(shopify_daily.last_touch_variation_id, '') = coalesce(klaviyo_daily.variation_id, '')
fivetran-jamie commented 1 year ago

ah yeah we definitely need those coalesces!! BigQuery really doesn't likenull = null either.

i just pushed your fix to the working branch of the package if you have time to re-dbt deps and run again as a sanity check @jmussitsch

jmussitsch commented 1 year ago

@fivetran-jamie I ran this against our data and did a few checks and it looks good

fivetran-jamie commented 1 year ago

great to hear! we'll be pushing a release of the package with this fix in it next sprint (starts tomorrow). there are just some other package updates we should/need to batch with this one

fivetran-jamie commented 1 year ago

hey @jmussitsch we've just released this in v0.4.0! this release also includes some upstream breaking changes from shopify v0.8.* FYI

jmussitsch commented 1 year ago

@fivetran-jamie great thanks!