Closed jmussitsch closed 1 year ago
hey @jmussitsch 👋 thanks for explaining the issue so thoroughly! it's clear that we need to add some logic to de-dupe these instances
in your example, would you expect aggregate columns for flow xyz123
on this day to be summed up together across the two platforms? or would you elect to choose shopify vs klaviyo aggregates specifically?
i'm thinking we need to add a hefty and dynamic group by
after joining the shopify and klaviyo daily CTEs and sum up the aggregate metrics
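something like this, maybe (just a sketch of the idea — the column list and aggregate names here are hypothetical, not the package's actual code):

```sql
-- hypothetical sketch of the group-by/de-dupe idea; "joined" stands in for
-- the result of joining the shopify_daily and klaviyo_daily CTEs, and the
-- aggregate column names are assumed for illustration
select
    date_day,
    email,
    last_touch_campaign_id,
    last_touch_flow_id,
    last_touch_variation_id,
    sum(order_count)  as order_count,
    sum(total_price)  as total_price
from joined
group by 1, 2, 3, 4, 5
```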
@fivetran-jamie thanks for looking into this!
I apologize - I just noticed I had mixed up the shopify_daily and klaviyo_daily CTE names above in my original comment. I just fixed it. I also added a few aggregate and aggregate-related columns to the example tables above. Let me know if this makes clear what my expectation would be and answers your question.
thanks for updating, that does make things clearer! still trying to recreate the issue in our environment and test this out, but i'm wondering if we just need to add the flows/campaigns/variations to the full outer join
between the 2 CTEs, since they are part of the grain of this model
so like
```sql
from shopify_daily
full outer join klaviyo_daily
    on lower(shopify_daily.email) = lower(klaviyo_daily.email)
    and shopify_daily.date_day = klaviyo_daily.date_day
    and shopify_daily.last_touch_campaign_id = klaviyo_daily.last_touch_campaign_id
    and shopify_daily.last_touch_flow_id = klaviyo_daily.last_touch_flow_id
    and shopify_daily.last_touch_variation_id = klaviyo_daily.variation_id
```
@jmussitsch if you have time would you mind trying out the following branch? it includes the above join logic
```yml
# packages.yml
packages:
  - git: https://github.com/fivetran/dbt_shopify_holistic_reporting.git
    revision: scope/issue-14
```
i have other ideas if this doesn't work... 🤔
@fivetran-jamie I tried your fix, but I think the issue is that the full outer join won't treat columns where both values are null as equal... since in most databases `NULL = NULL` evaluates to NULL, not true. Because of this I get extra rows.
I am on Snowflake and don't author packages for use across databases, so I'm not sure how you all typically handle null-safe equality. For quick testing purposes I did this and got the results I would expect:
```sql
and coalesce(shopify_daily.last_touch_campaign_id, '') = coalesce(klaviyo_daily.last_touch_campaign_id, '')
and coalesce(shopify_daily.last_touch_flow_id, '') = coalesce(klaviyo_daily.last_touch_flow_id, '')
and coalesce(shopify_daily.last_touch_variation_id, '') = coalesce(klaviyo_daily.variation_id, '')
```
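For anyone following along, here's a quick way to see the three-valued-logic behavior I mean (standard SQL, should run on most warehouses):

```sql
-- NULL = NULL evaluates to NULL (unknown), so as a join predicate it never
-- matches; coalescing both sides to '' makes the comparison true when both
-- columns are null
select
    null = null                             as plain_equals,     -- NULL
    coalesce(null, '') = coalesce(null, '') as null_safe_equals  -- true
```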
ah yeah we definitely need those coalesces!! BigQuery really doesn't like `null = null` either.
i just pushed your fix to the working branch of the package if you have time to re-run `dbt deps` and run again as a sanity check @jmussitsch
@fivetran-jamie I ran this against our data and did a few checks and it looks good
great to hear! we'll be pushing a release of the package with this fix in it next sprint (starts tomorrow). there are just some other package updates we should/need to batch with this one
@fivetran-jamie great thanks!
Is there an existing issue for this?
Describe the issue
In certain scenarios the join/coalesce logic in the
combine_histories
CTE is resulting in duplicated data. Consider the following (I have removed aggregate columns for brevity).
The result set from the
klaviyo_daily
CTE contains the following rows: [table omitted]
The result set from the
shopify_daily
CTE contains: [table omitted]
Because the shopify row has a flow id and the coalesce gives precedence to shopify, we end up with two rows, both with the shopify order-related flow, even though the other flow from klaviyo has no order:
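The precedence pattern being described looks roughly like this (a sketch of the shape of the logic, with assumed column names — not the package's exact code):

```sql
-- sketch of the coalesce-precedence behavior described above; because the
-- join matches only on email + date_day, a klaviyo row with a different
-- flow still joins the shopify row, and the coalesce below overwrites its
-- flow id with shopify's, producing the duplicate
select
    coalesce(shopify_daily.date_day, klaviyo_daily.date_day)                     as date_day,
    coalesce(shopify_daily.email, klaviyo_daily.email)                           as email,
    coalesce(shopify_daily.last_touch_flow_id, klaviyo_daily.last_touch_flow_id) as flow_id
from shopify_daily
full outer join klaviyo_daily
    on lower(shopify_daily.email) = lower(klaviyo_daily.email)
    and shopify_daily.date_day = klaviyo_daily.date_day
```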
The
`dbt_utils_unique_combination_of_columns_shopify_holistic_reporting__daily_customer_metrics_date_day__email__klaviyo_source_relation__shopify_source_relation__campaign_id__flow_id__variation_id`
test fails.
Relevant error log or model output
No response
Expected behavior
I would expect 2 rows, but with different flows:
dbt Project configurations
Package versions
What database are you using dbt with?
snowflake
dbt Version
Additional Context
No response
Are you willing to open a PR to help address this issue?