Open casparwylie opened 1 year ago
Hi @casparwylie thank you for raising this issue. Did Fivetran support say why these duplicates are not a mistake?
I also was looking at a past issue #12 and saw another customer experienced a similar error due to the territory_name being slightly different. Would you be able to share a few of these duplicate records so we may see if they have some differences that we may be able to correct within the package.
If they do not differ, I would encourage reaching out to Apple to understand why there are duplicates in the source data. I do not believe there should be duplicate subscription report entries in the raw data. That seems like a data integrity issue that this failed test is appropriately flagging.
To clarify, the rows are still unique by the fivetran/meta fields, just often not by date_day, account_id, app_id, subscription_name, territory_long, state
.
We are seeing plenty of duplicate rows where every column except the meta columns (e.g _index
) are the same in sales_subscription_event_summary
.
@fivetran-markgaughran am I right in saying from your end, the duplicates are expected?
Hi @fivetran-joemarkiewicz @casparwylie ,
duplicates do not exist in the SALES_SUBSCRIPTION_EVENT_SUMMARY
table for the Fivetran assigned Primary keys but they do appear to exist for the transformation output unique keys (date_day, account_id, app_id, subscription_name, territory_long, state), thus causing the transformation to fail.
Thanks for adding context @casparwylie and @fivetran-markgaughran.
@casparwylie would you be able to share an example of a duplicate in the apple_store__subscription_report
? Mainly I would be curious to take a look at the territory_long
field as this has caused some issues in the past with Apple not being consistent with territory naming. An example of the duplicate record will help us understand what next steps may be needed to resolve the error.
I'm not sure why, but the tests are now passing, likely due to a new historic sync. I now can't find examples other than what I described above! I'm closing the issue. Thank you both.
Apologies but the issue as resurfaced now. Here is are 2 fresh examples in JSON result format given the query
SELECT date_day, account_id, app_id, subscription_name, territory_long, state, count(*) as qty
FROM `project.apple_store.apple_store__subscription_report`
GROUP BY date_day, account_id, app_id, subscription_name, territory_long, state
HAVING count(*)> 1
(in total there are 211183
results)
[{
"date_day": "2022-07-21",
"account_id": "<our account id>",
"app_id": null,
"subscription_name": "Offer name",
"territory_long": "Armenia",
"state": null,
"qty": "2"
}, {
"date_day": "2022-11-10",
"account_id": "<our account id>",
"app_id": null,
"subscription_name": "Offer name",
"territory_long": "Armenia",
"state": null,
"qty": "2"
}]
Let me know your thoughts. Thank you.
@fivetran-joemarkiewicz Hey - just wondering if any updates on this! Thanks.
Hi @casparwylie I am sorry to see that the issue has resurfaced. Would you be able to share the select *
of one of those duplicates you came across? I am wondering if this is in fact a duplicate issue that needs to be traced back to the source or code logic in the package, or if this is a scenario where we simply need to update our tests to factor in more than the specified fields for uniqueness.
I've included the query that fetched all the rows (and hidden some more sensitive properties) in the previous comment. Is there any column(s) in particular you'd be keen to see?
Yeah I am wondering if there are any columns where you saw the rows were not unique? If they are sensitive no need to share, but I am curious if rows were duplicates across every single field?
Additionally, it would be worthwhile to check the source again and make sure these duplicates don't exist there.
Yea so we are seeing plenty of duplicate rows where every column except the meta columns (e.g _index) are the same in sales_subscription_event_summary
. However, unrelated, it looks like app_name
is the only difference in the duplicates in apple_store__subscription_report
. I suppose if the app_name
changes in the app store, the report here causes duplicates.
@casparwylie thank you for sharing! The insight into the app_name
duplicates does make sense and probably is something we should update in our test to account for the name of the app as that may change.
However, I am still struggling with the duplicates in the source that are only not duplicates due to the Fivetran metadata columns. Would you be interested in meeting sometime this week for my team and I to review these live with you and determine the best approach forward?
My team aren't going to be looking at this anymore so probably not neccassary - the first issue mentioned is probably the main one though in case you're keen to look into it further! Thanks anyway.
Is there an existing issue for this?
Describe the issue
The test case
unique_combination_of_columns
in theapple_store__subscription_report
table fails because the data does seem to have duplicates (where multiple rows have the samedate_day
,account_id
,app_id
,subscription_name
,territory_long
,state
).Looking into it further, the duplicates seem to exist in the transformed data from
app_store.sales_subscription_event_summary
(though the Fivetran assigned Primary keys are still all distinct).I raised a ticket with Fivetran support who have said the duplicates are not a mistake, meaning I believe the test in the DBT package should change.
Relevant error log or model output
No response
Expected behavior
Tests should pass regardless of duplicates for the listed fields.
dbt Project configurations
N/A
Package versions
0.3.1
What database are you using dbt with?
bigquery
dbt Version
1.5.0
Additional Context
No response
Are you willing to open a PR to help address this issue?