fivetran / dbt_apple_store

Fivetran's Apple App Store source dbt package
https://fivetran.github.io/dbt_apple_store/
Apache License 2.0
0 stars 2 forks source link

Tests fail due to apparent duplicate rows in apple_store__subscription_report #15

Open casparwylie opened 1 year ago

casparwylie commented 1 year ago

Is there an existing issue for this?

Describe the issue

The test case unique_combination_of_columns in the apple_store__subscription_report table fails because the data does seem to have duplicates (where multiple rows have the same date_day, account_id, app_id, subscription_name, territory_long, state).

Looking into it further, the duplicates seem to exist in the transformed data from app_store.sales_subscription_event_summary (though the Fivetran assigned Primary keys are still all distinct).

I raised a ticket with Fivetran support who have said the duplicates are not a mistake, meaning I believe the test in the DBT package should change.

Relevant error log or model output

No response

Expected behavior

Tests should pass regardless of duplicates for the listed fields.

dbt Project configurations

N/A

Package versions

0.3.1

What database are you using dbt with?

bigquery

dbt Version

1.5.0

Additional Context

No response

Are you willing to open a PR to help address this issue?

fivetran-joemarkiewicz commented 1 year ago

Hi @casparwylie thank you for raising this issue. Did Fivetran support say why these duplicates are not a mistake?

I also was looking at a past issue #12 and saw another customer experienced a similar error due to the territory_name being slightly different. Would you be able to share a few of these duplicate records so we may see if they have some differences that we may be able to correct within the package.

If they do not differ, I would encourage reaching out to Apple to understand why there are duplicates in the source data. I do not believe there should be duplicate subscription report entries in the raw data. That seems like a data integrity issue that this failed test is appropriately flagging.

casparwylie commented 1 year ago

To clarify, the rows are still unique by the fivetran/meta fields, just often not by date_day, account_id, app_id, subscription_name, territory_long, state.

We are seeing plenty of duplicate rows where every column except the meta columns (e.g _index) are the same in sales_subscription_event_summary.

@fivetran-markgaughran am I right in saying from your end, the duplicates are expected?

fivetran-markgaughran commented 1 year ago

Hi @fivetran-joemarkiewicz @casparwylie ,

duplicates do not exist in the SALES_SUBSCRIPTION_EVENT_SUMMARY table for the Fivetran assigned Primary keys but they do appear to exist for the transformation output unique keys (date_day, account_id, app_id, subscription_name, territory_long, state), thus causing the transformation to fail.

fivetran-joemarkiewicz commented 1 year ago

Thanks for adding context @casparwylie and @fivetran-markgaughran.

@casparwylie would you be able to share an example of a duplicate in the apple_store__subscription_report? Mainly I would be curious to take a look at the territory_long field as this has caused some issues in the past with Apple not being consistent with territory naming. An example of the duplicate record will help us understand what next steps may be needed to resolve the error.

casparwylie commented 1 year ago

I'm not sure why, but the tests are now passing, likely due to a new historic sync. I now can't find examples other than what I described above! I'm closing the issue. Thank you both.

casparwylie commented 1 year ago

Apologies but the issue as resurfaced now. Here is are 2 fresh examples in JSON result format given the query

SELECT date_day, account_id, app_id, subscription_name, territory_long, state, count(*) as qty 
 FROM `project.apple_store.apple_store__subscription_report` 
 GROUP BY date_day, account_id, app_id, subscription_name, territory_long, state
 HAVING count(*)> 1

(in total there are 211183 results)

[{
    "date_day": "2022-07-21",
    "account_id": "<our account id>",
    "app_id": null,
    "subscription_name": "Offer name",
    "territory_long": "Armenia",
    "state": null,
    "qty": "2"
}, {
    "date_day": "2022-11-10",
    "account_id": "<our account id>",
    "app_id": null,
    "subscription_name": "Offer name",
    "territory_long": "Armenia",
    "state": null,
    "qty": "2"
}]

Let me know your thoughts. Thank you.

casparwylie commented 1 year ago

@fivetran-joemarkiewicz Hey - just wondering if any updates on this! Thanks.

fivetran-joemarkiewicz commented 1 year ago

Hi @casparwylie I am sorry to see that the issue has resurfaced. Would you be able to share the select * of one of those duplicates you came across? I am wondering if this is in fact a duplicate issue that needs to be traced back to the source or code logic in the package, or if this is a scenario where we simply need to update our tests to factor in more than the specified fields for uniqueness.

casparwylie commented 1 year ago

I've included the query that fetched all the rows (and hidden some more sensitive properties) in the previous comment. Is there any column(s) in particular you'd be keen to see?

fivetran-joemarkiewicz commented 1 year ago

Yeah I am wondering if there are any columns where you saw the rows were not unique? If they are sensitive no need to share, but I am curious if rows were duplicates across every single field?

Additionally, it would be worthwhile to check the source again and make sure these duplicates don't exist there.

casparwylie commented 1 year ago

Yea so we are seeing plenty of duplicate rows where every column except the meta columns (e.g _index) are the same in sales_subscription_event_summary. However, unrelated, it looks like app_name is the only difference in the duplicates in apple_store__subscription_report. I suppose if the app_name changes in the app store, the report here causes duplicates.

fivetran-joemarkiewicz commented 1 year ago

@casparwylie thank you for sharing! The insight into the app_name duplicates does make sense and probably is something we should update in our test to account for the name of the app as that may change.

However, I am still struggling with the duplicates in the source that are only not duplicates due to the Fivetran metadata columns. Would you be interested in meeting sometime this week for my team and I to review these live with you and determine the best approach forward?

casparwylie commented 1 year ago

My team aren't going to be looking at this anymore so probably not neccassary - the first issue mentioned is probably the main one though in case you're keen to look into it further! Thanks anyway.