airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Source TikTok Marketing: Incorrect PK resulting in deleted data #23728

Closed MatthieuColinBM closed 1 year ago

MatthieuColinBM commented 1 year ago

Environment

Current Behavior

When using the TikTok Marketing source, every <object>_daily_reports stream synced in Incremental | Deduped + History mode yields only one row per <object>_id in the denormalized table. For example, on the ad_groups_reports_daily stream the configured PK is adgroup_id. Normalization creates a table called ad_groups_reports_daily, and after a few days of runs that table contains one row per adgroup_id. The result is partial data: only the most recent data point for a given ad group, instead of all of the data points available for that ad group.

This behavior is the same for every layer of the TikTok data (i.e. advertisers, campaigns, ad groups, and ads) and for every xxx_report_daily stream. We did not have the opportunity to test the xxx_report_hourly and xxx_report_lifetime streams, nor the audience reports.

Expected Behavior

We would expect one row per <object> per stat_time_day.

For example, in the same scenario we would expect one row per adgroup_id per stat_time_day.
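To make the difference concrete, here is a minimal, self-contained Python sketch (not Airbyte code; the sample rows are hypothetical) showing how deduplicating on adgroup_id alone collapses a multi-day history to a single row, while a composite key keeps one row per day:

```python
# Hypothetical daily report rows: two ad groups, two days of data.
records = [
    {"adgroup_id": 1, "stat_time_day": "2023-03-01", "spend": 10.0},
    {"adgroup_id": 1, "stat_time_day": "2023-03-02", "spend": 12.5},
    {"adgroup_id": 2, "stat_time_day": "2023-03-01", "spend": 7.0},
]

def dedup(rows, pk):
    """Keep only the last row seen for each primary-key value."""
    kept = {}
    for row in rows:
        kept[tuple(row[col] for col in pk)] = row
    return list(kept.values())

# Current behavior: PK = adgroup_id only -> one row per ad group.
print(len(dedup(records, ["adgroup_id"])))                    # 2

# Expected behavior: composite PK -> one row per ad group per day.
print(len(dedup(records, ["adgroup_id", "stat_time_day"])))   # 3
```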

Possible Solution

One possible solution could be to make the primary key composite by adding stat_time_day, so that the dedup process keys on both <object>_id and stat_time_day. The BingAds connector is an example of a connector that defines composite PKs for its report streams, which should have the exact same behavior (almost all media platforms work the same way, with an X-day window during which the data can be updated).
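As an illustration, here is a hedged sketch of how a composite primary key can be declared on an Airbyte Python CDK stream. The class name and the empty record source are illustrative assumptions, not the actual source-tiktok-marketing code:

```python
from typing import Any, Iterable, List, Mapping, Optional

from airbyte_cdk.sources.streams import Stream


class AdGroupsReportsDaily(Stream):
    # Composite PK: dedup keys on both the object id and the report date,
    # so normalization keeps one row per adgroup_id per stat_time_day.
    primary_key: Optional[List[str]] = ["adgroup_id", "stat_time_day"]

    def read_records(self, sync_mode, **kwargs) -> Iterable[Mapping[str, Any]]:
        # Placeholder: the real stream would page through the TikTok
        # reporting API here.
        yield from []
```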

Logs

-

Steps to Reproduce

  1. Create a connection using TikTok Marketing as the source and BigQuery as the destination.
  2. Enable the ad_groups_reports_daily stream and set it to Incremental | Deduped + History mode.
  3. Run the connection for at least 2 days.
  4. Take a look at the resulting (denormalized) data; a query sketch for this check follows below.
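For step 4, a minimal sketch of that check using the google-cloud-bigquery client (the dataset name is hypothetical; adjust it to your destination's namespace):

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses default credentials and project

query = """
    SELECT adgroup_id, COUNT(*) AS rows_kept
    FROM `my_dataset.ad_groups_reports_daily`  -- hypothetical dataset name
    GROUP BY adgroup_id
    ORDER BY rows_kept
"""

for row in client.query(query).result():
    # With the bug, rows_kept is 1 for every ad group even after 2+ days
    # of syncs; with a composite PK it should grow by one per daily run.
    print(row.adgroup_id, row.rows_kept)
```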

Are you willing to submit a PR?

I don't feel I would be able to 😄

rach-r commented 1 year ago

Also mentioned in the Slack channel here

grubberr commented 1 year ago

This problem was solved in this PR: https://github.com/airbytehq/airbyte/pull/24630