airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

[source-mixpanel] event export stream duplicate insert when time is 23:59:59 UTC #45068

Open descampsk opened 3 months ago

descampsk commented 3 months ago

Connector Name

source-mixpanel

Connector Version

v3.4.1

What step the error happened?

During the sync

Relevant information

The Mixpanel export stream syncs full days. When we sync it every night around 3am, every event whose time is 23:59:59 UTC is synced twice.

An event at the time 2024-08-30 23:59:59 UTC will be synced:

With a BigQuery destination, the following query finds these duplicates:

SELECT
  insert_id,
  time,
  _airbyte_extracted_at
FROM
  `mixpanel_export`
WHERE
  DATE(_airbyte_extracted_at) > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
  AND insert_id IN(
  SELECT
    insert_id
  FROM
    `mixpanel_export`
  WHERE
    DATE(_airbyte_extracted_at) > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    AND DATE(time) = DATE_SUB(DATE(_airbyte_extracted_at), INTERVAL 2 DAY) )
ORDER BY
  insert_id;

If you launch the sync multiple times in one day, all of these events are synced again every time.

Relevant log output

No response

descampsk commented 3 months ago

I suppose it is because of this https://github.com/airbytehq/airbyte/blob/477689b5b799ff4fefc823774918ee22648a8387/airbyte-integrations/connectors/source-mixpanel/source_mixpanel/streams/export.py#L189C1-L191C76

And because of this line in the sync logs:

Setting state of SourceMixpanel stream to {'time': '2024-08-30T23:59:59Z'}
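
To illustrate the suspected cause, here is a minimal sketch in plain Python. It is not the connector's code: `parse_utc` and `select_new_events` are hypothetical helpers, and the assumption is that the next sync filters events with an inclusive comparison (`>=`) against the saved state time. Under that assumption, an event stamped exactly at the state time is picked up again, while a strict comparison (`>`) would skip it.

```python
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse a timestamp such as '2024-08-30T23:59:59Z' into an aware datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def select_new_events(events, state_time: str, inclusive: bool):
    """Hypothetical incremental filter: keep events at or after (inclusive)
    or strictly after (exclusive) the saved state time."""
    cursor = parse_utc(state_time)
    if inclusive:
        return [e for e in events if parse_utc(e["time"]) >= cursor]
    return [e for e in events if parse_utc(e["time"]) > cursor]


# Made-up events around the day boundary; "b" sits exactly on the state time.
events = [
    {"insert_id": "a", "time": "2024-08-30T23:59:58Z"},
    {"insert_id": "b", "time": "2024-08-30T23:59:59Z"},
    {"insert_id": "c", "time": "2024-08-31T00:00:00Z"},
]
state = "2024-08-30T23:59:59Z"  # state value taken from the log line above

# Inclusive bound: "b" was already synced with its own day, yet is selected again.
resynced = [e["insert_id"] for e in select_new_events(events, state, inclusive=True)]

# Exclusive bound: only events strictly after the state time are selected.
strict = [e["insert_id"] for e in select_new_events(events, state, inclusive=False)]
```

If this assumption holds, either making the lower bound exclusive or advancing the saved state by one second would avoid re-reading the boundary event. Which option fits the connector depends on how the linked lines build the filter.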