airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com
Other
16.26k stars 4.15k forks source link

Schedule full refresh for incremental stream to reduce data inconsistencies over time #26513

Open aballiet opened 1 year ago

aballiet commented 1 year ago

What area the feature impact?

Connectors

Revelant Information

Currently when an incremental stream connection is setup, there is no easy way to launch a full refresh.

Why do we need full refresh for incremental sync

When connector pull a lot of data (millions of records) for a given stream incrementally, we can observe inconsistencies over time.

Products like Fivetran also have some : for example at my company, we noticed wrong data coming from Fivetran for some HubSpot properties for few objects after a long time running incremental updates.

-> That is why Fivetran (or other ELT providers) offer users to run full refresh using re-sync feature at some point to refresh whole data monthly for example.

How this feature could be implemented

Be able to schedule a CRON job to set the streamState to the start_date of the source. -> This way connection would pull all data over again since start_date defined when we setup the source.

As running full-refresh may take a lot of time depending on the amount of data to pull, we might want to continue pulling incremental updates in parallel to maintain up to date source in production -> In that case the feature would be more interesting than the one offered by Fivetran as it pauses incremental sync with their product

aballiet commented 1 year ago

Don't know if this issue still relevant.

People can just define additional sync setup with full-refresh and trigger it from an orchestrator like Dagster / Kestra / Ariflow.

This how we do it now at my company at least

nataliekwong commented 1 year ago

Hi @aballiet - thanks for your feedback!

This is indeed still relevant as we do see users see drift naturally over time. We are planning to explore the impact of this feature for our platform so that using an external orchestrator is not necessary. Your point about continuing to pull incremental updates in parallel is very interesting. We'll explore if that is technically feasible!

sakent commented 5 months ago

Don't know if this issue still relevant.

People can just define additional sync setup with full-refresh and trigger it from an orchestrator like Dagster / Kestra / Ariflow.

This how we do it now at my company at least

@aballiet @nataliekwong following this approach we ended up having twice the no of records. Exact steps I followed:

  1. dropped final table, its corresponding _scd table and raw table.
  2. created a new connection with full refresh and triggered sync
  3. this new connection brought in all the data as expected
  4. however when the old connection with incremental sync run is triggered and completed, twice the no of expected records were present

so I'm wondering If I'm doing something wrong? Also @nataliekwong can you please let me know if this feature going to be released any time soon?

nataliekwong commented 5 months ago

Hey @sakent , we have not prioritized this specific feature yet - we're looking at prioritizing it later this year! I do want to mention that in the short-term, we are working on releasing a new Refresh feature that pulls in all historical data. While still manual, it will allow you to get all historical data from your source without dropping the final data in your destination.

You can follow along on the release of that feature here: https://github.com/airbytehq/airbyte/issues/36295

ehearty commented 3 months ago

I wanted to chime in here to suggest using a configurable lookback window for the incremental sync instead of a full refresh (I use a clone of the hs standard connector that re-pulls the last 24 hours of updates, which seems to do the trick).