Open aballiet opened 1 year ago
Don't know if this issue still relevant.
People can just define additional sync setup with full-refresh and trigger it from an orchestrator like Dagster / Kestra / Ariflow.
This how we do it now at my company at least
Hi @aballiet - thanks for your feedback!
This is indeed still relevant as we do see users see drift naturally over time. We are planning to explore the impact of this feature for our platform so that using an external orchestrator is not necessary. Your point about continuing to pull incremental updates in parallel is very interesting. We'll explore if that is technically feasible!
Don't know if this issue still relevant.
People can just define additional sync setup with full-refresh and trigger it from an orchestrator like Dagster / Kestra / Ariflow.
This how we do it now at my company at least
@aballiet @nataliekwong following this approach we ended up having twice the no of records. Exact steps I followed:
so I'm wondering If I'm doing something wrong? Also @nataliekwong can you please let me know if this feature going to be released any time soon?
Hey @sakent , we have not prioritized this specific feature yet - we're looking at prioritizing it later this year! I do want to mention that in the short-term, we are working on releasing a new Refresh
feature that pulls in all historical data. While still manual, it will allow you to get all historical data from your source without dropping the final data in your destination.
You can follow along on the release of that feature here: https://github.com/airbytehq/airbyte/issues/36295
I wanted to chime in here to suggest using a configurable lookback window for the incremental sync instead of a full refresh (I use a clone of the hs standard connector that re-pulls the last 24 hours of updates, which seems to do the trick).
What area the feature impact?
Connectors
Revelant Information
Currently when an incremental stream connection is setup, there is no easy way to launch a full refresh.
Why do we need full refresh for incremental sync
When connector pull a lot of data (millions of records) for a given stream incrementally, we can observe inconsistencies over time.
Products like Fivetran also have some : for example at my company, we noticed wrong data coming from Fivetran for some HubSpot properties for few objects after a long time running incremental updates.
-> That is why Fivetran (or other ELT providers) offer users to run full refresh using re-sync feature at some point to refresh whole data monthly for example.
How this feature could be implemented
Be able to schedule a CRON job to set the
streamState
to thestart_date
of the source. -> This way connection would pull all data over again sincestart_date
defined when we setup the source.As running full-refresh may take a lot of time depending on the amount of data to pull, we might want to continue pulling incremental updates in parallel to maintain up to date source in production -> In that case the feature would be more interesting than the one offered by Fivetran as it pauses incremental sync with their product