Closed wallies closed 2 years ago
@wallies how would you expect incremental sync to work with Sheets? Specifically, would you expect that an entire row is re-synced if any cell changes? I'm assuming new rows would be replicated as well.
@sherifnada We currently sync data from unsupported data sources, such as Survicate, into Google Sheets. I would expect that if any row or cell changes, or new rows are added, only those changes would sync, instead of syncing the entire sheet, which could be thousands of rows.
Implementation note: we should use this opportunity to explore moving the connector to the CDK.
Google Sheets API does not have the ability to support full incremental sync.
Incremental sync options:
(API side) sync only new rows:
https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get
stream_state = {‘row_number’: ‘
Pros:
Cons:
(API side) sync with filter based on user-specified cursor_field (column):
https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/batchGetByDataFilter
stream_state = {‘user_specified_cursor_field’: ‘<max_value_in_cursor_column>’}
Pros:
Cons:
(Client side) directly compare file revisions: https://developers.google.com/drive/api/v3/reference/revisions/list
stream_state = {‘file_revision’: ‘
Pros:
Cons:
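The revision-comparison idea in the third option can be sketched as a simple change detector. This is a minimal sketch, not the connector's implementation: it assumes the revisions have already been fetched from the Drive `revisions.list` endpoint linked above as dicts carrying `id` and `modifiedTime`, and `latest_revision` and `has_new_revision` are hypothetical helper names. As written, the check only says that the file changed, not which rows did.

```python
from datetime import datetime
from typing import Dict, List, Optional


def _parse_time(value: str) -> datetime:
    # RFC 3339 timestamps may end in "Z"; normalise for datetime.fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def latest_revision(revisions: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the most recently modified revision, or None if there are none."""
    if not revisions:
        return None
    return max(revisions, key=lambda rev: _parse_time(rev["modifiedTime"]))


def has_new_revision(state: Dict[str, str], revisions: List[Dict[str, str]]) -> bool:
    """True when the newest revision differs from the one stored in the state."""
    latest = latest_revision(revisions)
    return latest is not None and latest["id"] != state.get("file_revision")


# Hypothetical revisions, shaped like entries of a revisions.list response.
revisions = [
    {"id": "1", "modifiedTime": "2021-03-01T10:00:00.000Z"},
    {"id": "2", "modifiedTime": "2021-03-02T09:30:00.000Z"},
]
state = {"file_revision": "1"}

if has_new_revision(state, revisions):
    # The sheet would be re-read here before the state is advanced.
    state = {"file_revision": latest_revision(revisions)["id"]}
```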
@wallies do you have any preferences or feedback on the options above? I think we're leaning towards option 1, but it still comes with asterisks, i.e. if you rearrange rows for any reason, you might incur data loss.
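To make the option 1 caveat concrete, here is a minimal sketch of a row-number cursor. It is an illustration under assumptions, not the connector's code: the sheet is modelled as an in-memory list of rows, `a1_range_after` and `read_new_rows` are hypothetical helpers, the sheet name is assumed to need no quoting, and header rows are ignored. The `row_number` state key follows the option description above.

```python
from typing import Any, Dict, List, Tuple

Row = List[Any]


def a1_range_after(sheet_name: str, state: Dict[str, int], batch_size: int = 1000) -> str:
    """Build an A1 row range that starts right after the last synced row."""
    first = int(state.get("row_number", 0)) + 1
    return f"{sheet_name}!{first}:{first + batch_size - 1}"


def read_new_rows(all_rows: List[Row], state: Dict[str, int]) -> Tuple[List[Row], Dict[str, int]]:
    """Return rows positioned after the cursor, plus the advanced state."""
    last = int(state.get("row_number", 0))
    new_rows = all_rows[last:]
    return new_rows, {"row_number": last + len(new_rows)}


# First sync: both rows are emitted and the cursor moves to row 2.
first_rows, state = read_new_rows([["a"], ["b"]], {})

# Someone inserts a row at the top. Only position 3 is read, so "b" is
# emitted a second time and the genuinely new row "z" is never synced.
second_rows, state = read_new_rows([["z"], ["a"], ["b"]], state)
```

Because the cursor tracks position rather than row identity, any insert, delete, or sort above the cursor shifts which rows count as new.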
@sherifnada I was thinking Option 1 would be a good place to start. Thinking about it more, though, what I actually had in mind when raising this was closer to Option 2: we have partials in Google Sheets that get updated, so it would be better to sync on a modified date, which is a column.
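Syncing on a modified-date column, as described here, could look roughly like the sketch below. It rests on several assumptions: the rows are already parsed into dicts, the cursor column (a hypothetical `modified_at`) holds ISO 8601 date strings that sort lexicographically, and the comparison is done client-side. Whether `batchGetByDataFilter` can do this filtering on the API side is not assumed. `filter_updated_rows` is a hypothetical helper name.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Any]


def filter_updated_rows(
    records: List[Record], cursor_field: str, state: Dict[str, Optional[str]]
) -> Tuple[List[Record], Dict[str, Optional[str]]]:
    """Keep rows whose cursor value is greater than the stored maximum."""
    last_seen = state.get(cursor_field)
    fresh = [r for r in records if last_seen is None or r[cursor_field] > last_seen]
    # If nothing is new, the previous maximum is carried forward unchanged.
    new_max = max((r[cursor_field] for r in fresh), default=last_seen)
    return fresh, {cursor_field: new_max}


records = [
    {"id": 1, "modified_at": "2021-03-01"},
    {"id": 2, "modified_at": "2021-03-05"},
]

# First sync: no state yet, so every row is emitted.
first_batch, state = filter_updated_rows(records, "modified_at", {})

# Row 1 is edited later and its modified date is bumped.
records[0]["modified_at"] = "2021-03-07"
second_batch, state = filter_updated_rows(records, "modified_at", state)
```

This picks up edited rows as well as new ones, but only if the cursor column is reliably updated on every change; deleted rows are still not detected.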
investigation outcome: https://docs.google.com/document/d/1-uOdlcg1WBpfQY31XbGRBasGMZO_B2dZet_H_f-Drv0/edit
Changed status to "on Hold" because the implementation approach is not defined.
We're closing this issue as the feasibility study above indicated it's not possible to implement incremental syncs reliably. Given the scale of a typical spreadsheet, full refresh syncs are usually fine to pick up new records/deletes.
Tell us about the new connector you’d like to have
Which source and which destination? Google Sheets incremental sync is documented as coming soon. Do we have a date for this? With incremental or full sync, is it possible to create new files in S3 based on volume or time windows?
Do you need a specific version of the underlying data source e.g: you specifically need support for an older version of the API or DB? No
Describe the context around this new connector
Which team in your company wants this integration, what for? This helps us understand the use case. Data Integration team
How often do you want to run syncs? Full syncs and incremental syncs happen based on time windows or volume
If this is an API source connector, which entities/endpoints do you need supported?
Describe the alternative you are considering or using
What are you considering doing if you don’t have this integration through Airbyte? Looking at RudderStack