airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com
Other
16.3k stars 4.15k forks source link

Google Sheets Incremental Sync #3936

Closed wallies closed 2 years ago

wallies commented 3 years ago

Tell us about the new connector you’d like to have

Describe the context around this new connector

Describe the alternative you are considering or using

What are you considering doing if you don’t have this integration through Airbyte? Looking at rudderstack

sherifnada commented 3 years ago

@wallies how would you expect incremental sync to work with Sheets? Specifically, would you expect that an entire row is re-synced if any cell changes? I'm assuming new rows would be replicated as well.

wallies commented 3 years ago

@sherifnada We currently sync from datasources not supported into google sheets like survicate. I would expect if any row or cell changes or new rows added that this would sync, instead of syncing the entire sheet which could be thousands of rows.

sherifnada commented 3 years ago

Implementation note: we should use this opportunity to explore moving the connector to use the CDK

annalvova05 commented 3 years ago

Google Sheets API does not have the ability to support full incremental sync.

Incremental SYNC OPTIONS:

Option 1

(API side) sync only new rows: https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get stream_state = {‘row_number’: ‘’}

Pros:

Cons:

Option 2

(API side) sync with filter based on user-specified cursor_field (column): https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/batchGetByDataFilter stream_state = {‘user_specified_cursor_field‘: ‘<max_value_in_cursor_column’}

Pros:

Cons:

Option 3

(Client side) directly compare file revisions: https://developers.google.com/drive/api/v3/reference/revisions/list

stream_state = {‘file_revision’: ‘’}

Pros:

Cons:

sherifnada commented 3 years ago

@wallies do you have any preferences or feedback on the options above? I think we're leaning towards option 1 but it still comes with asterisks i.e: if you rearrange rows for any reason then you might incur data loss.

wallies commented 3 years ago

@sherifnada I was thinking Option 1 would be a good place to start. Although thinking about it more, what I was actually thinking when raising this was more Option 2, as we have partials in google sheets, that get updated, so would be better to sync on a modified date, which is a column.

sherifnada commented 3 years ago

investigation outcome: https://docs.google.com/document/d/1-uOdlcg1WBpfQY31XbGRBasGMZO_B2dZet_H_f-Drv0/edit

midavadim commented 2 years ago

Changed status to "on Hold" becuase the implementation approach is not defined.

sherifnada commented 2 years ago

We're closing this issue as the feasibility study above indicated it's not possible to implement incremental syncs reliably. Given the scale of a typical spreadsheet, full refresh syncs are usually fine to pick up new records/deletes.