airbytehq / airbyte


[source-google-sheets] All records are deleted if the google sheets api returns an unknown error #45106

Open Reidsy opened 1 week ago

Reidsy commented 1 week ago

Connector Name

source-google-sheets

Connector Version

0.7.1

What step the error happened?

During the sync

Relevant information

When syncing with Google Sheets, if the Google Sheets API returns an unknown error (anything other than FORBIDDEN or TOO_MANY_REQUESTS), the error is logged but no exception is raised. As a result, Airbyte treats the sheet as having zero rows of data and deletes all the rows in the destination table.

Problematic code: https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-google-sheets/source_google_sheets/source.py#L261

Expectation: Airbyte should fail the job and try again.

What actually happens: Airbyte updates the table to have zero rows.

Solution: The code should raise an AirbyteTracedException for any API call that results in an unsuccessful status code. https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-google-sheets/source_google_sheets/source.py#L261
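
A minimal sketch of the proposed behaviour, not the connector's actual code. `TracedSyncError` is a hypothetical stand-in for `AirbyteTracedException`, `FakeApiError` is a hypothetical stand-in for `googleapiclient.errors.HttpError`, and `read_rows` and `fetch` are illustrative names. The point is only that every failed API call surfaces as an exception instead of ending the read with zero rows:

```python
from typing import Callable, Iterable, List


class TracedSyncError(Exception):
    """Hypothetical stand-in for AirbyteTracedException."""


class FakeApiError(Exception):
    """Hypothetical stand-in for googleapiclient.errors.HttpError."""

    def __init__(self, status: int, reason: str) -> None:
        super().__init__(f"{status}: {reason}")
        self.status = status
        self.reason = reason


def read_rows(fetch: Callable[[], Iterable[list]]) -> List[list]:
    """Return the fetched rows, failing the sync on any API error.

    Nothing is swallowed here: an unknown error (for example a 500)
    is re-raised, so the caller never sees an empty result that only
    exists because the request failed.
    """
    try:
        return list(fetch())
    except FakeApiError as err:
        raise TracedSyncError(
            f"Google Sheets API request failed with status {err.status}: {err.reason}"
        ) from err


if __name__ == "__main__":
    def failing_fetch() -> Iterable[list]:
        raise FakeApiError(500, "Internal error encountered.")

    try:
        read_rows(failing_fetch)
    except TracedSyncError as exc:
        print(f"sync failed instead of returning 0 rows: {exc}")
```

With the read raising, the job can be marked as failed and retried, and the destination table is left untouched.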

Relevant log output

2024-08-22 17:02:42 source > Backing off get(...) for 0.0s (googleapiclient.errors.HttpError: <httperror>)
2024-08-22 17:02:42 source > Giving up get(...) after 2 tries (googleapiclient.errors.HttpError: <httperror>)
2024-08-22 17:02:42 source > 500: Internal error encountered.. There was an issue with the Google Sheets API. This is usually a temporary issue from Google's side. Please try again. If this issue persists, contact support
2024-08-22 17:02:42 source > Finished syncing spreadsheet <sheet-id-redacted>
2024-08-22 17:02:43 platform > (pod: airbyte / source-google-sheets-read-1887-0-iyrkz) - Closed all resources for pod
2024-08-22 17:02:43 platform > Total records read: 0 (0 bytes)
2024-08-22 17:02:43 platform > Schema validation was performed to a max of 10 records with errors per stream.
2024-08-22 17:02:43 platform > readFromSource: done. (source.isFinished:true, fromSource.isClosed:false)
2024-08-22 17:02:43 platform > processMessage: done. (fromSource.isDone:true, forDest.isClosed:false)
2024-08-22 17:02:43 platform > thread status... heartbeat thread: false , replication thread: true
2024-08-22 17:02:43 platform > writeToDestination: done. (forDest.isDone:true, isDestRunning:true)
2024-08-22 17:02:43 platform > thread status... timeout thread: false , replication thread: true

Reidsy commented 1 week ago

I've created a pull request to address this https://github.com/airbytehq/airbyte/pull/45108

marcosmarxm commented 1 week ago

Thanks a lot @Reidsy! An engineer from the team was assigned to review your contribution. Let me know if you need any assistance.