
[source-google-sheets] increase `row_batch_size` or make it as an input parameter #35274


tautvydas-v commented 7 months ago

Connector Name

source-google-sheets

Connector Version

0.3.16

What step the error happened?

During the sync

Relevant information

Currently, source-google-sheets hardcodes the `row_batch_size` variable to a default of 200. We've noticed that this value can easily be increased to process more data per request. The Google Sheets API allows 300 requests per minute per project (or 60 requests per minute per user per project), and the only constraint on a single request is that it must complete within 180 seconds. As it stands, if a Google Sheet has a lot of rows, there is a chance that at some point the exponential backoff gives up and the whole sync silently fails too.
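
For reference, the read pattern visible in the log output below reduces to something like this (a minimal sketch, assuming the `googleapiclient` and `backoff` libraries whose log lines appear below; service construction and credentials are omitted, and the names are illustrative, not the actual connector source):

```python
# Minimal sketch of batched reads with exponential backoff, not the actual
# source-google-sheets code.
import backoff
from googleapiclient.errors import HttpError

ROW_BATCH_SIZE = 200  # the connector's current hardcoded default


@backoff.on_exception(backoff.expo, HttpError, max_tries=9)
def get_values(service, spreadsheet_id: str, sheet: str, start_row: int,
               batch_size: int = ROW_BATCH_SIZE) -> dict:
    # One Sheets API request per batch: a larger batch_size means fewer
    # requests counted against the per-minute quota.
    range_ = f"{sheet}!{start_row}:{start_row + batch_size}"
    return (
        service.spreadsheets()
        .values()
        .get(spreadsheetId=spreadsheet_id, range=range_)
        .execute()
    )
```

With the default batch size, a sheet of tens of thousands of rows takes hundreds of requests, which is what pushes a sync into the per-minute quota in the first place.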

We've tested with this value set to 100, 10,000, and 150,000, and the connector behaves the same way while processing much more data per request. It's also understandable that someone might want this value lower or higher, so the suggestion is to expose it as an input parameter that can be set when configuring the connector, as sketched below. I'm happy to contribute this new feature for source-google-sheets.
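
Concretely, the change could look something like the following (a hypothetical sketch; the property name, bounds, and helper are assumptions, not the actual connector spec or source):

```python
# Hypothetical sketch of exposing row_batch_size as a connector input.
DEFAULT_ROW_BATCH_SIZE = 200

# JSON-schema fragment the connector spec would gain (shown as a Python dict):
ROW_BATCH_SIZE_PROPERTY = {
    "row_batch_size": {
        "type": "integer",
        "title": "Row Batch Size",
        "description": "Number of rows fetched per Google Sheets API request.",
        "default": DEFAULT_ROW_BATCH_SIZE,
        "minimum": 1,
    }
}


def get_row_batch_size(config: dict) -> int:
    # Fall back to today's hardcoded default when the user leaves it unset.
    return int(config.get("row_batch_size", DEFAULT_ROW_BATCH_SIZE))
```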

Relevant log output

2024-01-16 09:48:59 source > Fetching range Source/Medium!30079:30349
2024-01-16 09:48:59 source > Fetching range Source/Medium!30350:30620
2024-01-16 09:48:59 source > Fetching range Source/Medium!30621:30891
2024-01-16 09:49:00 source > Fetching range Source/Medium!30892:31162
2024-01-16 09:49:00 source > Backing off get_values(...) for 0.5s (googleapiclient.errors.HttpError: <httperror>)
2024-01-16 09:49:00 source > Increasing number of records fetching due to rate limits. Current value: 280
2024-01-16 09:49:00 source > Fetching range Source/Medium!30892:31172
2024-01-16 09:49:00 source > Backing off get_values(...) for 1.7s (googleapiclient.errors.HttpError: <httperror>)
2024-01-16 09:49:00 source > Increasing number of records fetching due to rate limits. Current value: 290
2024-01-16 09:49:02 source > Fetching range Source/Medium!30892:31182
2024-01-16 09:49:02 source > Backing off get_values(...) for 3.7s (googleapiclient.errors.HttpError: <httperror>)
2024-01-16 09:49:02 source > Increasing number of records fetching due to rate limits. Current value: 300
2024-01-16 09:49:06 source > Fetching range Source/Medium!30892:31192
2024-01-16 09:49:06 source > Backing off get_values(...) for 0.9s (googleapiclient.errors.HttpError: <httperror>)
2024-01-16 09:49:06 source > Increasing number of records fetching due to rate limits. Current value: 310
2024-01-16 09:49:07 source > Fetching range Source/Medium!30892:31202
2024-01-16 09:49:07 source > Backing off get_values(...) for 10.8s (googleapiclient.errors.HttpError: <httperror>)
2024-01-16 09:49:07 source > Increasing number of records fetching due to rate limits. Current value: 320
2024-01-16 09:49:18 source > Fetching range Source/Medium!30892:31212
2024-01-16 09:49:18 source > Backing off get_values(...) for 8.7s (googleapiclient.errors.HttpError: <httperror>)
2024-01-16 09:49:18 source > Increasing number of records fetching due to rate limits. Current value: 330
2024-01-16 09:49:27 source > Fetching range Source/Medium!30892:31222
2024-01-16 09:49:27 source > Backing off get_values(...) for 45.5s (googleapiclient.errors.HttpError: <httperror>)
2024-01-16 09:49:27 source > Increasing number of records fetching due to rate limits. Current value: 340
2024-01-16 09:49:32 destination > 2024-01-16 09:49:32 INFO i.a.c.i.d.b.BufferManager(printQueueInfo):118 - [ASYNC QUEUE INFO] Global: max: 4.69 GB, allocated: 20 MB (20.0 MB), % used: 0.004163890739672343 | Queue `Sheet1`, num records: 30884, num bytes: 8.62 MB, allocated bytes: 10 MB | State Manager memory usage: Allocated: 10 MB, Used: 0 bytes, percentage Used 0.000000
2024-01-16 09:49:32 destination > 2024-01-16 09:49:32 INFO i.a.c.i.d.FlushWorkers(printWorkerInfo):146 - [ASYNC WORKER INFO] Pool queue size: 0, Active threads: 0
2024-01-16 09:50:13 source > Fetching range Source/Medium!30892:31232
2024-01-16 09:50:13 source > Backing off get_values(...) for 47.2s (googleapiclient.errors.HttpError: <httperror>)
2024-01-16 09:50:13 source > Increasing number of records fetching due to rate limits. Current value: 350
2024-01-16 09:50:32 destination > 2024-01-16 09:50:32 INFO i.a.c.i.d.b.BufferManager(printQueueInfo):118 - [ASYNC QUEUE INFO] Global: max: 4.69 GB, allocated: 20 MB (20.0 MB), % used: 0.004163890739672343 | Queue `Sheet1`, num records: 30884, num bytes: 8.62 MB, allocated bytes: 10 MB | State Manager memory usage: Allocated: 10 MB, Used: 0 bytes, percentage Used 0.000000
2024-01-16 09:50:32 destination > 2024-01-16 09:50:32 INFO i.a.c.i.d.FlushWorkers(printWorkerInfo):146 - [ASYNC WORKER INFO] Pool queue size: 0, Active threads: 0
2024-01-16 09:51:00 source > Fetching range Source/Medium!30892:31242
2024-01-16 09:51:00 source > Giving up get_values(...) after 9 tries (googleapiclient.errors.HttpError: <httperror>)
2024-01-16 09:51:00 source > Stopped syncing process due to rate limits. Rate limit has been reached. Please try later or request a higher quota for your account.

Contribute

anjshrg commented 7 months ago

This is already available in connector version 0.3.15 as an optional field:

[Screenshot: connector setup form showing the optional field]
tautvydas-v commented 7 months ago

Is it though? Am I missing something? We're on version 0.3.16 of the Google Sheets connector, and the only optional field is for converting column names to be SQL-compliant:

[Screenshot: connector setup form showing only the column-name conversion option]

I don't see any changes regarding row batch size in either the 0.3.15 or 0.3.16 source versions, and the source code still has 200 as a hardcoded value. Please correct me if I'm wrong though!

anjshrg commented 7 months ago

My bad! You are right. This is available in version 0.2.17; not sure why it has been removed though.

tautvydas-v commented 7 months ago

Good catch, thanks! I didn't even think to look at the older versions. I'll take a look tomorrow at why it was removed; maybe there was some justification.

marcosmarxm commented 7 months ago

The parameter was removed; in its place, the connector uses a default value and increases it when there is an exception.
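
In other words, the current behavior looks roughly like this (a minimal sketch inferred from the log output above, not the actual implementation):

```python
# Sketch of the adaptive batch-size behavior described above. The step of 10
# is inferred from the "Current value: 280, 290, 300, ..." log lines.
DEFAULT_ROW_BATCH_SIZE = 200
BATCH_SIZE_INCREMENT = 10


class BatchSizeManager:
    """Grows the batch size each time the Sheets API rate-limits a request."""

    def __init__(self) -> None:
        self.row_batch_size = DEFAULT_ROW_BATCH_SIZE

    def on_rate_limit(self) -> None:
        # Larger batches mean fewer requests per minute, so widening the
        # range is a way of easing quota pressure on the retry.
        self.row_batch_size += BATCH_SIZE_INCREMENT
```

The trade-off is that the ramp-up only begins after errors start occurring, which is what motivates starting from a user-chosen value instead.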

tautvydas-v commented 7 months ago

Yes, and what I want to achieve is to make it an input parameter again. At least in our case, we have quite a lot of Google Sheets, and some of them are very large. As described above, we hardcoded `row_batch_size` to a much higher value, and the sync had fewer retries and ran faster, so I'm not sure why this value should stay hardcoded.