Open sunu opened 3 years ago
Are there other options here? For example, could we parallelise transactions across multiple accounts, or increase the GCS limit somehow? I'm not suggesting these are feasible or better solutions; I just want to understand what other options we might have.
I read the docs, and it seems GCS adjusts rate limits automatically based on usage: https://cloud.google.com/storage/docs/request-rate#auto-scaling. I think we can get away with retries with exponential backoff instead of a hard rate limit.
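For reference, a client-side retry with exponential backoff and full jitter can be sketched in a few lines. Everything here (the function name, the defaults, the retryable exception set) is illustrative, not servicelayer's actual code:

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=32.0,
                       retryable=(IOError,), sleep=time.sleep):
    """Call func(), retrying transient failures with exponential backoff.

    Hypothetical helper: retries up to max_attempts times, doubling the
    backoff cap each attempt and sleeping a random ("full jitter") amount
    below that cap so concurrent clients don't retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

The jitter matters here: without it, many workers throttled at the same instant would all retry at the same instant and hit the limit again.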
The best way forward here is to let the Google Cloud Python SDK handle retries with exponential backoff and jitter (see https://cloud.google.com/python/docs/reference/storage/latest/retry_timeout). We would replace our for loop (https://github.com/alephdata/servicelayer/blob/ceb20c34ce141796c46585247cb88607299f3d1c/servicelayer/archive/gs.py#L97) and linear backoff (https://github.com/alephdata/servicelayer/blob/ceb20c34ce141796c46585247cb88607299f3d1c/servicelayer/archive/gs.py#L109C3-L109C3) with that.
Helpful context from @brrttwrks: We had similar issues in the past when uploading archive/package files (ZIP archives, Outlook PST files, …). During ingestion, these files are unpacked and uploaded to the storage backend individually.
While processing large PST files and other archives on Aleph, we sometimes hit the GCS rate limit while writing files to our storage bucket.
We should enforce a configurable rate limit when writing files to the archive.
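If we do go with a client-side limit, a configurable token bucket is one common shape for it. This is only a sketch; the class name, defaults, and interface are made up for illustration and are not part of servicelayer:

```python
import threading
import time


class TokenBucket:
    """Hypothetical limiter: at most `rate` writes/second on average,
    with bursts of up to `capacity` writes allowed."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = self.clock()
                # Refill proportionally to elapsed time, capped at capacity.
                elapsed = now - self.updated
                self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(1.0 / self.rate)
```

Each archive write would call `acquire()` first; the rate and burst size would come from configuration so operators can match their bucket's observed limits.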