alephdata / aleph

Search and browse documents and data; find the people and companies you look for.
http://docs.aleph.occrp.org
MIT License

Implement rate limits on writes to blob storage #3882

Open sunu opened 3 years ago

sunu commented 3 years ago

While processing large PST files and other archives on Aleph, we sometimes hit the GCS rate limit while writing files to our storage bucket.

We should enforce a configurable rate limit when writing files to the archive.
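For illustration, a minimal sketch of what such a configurable limit could look like, e.g. a simple token bucket around archive writes. The `TokenBucket` class and the 50 writes/second figure are illustrative only, not existing servicelayer code:

```python
# Illustrative token-bucket rate limiter for archive writes.
# The class name and limits are assumptions, not existing Aleph/servicelayer code.
import threading
import time


class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        # Block until a token is available, refilling based on elapsed time.
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(
                    self.capacity,
                    self.tokens + (now - self.updated) * self.rate,
                )
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(1.0 / self.rate)


# e.g. allow at most 50 writes per second to the archive bucket
write_limiter = TokenBucket(rate=50, capacity=50)


def archive_file(blob, path):
    write_limiter.acquire()
    blob.upload_from_filename(path)
```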

Rosencrantz commented 3 years ago

Are there other options here? For example, could we parallelise transactions across multiple accounts, or increase the GCS limit somehow? I'm not suggesting these are feasible or better solutions; I just want to understand what other options we might have.

sunu commented 3 years ago

I read the docs, and it seems GCS adjusts rate limits automatically based on usage (https://cloud.google.com/storage/docs/request-rate#auto-scaling). I think we can get away with retries with exponential backoff instead of a hard rate limit.

stchris commented 1 year ago

The best way forward here is to let the Google Cloud Python SDK handle retries with exponential backoff and jitter (see https://cloud.google.com/python/docs/reference/storage/latest/retry_timeout). We would replace our for loop (https://github.com/alephdata/servicelayer/blob/ceb20c34ce141796c46585247cb88607299f3d1c/servicelayer/archive/gs.py#L97) and linear backoff (https://github.com/alephdata/servicelayer/blob/ceb20c34ce141796c46585247cb88607299f3d1c/servicelayer/archive/gs.py#L109C3-L109C3) with that.

See also https://occrp.sentry.io/issues/4162555422/?project=4504916166967297&query=is%3Aunresolved&referrer=issue-stream&stream_index=0
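For illustration, replacing the hand-rolled loop with the SDK's retry support could look roughly like this. The delay/timeout values, bucket name, and file path below are assumptions to tune, not tested settings:

```python
# Rough sketch: let google-cloud-storage retry uploads with exponential
# backoff (jitter is applied by the SDK). Values below are illustrative.
from google.cloud import storage
from google.cloud.storage.retry import DEFAULT_RETRY

# Start at 1s, double each attempt, cap individual waits at 60s,
# and give up after 10 minutes overall.
upload_retry = DEFAULT_RETRY.with_delay(initial=1.0, multiplier=2.0, maximum=60.0)
upload_retry = upload_retry.with_deadline(600.0)

client = storage.Client()
bucket = client.bucket("aleph-archive")  # hypothetical bucket name
blob = bucket.blob("some/content-hash")  # hypothetical object key

# Passing retry= makes the SDK retry transient errors (429, 5xx, connection
# resets) according to the policy above instead of our own loop.
blob.upload_from_filename("/tmp/unpacked-file", retry=upload_retry)
```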

tillprochaska commented 1 year ago

Helpful context from @brrttwrks: We had similar issues in the past when uploading archive/package files (ZIP archives, Outlook PST files, …). During ingestion, these files are unpacked and uploaded to the storage backend individually.