MuckRock / documentcloud

DocumentCloud's back end source code - Please report bugs, issues and feature requests to info@documentcloud.org
https://www.documentcloud.org
GNU Affero General Public License v3.0

503: Request slowdown using Python wrapper #222

Open duckduckgrayduck opened 3 months ago

duckduckgrayduck commented 3 months ago

I'm receiving the following error:

documentcloud.exceptions.APIError: 503 - <?xml version="1.0" encoding="UTF-8"?><Error><Code>SlowDown</Code><Message>Please reduce your request rate.</Message><RequestId>48HHKKE4ZVM2HX5E</RequestId><HostId>3cDjq5OKjt3YZM91t21VRKRbDTr89/lUkzrJMSphkBjge369inHIAVDNiNzuGnEiDtsQGUy+RQw=</HostId></Error>

https://github.com/MuckRock/documentcloud-regex-addon/actions/runs/9066429901/job/24909314442

This happens when calling page_text = document.get_page_text(page_number) in the Regex Extractor Add-On.

mitchelljkotler commented 3 months ago

That is coming from S3 directly. I believe the rate limits for S3 are not concrete, and they rate limit you as they see fit. We could put some exponential backoff into the Python library.
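
For illustration, here is a minimal sketch of what exponential backoff with jitter could look like. It is not the library's actual implementation: `SlowDownError` and `with_backoff` are hypothetical names, and the retry count and delays are assumed values chosen only for the example.

```python
import random
import time


class SlowDownError(Exception):
    """Hypothetical stand-in for a 503 SlowDown response from storage."""


def with_backoff(func, max_retries=5, base_delay=0.5, max_delay=30.0,
                 sleep=time.sleep):
    """Call func(), retrying on SlowDownError with exponential backoff.

    The delay ceiling doubles on every failed attempt (capped at max_delay),
    and the actual wait is drawn uniformly from [0, ceiling] ("full jitter")
    so that parallel clients do not retry in lockstep.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except SlowDownError:
            if attempt == max_retries:
                # Out of retries: let the caller see the error.
                raise
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

In the wrapper itself, the same idea would presumably mean catching documentcloud.exceptions.APIError, retrying only when the status is 503, and re-raising everything else unchanged.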

eyeseast commented 3 months ago

It's doing this one page at a time instead of getting all the text at once: https://github.com/MuckRock/documentcloud-regex-addon/blob/main/main.py#L34
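
A rough sketch of the alternative: fetch the text for the whole document in one request, then run the regex over the pages in memory. The payload shape used here (`{"pages": [{"page": n, "contents": "..."}]}`) is an assumption about what a per-document text response looks like, and `pages_from_payload` and `search_pages` are hypothetical helpers, not part of the Python wrapper.

```python
import re


def pages_from_payload(payload):
    """Convert an assumed {"pages": [{"page": n, "contents": "..."}]} payload
    into a list of (page_number, text) pairs."""
    return [(page["page"], page["contents"]) for page in payload["pages"]]


def search_pages(pages, pattern):
    """Yield (page_number, matched_text) for every regex match.

    `pages` is an iterable of (page_number, text) pairs that are already in
    memory, so no network request is made per page.
    """
    regex = re.compile(pattern)
    for page_number, text in pages:
        for match in regex.finditer(text):
            yield page_number, match.group(0)
```

With this structure the number of requests per document stays constant no matter how many pages it has, which should make a SlowDown response much less likely.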