Serene-Arc / bulk-downloader-for-reddit

Downloads and archives content from reddit
https://pypi.org/project/bdfr
GNU General Public License v3.0

[FEATURE] Rate limit option for free API users #945

Open ymgenesis opened 8 months ago

ymgenesis commented 8 months ago

Description

An option that automatically limits bdfr's API requests to a set number per minute, to stay within Reddit's newer free API rate limits. I believe these are under 100 requests per minute, or averaged over a 10-minute window for burst usage (the most common figures I've seen on reddit are 60 req/min and/or no more than 600 requests per 10 minutes). A minimal sketch of the idea follows.
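For illustration only, a simple sliding-window limiter along these lines could do it (the `RateLimiter` name and the 60/min figure are my own examples, not anything in bdfr):

```python
import collections
import time


class RateLimiter:
    """Block until a call fits within max_calls per period seconds."""

    def __init__(self, max_calls: int = 60, period: float = 60.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = collections.deque()

    def wait(self) -> None:
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call leaves the window
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())


limiter = RateLimiter(max_calls=60, period=60.0)
# limiter.wait() would be called before each Reddit API request
```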

Currently I sleep for a few minutes between bdfr commands. However, it would be nice for bdfr itself to maximize throughput while staying within the free API limits by tempering its own requests. I understand PRAW is supposed to handle this? Instead of an execution failing with prawcore.exceptions.TooManyRequests: received 429 HTTP response, could bdfr inspect the x-ratelimit-remaining, x-ratelimit-reset, and x-ratelimit-used response headers and wait until it can proceed without failing?

As far as I can tell, the logs don't list the API requests, so it's hard to tell how many are used during a run. I'm also not sure how to check the response headers (x-ratelimit-remaining, x-ratelimit-reset, x-ratelimit-used) when running bdfr, either for testing or for setting my sleep to the time remaining until the rate limit resets.
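As a side note for testing: I believe PRAW exposes the most recently seen rate-limit headers via reddit.auth.limits, so a small script like this might be enough to check them (the credentials are placeholders):

```python
import praw

# Placeholder credentials; substitute your own app's values
reddit = praw.Reddit(
    client_id="...",
    client_secret="...",
    user_agent="bdfr-ratelimit-test",
)

next(reddit.subreddit("all").hot(limit=1))  # any request, so headers get recorded
print(reddit.auth.limits)
# e.g. {'remaining': 997.0, 'reset_timestamp': 1708420799.8, 'used': 3}
```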

Serene-Arc commented 8 months ago

This is already done by the package we use to interface with the Reddit API, praw, at least according to their documentation. We might need to bump the installed version, though.

ymgenesis commented 8 months ago

@Serene-Arc I thought it was, as well. I'm consistently getting 429 responses and the execution fails with an exception. Granted, I am making a lot of calls, but I expected praw to pace itself instead of failing. I'll try updating praw. I even set my sleep to 10 minutes between bdfr executions, each of which downloads 20 submissions. That helps, but it still fails often, and it's too hard to guess the time until the rate limit resets manually.

EDIT: I updated praw to "praw>=7.7.1" in pyproject.toml, but it still refuses to download after too many requests (obvious enough from the log):

[2024-02-20 09:48:46,662 - bdfr.connector - ERROR] - User god failed to be retrieved due to a PRAW exception: received 429 HTTP response
[2024-02-20 09:48:46,666 - bdfr.connector - DEBUG] - Waiting 60 seconds to continue

Additionally, with the updated praw I still get the exception crash instead of a "clean" 429 like the one above (not sure if this is related to my using an older approach to the progress bar). Unlike the other 429, the user is retrieved, but I guess the posts listing hits a 429?

[2024-02-20 10:02:35,014 - bdfr.connector - DEBUG] - Disabling the following modules: 
[2024-02-20 10:02:35,020 - bdfr.connector - DEBUG] - Using authenticated Reddit instance
[2024-02-20 10:02:35,827 - bdfr.connector - DEBUG] - Retrieving submitted posts of user god
[2024-02-20 10:02:39,599 - bdfr.downloader - INFO] - Calculating hashes for 38 files
Traceback (most recent call last):
  File "/Users/me/Documents/GitHub/bulk-downloader-for-reddit/bdfr/downloader.py", line 51, in download
    for submission in tqdm(list(generator), desc=desc, unit="post", leave=False):
                           ^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/praw/models/listing/generator.py", line 63, in __next__
    self._next_batch()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/praw/models/listing/generator.py", line 89, in _next_batch
    self._listing = self._reddit.get(self.url, params=self.params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/praw/util/deprecate_args.py", line 43, in wrapped
    return func(**dict(zip(_old_args, args)), **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/praw/reddit.py", line 712, in get
    return self._objectify_request(method="GET", params=params, path=path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/praw/reddit.py", line 517, in _objectify_request
    self.request(
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/praw/util/deprecate_args.py", line 43, in wrapped
    return func(**dict(zip(_old_args, args)), **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/praw/reddit.py", line 941, in request
    return self._core.request(
           ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/prawcore/sessions.py", line 330, in request
    return self._request_with_retries(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/prawcore/sessions.py", line 266, in _request_with_retries
    raise self.STATUS_EXCEPTIONS[response.status_code](response)
prawcore.exceptions.TooManyRequests: received 429 HTTP response

My thought was: if a 429 is received, parse the x-ratelimit-reset header (or one of the other rate-limit headers) and sleep until the reset is reached, then retry. As it stands, the 60-second wait to continue doesn't retry the previous attempt.
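Something like this sketch is what I have in mind (the fetch callable, retry count, and header interpretation are illustrative, not bdfr internals):

```python
import time

import prawcore


def fetch_with_retry(fetch, max_retries: int = 5):
    """Call fetch(); on a 429, sleep past the advertised reset and retry."""
    for _ in range(max_retries):
        try:
            return fetch()
        except prawcore.exceptions.TooManyRequests as exc:
            # Assuming x-ratelimit-reset holds seconds until the window resets;
            # if it is instead an absolute epoch time, subtract time.time()
            reset = float(exc.response.headers.get("x-ratelimit-reset", 60))
            time.sleep(reset + 1)  # small margin past the reset
    raise RuntimeError("still rate limited after retries")
```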

ymgenesis commented 8 months ago

On second thought, it may have been the way I was using it.

I authenticated with a different user than the account the app client/secret was created on, because I didn't want to verify my email on one account. I've now created the app on the account I authenticate with, and it seems to be going fine so far. praw is still at 7.7.1.

Serene-Arc commented 8 months ago

I might still force an update on the package requirements since it's been a while, but I'm glad it's working. Reddit has been very difficult to work with since the API changes.

ymgenesis commented 8 months ago

Makes sense. I did a rather long run yesterday, a few thousand files over several hours, and didn't get any 429 responses, so that's a plus!

I also ran some simple rate tests using the latest praw. From what I can see, it tries to keep the x-ratelimit-remaining response value in step with the x-ratelimit-reset value (converted from unix epoch time to seconds remaining). In theory the remaining requests should never drop to 0 before the reset timer fires, but it does happen sometimes, as the matching isn't perfect.
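Something along these lines reproduces that test; the subreddit and iteration count are arbitrary, and reset_timestamp comes from PRAW's reddit.auth.limits dict:

```python
import time

import praw

reddit = praw.Reddit(
    client_id="...",  # placeholders, as before
    client_secret="...",
    user_agent="bdfr-ratelimit-test",
)

for _ in range(30):
    next(reddit.subreddit("all").new(limit=1))  # one lightweight request
    limits = reddit.auth.limits
    reset_in = limits["reset_timestamp"] - time.time()  # epoch -> seconds left
    print(f"remaining={limits['remaining']} used={limits['used']} reset_in={reset_in:.0f}s")
    time.sleep(1)
```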

EkriirkE commented 1 month ago

> As it stands, the 60-second wait to continue doesn't retry the previous attempt.

This is my main issue with the rate limiting: it should retry the current attempt.