dmarx / psaw

Python Pushshift.io API Wrapper (for comment/submission search)
BSD 2-Clause "Simplified" License

MaxRetries exceeded when retrieving huge data file #35

Closed: plygrnd closed this issue 5 years ago

plygrnd commented 5 years ago

I'm trying to download an entire subreddit, for reasons (mostly populating an Elasticsearch database for analytics). It's approximately 10 years of data. PSAW raises a "Max retries exceeded" exception when I try to pull the whole lot in one go. Is this expected?

2018-10-21 12:25:25,080 - __main__ - INFO - Sleeping 60 seconds to let ES start up.
2018-10-21 12:26:25,118 - __main__ - INFO - Importing subreddit.
2018-10-21 12:26:26,611 - __main__ - DEBUG - Attempting to search the index for the last post
2018-10-21 12:26:27,703 - tinkerbell.subreddit - INFO - Fetching historical submissions from Pushshift.
/usr/lib/python3.6/site-packages/psaw/PushshiftAPI.py:153: UserWarning: Unable to connect to pushshift.io. Retrying after backoff.
  warnings.warn("Unable to connect to pushshift.io. Retrying after backoff.")
Traceback (most recent call last):
  File "runtime/redditstream.py", line 81, in <module>
    t.tinkerbell()
  File "runtime/redditstream.py", line 65, in tinkerbell
    datetime.strftime(datetime.now().date(), '%Y/%m/%d')
  File "/runtime/tinkerbell/subreddit.py", line 149, in fetch_submissions
    submissions = [x for x in submissions]
  File "/runtime/tinkerbell/subreddit.py", line 149, in <listcomp>
    submissions = [x for x in submissions]
  File "/usr/lib/python3.6/site-packages/psaw/PushshiftAPI.py", line 192, in _search
    for response in self._handle_paging(url):
  File "/usr/lib/python3.6/site-packages/psaw/PushshiftAPI.py", line 178, in _handle_paging
    yield self._get(url, self.payload)
  File "/usr/lib/python3.6/site-packages/psaw/PushshiftAPI.py", line 162, in _get
    raise Exception("Unable to connect to pushshift.io. Max retries exceeded.")
Exception: Unable to connect to pushshift.io. Max retries exceeded.
dmarx commented 5 years ago

It looks like you weren't able to connect to the API at all. This probably means the server was down when you got this error. Have you tried again since? If you still have issues, can you post an example that reproduces this error?

plygrnd commented 5 years ago

You are correct; sorry for wasting your time. I added some debugging statements and found that I was passing a null value in as one of the time stamps :(
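
For anyone who lands here with the same symptom, a guard along these lines should catch a missing timestamp before it ever reaches PSAW. This is only a minimal sketch, not the code from my project: `to_epoch` and `build_search_window` are hypothetical helper names, and treating naive datetimes as UTC is an assumption.

```python
from datetime import datetime, timezone


def to_epoch(value, name):
    """Return `value` as integer epoch seconds, or raise if it is missing.

    Hypothetical helper: accepts a datetime, an int, or a numeric string.
    """
    if value is None:
        # Fail loudly here instead of sending a null timestamp to the API.
        raise ValueError(f"'{name}' timestamp is None; refusing to query Pushshift")
    if isinstance(value, datetime):
        # Assumption of this sketch: naive datetimes are treated as UTC.
        if value.tzinfo is None:
            value = value.replace(tzinfo=timezone.utc)
        return int(value.timestamp())
    return int(value)


def build_search_window(after, before):
    """Validate both ends of the time range and return them as a dict."""
    window = {
        "after": to_epoch(after, "after"),
        "before": to_epoch(before, "before"),
    }
    if window["after"] >= window["before"]:
        raise ValueError("'after' must be earlier than 'before'")
    return window
```

The resulting dict can then be unpacked into the search call, e.g. `api.search_submissions(subreddit=..., **window)`, assuming the `after` and `before` parameters are passed through as epoch seconds.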

—durson (mobile)

dmarx commented 5 years ago

No problem, don't hesitate to open a new issue if you encounter problems in the future.