JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0
4.5k stars, 712 forks

Error reddit scraping #1005

Open MatFrancois opened 1 year ago

MatFrancois commented 1 year ago

Describe the bug

I get the following error when I try to scrape Reddit: snscrape.base.ScraperException: 4 requests to https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000 failed, giving up. I also tried the Python package and a subreddit search, but neither works. I tried from another device and got the same result. Any idea?

How to reproduce

run snscrape -n 100 -vv reddit-search toto

Expected behaviour

Get data?

Screenshots and recordings

No response

Operating system

Kubuntu 22.04

Python version: output of python3 --version

3.8.8

snscrape version: output of snscrape --version

snscrape 0.7.0.20230622

Scraper

reddit-search

How are you using snscrape?

CLI (snscrape ... as a command, e.g. in a terminal)

Backtrace

snscrape.base.ScraperException: 4 requests to https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000 failed, giving up.

Log output

2023-07-05 09:56:59.215  INFO  snscrape.base  Retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000
2023-07-05 09:56:59.216  DEBUG  snscrape.base  ... with headers: {'User-Agent': 'snscrape/0.7.0.20230622'}
2023-07-05 09:56:59.216  DEBUG  snscrape.base  ... with environmentSettings: {'verify': True, 'proxies': OrderedDict(), 'stream': False, 'cert': None}
2023-07-05 09:56:59.217  DEBUG  urllib3.connectionpool  Starting new HTTPS connection (1): api.pushshift.io:443
2023-07-05 09:56:59.285  DEBUG  snscrape.base  Connected to: ('172.67.219.85', 443)
2023-07-05 09:56:59.285  DEBUG  snscrape.base  Connection cipher: ('TLS_AES_256_GCM_SHA384', 'TLSv1.3', 256)
2023-07-05 09:56:59.682  DEBUG  urllib3.connectionpool  https://api.pushshift.io:443 "GET /reddit/search/submission?q=toto&limit=1000 HTTP/1.1" 403 30
2023-07-05 09:56:59.684  INFO  snscrape.base  Retrieved https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: 403
2023-07-05 09:56:59.684  DEBUG  snscrape.base  ... with response headers: {'Date': 'Wed, 05 Jul 2023 07:56:59 GMT', 'Content-Type': 'application/json', 'Content-Length': '30', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'CF-Cache-Status': 'BYPASS', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=y4%2BHEpMTSBJTXPGm4t7j95SB7FGvFVoPOkhN7%2BoPzIMt8rFnrbVatyYC2TKIviyCyOuaYt%2B%2FtN02NPN3AZa%2BCtunP7oatjwYM8k51iOBRkXNrBcTndwFIxVJTfEqILZlwQTp"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '7e1e0df6bb3ad4e5-CDG', 'alt-svc': 'h3=":443"; ma=86400'}
2023-07-05 09:56:59.684  INFO  snscrape.base  Error retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: non-200 status code, retrying
2023-07-05 09:56:59.684  INFO  snscrape.base  Waiting 1 seconds
2023-07-05 09:57:00.687  INFO  snscrape.base  Retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000
2023-07-05 09:57:00.687  DEBUG  snscrape.base  ... with headers: {'User-Agent': 'snscrape/0.7.0.20230622'}
2023-07-05 09:57:00.688  DEBUG  snscrape.base  ... with environmentSettings: {'verify': True, 'proxies': OrderedDict(), 'stream': False, 'cert': None}
2023-07-05 09:57:00.809  DEBUG  urllib3.connectionpool  https://api.pushshift.io:443 "GET /reddit/search/submission?q=toto&limit=1000 HTTP/1.1" 403 30
2023-07-05 09:57:00.810  INFO  snscrape.base  Retrieved https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: 403
2023-07-05 09:57:00.811  DEBUG  snscrape.base  ... with response headers: {'Date': 'Wed, 05 Jul 2023 07:57:00 GMT', 'Content-Type': 'application/json', 'Content-Length': '30', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'CF-Cache-Status': 'BYPASS', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=RRI7V4%2FKORopA2%2FQFWbrrUnkFlm%2Ftd5O9SismrizB9mCRFBeF2tTFM0L%2FhbJTzPPwHYyQiOZ6ZzhjUyUc%2BkSPQla5B1BqN%2BTV3LcE2%2Fv3y9Q%2FYeQHPp6gIGrjqjfaDO8dRC3"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '7e1e0dff78f0d4e5-CDG', 'alt-svc': 'h3=":443"; ma=86400'}
2023-07-05 09:57:00.811  INFO  snscrape.base  Error retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: non-200 status code, retrying
2023-07-05 09:57:00.811  INFO  snscrape.base  Waiting 2 seconds
2023-07-05 09:57:02.815  INFO  snscrape.base  Retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000
2023-07-05 09:57:02.815  DEBUG  snscrape.base  ... with headers: {'User-Agent': 'snscrape/0.7.0.20230622'}
2023-07-05 09:57:02.815  DEBUG  snscrape.base  ... with environmentSettings: {'verify': True, 'proxies': OrderedDict(), 'stream': False, 'cert': None}
2023-07-05 09:57:02.938  DEBUG  urllib3.connectionpool  https://api.pushshift.io:443 "GET /reddit/search/submission?q=toto&limit=1000 HTTP/1.1" 403 30
2023-07-05 09:57:02.938  INFO  snscrape.base  Retrieved https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: 403
2023-07-05 09:57:02.939  DEBUG  snscrape.base  ... with response headers: {'Date': 'Wed, 05 Jul 2023 07:57:02 GMT', 'Content-Type': 'application/json', 'Content-Length': '30', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'CF-Cache-Status': 'BYPASS', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=iwHMuR85a9T6e4AsOzZ3nlYUMI4G2ke71fL7PEhrcNRyy%2BUhlTw9OhJgogU4NAWUKAY1gXhPNQgoSAZSct65B2fLZviQvfVhJwWAS7EWe%2BG0jcjKm4ot9p11cAMDQQQLmJ3P"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '7e1e0e0cc998d4e5-CDG', 'alt-svc': 'h3=":443"; ma=86400'}
2023-07-05 09:57:02.939  INFO  snscrape.base  Error retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: non-200 status code, retrying
2023-07-05 09:57:02.939  INFO  snscrape.base  Waiting 4 seconds
2023-07-05 09:57:06.945  INFO  snscrape.base  Retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000
2023-07-05 09:57:06.945  DEBUG  snscrape.base  ... with headers: {'User-Agent': 'snscrape/0.7.0.20230622'}
2023-07-05 09:57:06.945  DEBUG  snscrape.base  ... with environmentSettings: {'verify': True, 'proxies': OrderedDict(), 'stream': False, 'cert': None}
2023-07-05 09:57:07.066  DEBUG  urllib3.connectionpool  https://api.pushshift.io:443 "GET /reddit/search/submission?q=toto&limit=1000 HTTP/1.1" 403 30
2023-07-05 09:57:07.067  INFO  snscrape.base  Retrieved https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: 403
2023-07-05 09:57:07.067  DEBUG  snscrape.base  ... with response headers: {'Date': 'Wed, 05 Jul 2023 07:57:07 GMT', 'Content-Type': 'application/json', 'Content-Length': '30', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'CF-Cache-Status': 'BYPASS', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=sbdEsUwu7UnHrollCV0oSOt0FUSXPBvUgjqRiWXSV0A%2BNpdcPvsXdaETxaF8GYBdD0k02i5vWa8sK%2FnZnSCNU5T0VPs3FMTx5yhC7E9LkDFzczUz5ZkXmrzoHoN4%2FcQEJYqI"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '7e1e0e2699d8d4e5-CDG', 'alt-svc': 'h3=":443"; ma=86400'}
2023-07-05 09:57:07.067  ERROR  snscrape.base  Error retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: non-200 status code
2023-07-05 09:57:07.067  CRITICAL  snscrape.base  4 requests to https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000 failed, giving up.
2023-07-05 09:57:07.067  CRITICAL  snscrape.base  Errors: non-200 status code, non-200 status code, non-200 status code, non-200 status code
2023-07-05 09:57:07.118  CRITICAL  snscrape._cli  Dumped stack and locals to /tmp/snscrape_locals_j8mi7h4g
Traceback (most recent call last):
  File "/home/matthieu-inspiron/anaconda3/bin/snscrape", line 8, in <module>
    sys.exit(main())
  File "/home/matthieu-inspiron/anaconda3/lib/python3.8/site-packages/snscrape/_cli.py", line 323, in main
    for i, item in enumerate(scraper.get_items(), start = 1):
  File "/home/matthieu-inspiron/anaconda3/lib/python3.8/site-packages/snscrape/modules/reddit.py", line 219, in get_items
    yield from self._iter_api_submissions_and_comments({type(self)._apiField: self._name})
  File "/home/matthieu-inspiron/anaconda3/lib/python3.8/site-packages/snscrape/modules/reddit.py", line 185, in _iter_api_submissions_and_comments
    tipSubmission = next(submissionsIter)
  File "/home/matthieu-inspiron/anaconda3/lib/python3.8/site-packages/snscrape/modules/reddit.py", line 143, in _iter_api
    obj = self._get_api(url, params = params)
  File "/home/matthieu-inspiron/anaconda3/lib/python3.8/site-packages/snscrape/modules/reddit.py", line 94, in _get_api
    r = self._get(url, params = params, headers = self._headers, responseOkCallback = self._handle_rate_limiting)
  File "/home/matthieu-inspiron/anaconda3/lib/python3.8/site-packages/snscrape/base.py", line 275, in _get
    return self._request('GET', *args, **kwargs)
  File "/home/matthieu-inspiron/anaconda3/lib/python3.8/site-packages/snscrape/base.py", line 271, in _request
    raise ScraperException(msg)
snscrape.base.ScraperException: 4 requests to https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000 failed, giving up.
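The log above shows snscrape's retry behaviour: four attempts with exponential backoff (waits of 1, 2, then 4 seconds) before raising ScraperException. A minimal sketch of that pattern, under assumed names (get_with_retries and the fetch callable are hypothetical stand-ins for illustration, not snscrape's actual internals):

```python
import time


class ScraperException(Exception):
    pass


def get_with_retries(fetch, url, attempts=4):
    """Call fetch(url) up to `attempts` times, doubling the wait after each failure."""
    errors = []
    for attempt in range(attempts):
        status = fetch(url)
        if status == 200:
            return status
        errors.append('non-200 status code')
        if attempt < attempts - 1:
            # Exponential backoff: 1, 2, 4 seconds between the four attempts
            time.sleep(2 ** attempt)
    raise ScraperException(f'{attempts} requests to {url} failed, giving up. '
                           f'Errors: {", ".join(errors)}')
```

With a fetch that always returns 403, as Pushshift does here, this gives up after the fourth attempt, matching the log; no amount of retrying helps when the server blocks the request outright.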

Dump of locals

I would prefer to send it privately

Additional context

No response

JustAnotherArchivist commented 1 year ago

Pushshift is effectively dead, so yes, this is expected and can't work anymore. Pushshift was the only way to (a) retrieve useful search results, since Reddit's own search is awful; (b) get all submissions in a subreddit, since Reddit limits that to 1000 results; and (c) get all submissions/comments by a user, due to the same limitation on Reddit.

Potentially, PullPush could serve as a replacement, but since Reddit's API changes are rolling out this month, I'll wait for that to happen before making any changes.

(If the Reddit API itself is sufficient for your purposes, I recommend using PRAW rather than snscrape.)

ihabpalamino commented 1 year ago

@JustAnotherArchivist Is there any news about the Twitter problem? Will there be a solution, or is it dead? And is it possible to get information about the progress of a solution?