JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

snscrape not working, giving ScraperException #499

Closed abhalawat closed 2 years ago

abhalawat commented 2 years ago

I am using the snscrape module to extract tweets spanning about 4 years, but it raised the exception below after getting just 3 days of data. Could anyone suggest a solution here?

Traceback (most recent call last):
  File "/home/deq/Desktop/venv/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 880, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/home/deq/Desktop/venv/lib/python3.8/site-packages/prefect/utilities/executors.py", line 468, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "workflow/cypto_tweet.py", line 52, in search_submissions
  File "/home/deq/Desktop/venv/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 680, in get_items
    for obj in self._iter_api_data('https://api.twitter.com/2/search/adaptive.json', params, paginationParams, cursor = self._cursor):
  File "/home/deq/Desktop/venv/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 369, in _iter_api_data
    obj = self._get_api_data(endpoint, reqParams)
  File "/home/deq/Desktop/venv/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 339, in _get_api_data
    r = self._get(endpoint, params = params, headers = self._apiHeaders, responseOkCallback = self._check_api_response)
  File "/home/deq/Desktop/venv/lib/python3.8/site-packages/snscrape/base.py", line 216, in _get
    return self._request('GET', *args, **kwargs)
  File "/home/deq/Desktop/venv/lib/python3.8/site-packages/snscrape/base.py", line 212, in _request
    raise ScraperException(msg)
snscrape.base.ScraperException: 4 requests to https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&send_error_codes=true&simple_quoted_tweets=true&q=Shiba+Inu+since%3A1495564200+until%3A1653589800&tweet_search_mode=live&count=100&query_source=spelling_expansion_revert_click&cursor=scroll%3AthGAVUV0VFVBakgLOth96DjSoWmsC9pfHBorsqEnEV85sWFYCJehgEVVNFUjUBFYjSARUAAA%3D%3D&pc=1&spelling_corrections=1&ext=mediaStats%2ChighlightedLabel failed, giving up.
JustAnotherArchivist commented 2 years ago

Impossible to tell what went wrong just from that exception. There should be at least one other, more useful log message at the ERROR level before the exception. In any case, scraping works fine past that point for me.

You could try resuming the crawl.

CLI:

snscrape twitter-search --cursor 'scroll:thGAVUV0VFVBakgLOth96DjSoWmsC9pfHBorsqEnEV85sWFYCJehgEVVNFUjUBFYjSARUAAA==' 'Shiba Inu since:1495564200 until:1653589800'

Python:

import snscrape.modules.twitter

scraper = snscrape.modules.twitter.TwitterSearchScraper('Shiba Inu since:1495564200 until:1653589800', cursor = 'scroll:thGAVUV0VFVBakgLOth96DjSoWmsC9pfHBorsqEnEV85sWFYCJehgEVVNFUjUBFYjSARUAAA==')

... where the cursor comes from the URL in the error message.
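You'd then iterate the resumed scraper exactly like a fresh one; roughly (the per-tweet handling here is just an example):

for tweet in scraper.get_items():
    print(tweet.date, tweet.url)  # replace with whatever processing you need per tweet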

Side note: since:X and until:X normally expect dates, not timestamps. It looks like timestamps still work there, though, maybe. since_time:X and until_time:X would be the proper operators for timestamps. I wouldn't change the query on a resumption, though, to avoid potentially breaking the cursors (even though it seems to work here).
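For a fresh (non-resumed) scrape, that would look something like:

snscrape twitter-search 'Shiba Inu since_time:1495564200 until_time:1653589800'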

abhalawat commented 2 years ago

Okay, I will try using the cursor now. And since I used timestamps with since and until, I am converting them to since_time and until_time.

abhalawat commented 2 years ago

I encountered the same error again:

Task 'Search_Submissions[0]': Exception encountered during task execution!
Traceback (most recent call last):
  File "/home/deq/Desktop/twint/venv/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 880, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/home/deq/Desktop/twint/venv/lib/python3.8/site-packages/prefect/utilities/executors.py", line 468, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "workflow/cypto_tweet.py", line 52, in search_submissions
  File "/home/deq/Desktop/twint/venv/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 680, in get_items
    for obj in self._iter_api_data('https://api.twitter.com/2/search/adaptive.json', params, paginationParams, cursor = self._cursor):
  File "/home/deq/Desktop/twint/venv/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 369, in _iter_api_data
    obj = self._get_api_data(endpoint, reqParams)
  File "/home/deq/Desktop/twint/venv/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 339, in _get_api_data
    r = self._get(endpoint, params = params, headers = self._apiHeaders, responseOkCallback = self._check_api_response)
  File "/home/deq/Desktop/twint/venv/lib/python3.8/site-packages/snscrape/base.py", line 216, in _get
    return self._request('GET', *args, **kwargs)
  File "/home/deq/Desktop/twint/venv/lib/python3.8/site-packages/snscrape/base.py", line 212, in _request
    raise ScraperException(msg)
snscrape.base.ScraperException: 4 requests to https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&send_error_codes=true&simple_quoted_tweets=true&q=btc+since_time%3A1495564200+until_time%3A1652898600&tweet_search_mode=live&count=100&query_source=spelling_expansion_revert_click&cursor=scroll%3AthGAVUV0VFVBaCwNOF3ajzjCoWmsC9pfHBorsqEnEVt6YeFYCJehgEVVNFUjUBFeTbARUAAA%3D%3D&pc=1&spelling_corrections=1&ext=mediaStats%2ChighlightedLabel failed, giving up.
JustAnotherArchivist commented 2 years ago

If that's the whole output, something is swallowing the log messages. As mentioned above, there should be at least one ERROR message with more information on what goes wrong. (Log messages go through the standard logging system.)
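As a minimal sketch, configuring the root logger before starting the scrape should surface them (assuming Prefect isn't reconfiguring logging underneath you):

import logging
logging.basicConfig(format = '%(asctime)s %(levelname)s %(name)s %(message)s', level = logging.DEBUG)  # ERROR would suffice for the failure reason; DEBUG also shows the individual request attempts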

Also, what snscrape version are you using? There was a bug in the past where scrapes would crash after exactly 3 hours. This was fixed back in snscrape 0.4.0 though.
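You can check with, for example:

pip show snscrape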

abhalawat commented 2 years ago

The version I am using right now is snscrape==0.4.3.20220106. And as I was running the code on Prefect, it is Prefect that is emitting the log messages. I will try the same without Prefect and then add the error here.