JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

Unable to use snscrape on AWS EC2 (Ubuntu 22.04) #973

Closed: milansavani closed this issue 1 year ago

milansavani commented 1 year ago

I would like to introduce myself as a newcomer to the world of Python and web scraping. My objective is to scrape user data and tweets, and I have successfully achieved this on my local system. However, I encountered an error when attempting to execute the same code on an AWS EC2 instance. I have attached the error logs below for your perusal.

Here are the steps I followed:

  1. Installed the package via `pip3 install snscrape`.
  2. Fetched user details with the command `snscrape twitter-user textfiles`.
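The CLI call above can also be expressed through snscrape's Python API. This is only a sketch, assuming snscrape 0.6.x is importable; `fetch_user_tweets` and `max_items` are illustrative names of mine, not part of snscrape:

```python
import itertools

def fetch_user_tweets(username, max_items=20):
    """Yield up to max_items tweets from the given user's timeline."""
    # Imported lazily so snscrape is only required when actually scraping.
    import snscrape.modules.twitter as sntwitter
    scraper = sntwitter.TwitterUserScraper(username)
    yield from itertools.islice(scraper.get_items(), max_items)
```

Note that this hits the same api.twitter.com endpoints as the CLI, so it fails identically from a blocked IP range.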

OS

Description: Ubuntu 22.04.2 LTS

Release: 22.04

Python

Python 3.11.4

snscrape

snscrape 0.6.2.20230320

Error Log


```
2023-06-17 09:32:22.936  CRITICAL  snscrape.base  4 requests to https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&include_ext_has_nft_avatar=1&include_ext_is_blue_verified=1&include_ext_verified_type=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_ext_limited_action_results=false&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_ext_collab_control=true&include_ext_views=true&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&include_ext_sensitive_media_warning=true&include_ext_trusted_friends_metadata=true&send_error_codes=true&simple_quoted_tweet=true&q=from%3Atextfiles&tweet_search_mode=live&count=20&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&include_ext_edit_control=true&ext=mediaStats%2ChighlightedLabel%2ChasNftAvatar%2CvoiceInfo%2Cenrichments%2CsuperFollowMetadata%2CunmentionInfo%2CeditControl%2Ccollab_control%2Cvibe failed, giving up.
2023-06-17 09:32:22.936  CRITICAL  snscrape.base  Errors: blocked (403), blocked (403), blocked (403), blocked (403)
2023-06-17 09:32:22.950  CRITICAL  snscrape._cli  Dumped stack and locals to /tmp/snscrape_locals_y_gnv64i
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/snscrape", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ubuntu/.local/lib/python3.11/site-packages/snscrape/_cli.py", line 320, in main
    for i, item in enumerate(scraper.get_items(), start = 1):
  File "/home/ubuntu/.local/lib/python3.11/site-packages/snscrape/modules/twitter.py", line 1755, in get_items
    yield from super().get_items()
  File "/home/ubuntu/.local/lib/python3.11/site-packages/snscrape/modules/twitter.py", line 1661, in get_items
    for obj in self._iter_api_data('https://api.twitter.com/2/search/adaptive.json', _TwitterAPIType.V2, params, paginationParams, cursor = self._cursor):
  File "/home/ubuntu/.local/lib/python3.11/site-packages/snscrape/modules/twitter.py", line 761, in _iter_api_data
    obj = self._get_api_data(endpoint, apiType, reqParams)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.11/site-packages/snscrape/modules/twitter.py", line 727, in _get_api_data
    r = self._get(endpoint, params = params, headers = self._apiHeaders, responseOkCallback = self._check_api_response)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.11/site-packages/snscrape/base.py", line 251, in _get
    return self._request('GET', *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.11/site-packages/snscrape/base.py", line 247, in _request
    raise ScraperException(msg)
snscrape.base.ScraperException: 4 requests to https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&include_ext_has_nft_avatar=1&include_ext_is_blue_verified=1&include_ext_verified_type=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_ext_limited_action_results=false&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_ext_collab_control=true&include_ext_views=true&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&include_ext_sensitive_media_warning=true&include_ext_trusted_friends_metadata=true&send_error_codes=true&simple_quoted_tweet=true&q=from%3Atextfiles&tweet_search_mode=live&count=20&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&include_ext_edit_control=true&ext=mediaStats%2ChighlightedLabel%2ChasNftAvatar%2CvoiceInfo%2Cenrichments%2CsuperFollowMetadata%2CunmentionInfo%2CeditControl%2Ccollab_control%2Cvibe failed, giving up.
```

JustAnotherArchivist commented 1 year ago

AWS has been banned entirely by Twitter since at least late 2020, probably even longer: #79

milansavani commented 1 year ago

@JustAnotherArchivist I am getting the same error on my local virtual Linux machine (Ubuntu 22.04) running in VMware Fusion on a MacBook Pro (M1).

Also, in the AWS environment I can scrape replies, but not user information or a user's tweets. The same applies to the virtual Linux machine.

JustAnotherArchivist commented 1 year ago

Outside of AWS (and similarly banned platforms), it's #846.

It's possible that the IP ban for AWS doesn't apply to everything. I never investigated that. In any case, it's something Twitter is doing, and there's no workaround for it other than, well, not using AWS to talk to Twitter's servers.

AntoinePaix commented 1 year ago

My remark may be irrelevant, but why does the code still query the adaptive API when it has been restricted since the end of April?

Doesn't the code send requests to the SearchTimeline API now?

milansavani commented 1 year ago

@JustAnotherArchivist When I try to install the development version, it shows an "UNKNOWN" package being installed. How can I use that?

Logs:


```
Collecting git+https://github.com/JustAnotherArchivist/snscrape.git
  Cloning https://github.com/JustAnotherArchivist/snscrape.git to /tmp/pip-req-build-6w2mzmkz
  Running command git clone --filter=blob:none --quiet https://github.com/JustAnotherArchivist/snscrape.git /tmp/pip-req-build-6w2mzmkz
  Resolved https://github.com/JustAnotherArchivist/snscrape.git to commit 0d824ab77334ed4ab6250e5e491171afeccfb298
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: UNKNOWN
  Building wheel for UNKNOWN (pyproject.toml) ... done
  Created wheel for UNKNOWN: filename=UNKNOWN-0.6.2.20230321.dev50+g0d824ab-py3-none-any.whl size=13562 sha256=6120fee5bc936de13bdd4c1cbfdae6c3e3606f4d4d8168cf61f3ccb5252c0b16
  Stored in directory: /tmp/pip-ephem-wheel-cache-fwcz3thi/wheels/05/e9/f7/57056e7c7e44b1feed932fa49fdec9d706c4f563e37160ab74
Successfully built UNKNOWN
Installing collected packages: UNKNOWN
Successfully installed UNKNOWN-0.6.2.20230321.dev50+g0d824ab
```

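For what it's worth, an `UNKNOWN` wheel name usually indicates that pip/setuptools are too old to read the project's pyproject.toml metadata. A possible fix, not verified on this exact setup, is to upgrade the packaging toolchain first and then reinstall:

```shell
# Upgrade the packaging toolchain so pyproject.toml metadata is read correctly
pip3 install --upgrade pip setuptools wheel

# Reinstall the development version; --force-reinstall replaces the UNKNOWN install
pip3 install --force-reinstall git+https://github.com/JustAnotherArchivist/snscrape.git
```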
JustAnotherArchivist commented 1 year ago

@AntoinePaix Because they aren't using the dev version. I intend to finally cut a new release this week, hopefully today.

@milansavani This is also a duplicate (#905). Please search the issue tracker for your problems.