JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0
4.39k stars · 702 forks

Twitter scraping fails with error code 403 #900

Closed · fehimornek closed this issue 1 year ago

fehimornek commented 1 year ago

Describe the bug

I am getting a 403 blocked error while trying to scrape tweets. As recommended in #834, I used the command

pip3 install --upgrade git+https://github.com/JustAnotherArchivist/snscrape.git

This solved the issue on my Windows computer, but not in my Ubuntu 22.04 WSL environment.

How to reproduce

I am using Windows Subsystem for Linux to run Ubuntu on my Windows 10 PC. This is the code to reproduce the issue:

import snscrape.modules.twitter as sntwitter
import pandas as pd
from datetime import datetime, timedelta

date = datetime.now()
previous_day = date - timedelta(days=1)

date = date.strftime("%d-%m-%Y")
previous_day = previous_day.strftime("%d-%m-%Y")

query = f"since:{previous_day} until:{date} (#NASDAQ OR #Stocks)"
tweets = []
i = 0
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    print(i)
    tweets.append([tweet.date, tweet.user, tweet.rawContent, tweet.likeCount, tweet.hashtags])
    i += 1
tweet_df = pd.DataFrame(tweets, columns=["Date", "User", "Content", "Likes", "Hashtags"])
tweet_df.to_csv(f"/mnt/d/coding/scripts/python_data_science/sentiment_analysis/tweets/{previous_day}.csv")
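One side note on the query above, unrelated to the 403 itself: Twitter's `since:`/`until:` search operators expect ISO-style `YYYY-MM-DD` dates, so the `%d-%m-%Y` format may not filter the date range as intended. A minimal sketch of the date handling, using a fixed date so the output is reproducible:

```python
from datetime import datetime, timedelta

# Fixed date so the example is reproducible; in practice use datetime.now().
date = datetime(2023, 5, 16)
previous_day = date - timedelta(days=1)

# Twitter's since:/until: operators expect ISO-style YYYY-MM-DD dates.
query = f"since:{previous_day:%Y-%m-%d} until:{date:%Y-%m-%d} (#NASDAQ OR #Stocks)"
print(query)  # → since:2023-05-15 until:2023-05-16 (#NASDAQ OR #Stocks)
```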

Expected behaviour

The tweets should be scraped.

Screenshots and recordings

No response

Operating system

Ubuntu 22.04

Python version: output of python3 --version

Python 3.10.6

snscrape version: output of snscrape --version

snscrape 0.6.2.20230320

Scraper

TwitterSearchScraper

How are you using snscrape?

Module (import snscrape.modules.something in Python code)

Backtrace

Error retrieving https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&include_ext_has_nft_avatar=1&include_ext_is_blue_verified=1&include_ext_verified_type=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_ext_limited_action_results=false&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_ext_collab_control=true&include_ext_views=true&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&include_ext_sensitive_media_warning=true&include_ext_trusted_friends_metadata=true&send_error_codes=true&simple_quoted_tweet=true&q=since%3A15-05-2023+until%3A16-05-2023+%28%23sp500+OR+%23dow+OR+%23NASDAQ+OR+%23Stocks%29&tweet_search_mode=live&count=20&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&include_ext_edit_control=true&ext=mediaStats%2ChighlightedLabel%2ChasNftAvatar%2CvoiceInfo%2Cenrichments%2CsuperFollowMetadata%2CunmentionInfo%2CeditControl%2Ccollab_control%2Cvibe: blocked (403)
4 requests to https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&include_ext_has_nft_avatar=1&include_ext_is_blue_verified=1&include_ext_verified_type=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_ext_limited_action_results=false&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_ext_collab_control=true&include_ext_views=true&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&include_ext_sensitive_media_warning=true&include_ext_trusted_friends_metadata=true&send_error_codes=true&simple_quoted_tweet=true&q=since%3A15-05-2023+until%3A16-05-2023+%28%23sp500+OR+%23dow+OR+%23NASDAQ+OR+%23Stocks%29&tweet_search_mode=live&count=20&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&include_ext_edit_control=true&ext=mediaStats%2ChighlightedLabel%2ChasNftAvatar%2CvoiceInfo%2Cenrichments%2CsuperFollowMetadata%2CunmentionInfo%2CeditControl%2Ccollab_control%2Cvibe failed, giving up.
Errors: blocked (403), blocked (403), blocked (403), blocked (403)
Traceback (most recent call last):
  File "/home/development/pull_tweet.py", line 13, in <module>
    for tweet in sntwitter.TwitterSearchScraper(query).get_items():
  File "/usr/local/lib/python3.10/dist-packages/snscrape/modules/twitter.py", line 1661, in get_items
    for obj in self._iter_api_data('https://api.twitter.com/2/search/adaptive.json', _TwitterAPIType.V2, params, paginationParams, cursor = self._cursor):
  File "/usr/local/lib/python3.10/dist-packages/snscrape/modules/twitter.py", line 761, in _iter_api_data
    obj = self._get_api_data(endpoint, apiType, reqParams)
  File "/usr/local/lib/python3.10/dist-packages/snscrape/modules/twitter.py", line 727, in _get_api_data
    r = self._get(endpoint, params = params, headers = self._apiHeaders, responseOkCallback = self._check_api_response)
  File "/usr/local/lib/python3.10/dist-packages/snscrape/base.py", line 251, in _get
    return self._request('GET', *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/snscrape/base.py", line 247, in _request
    raise ScraperException(msg)
snscrape.base.ScraperException: 4 requests to https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&include_ext_has_nft_avatar=1&include_ext_is_blue_verified=1&include_ext_verified_type=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_ext_limited_action_results=false&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_ext_collab_control=true&include_ext_views=true&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&include_ext_sensitive_media_warning=true&include_ext_trusted_friends_metadata=true&send_error_codes=true&simple_quoted_tweet=true&q=since%3A15-05-2023+until%3A16-05-2023+%28%23sp500+OR+%23dow+OR+%23NASDAQ+OR+%23Stocks%29&tweet_search_mode=live&count=20&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&include_ext_edit_control=true&ext=mediaStats%2ChighlightedLabel%2ChasNftAvatar%2CvoiceInfo%2Cenrichments%2CsuperFollowMetadata%2CunmentionInfo%2CeditControl%2Ccollab_control%2Cvibe failed, giving up.

Log output

No response

Dump of locals

No response

Additional context

No response

ojansan commented 1 year ago

Same here.

VictorHan97 commented 1 year ago

Same; I wonder if this may be related to Twitter's update.

AntoinePaix commented 1 year ago

@edilgin @DanRev111212 @ojansan @VictorHan97

You are scraping the old Twitter API. You need to update the package so it scrapes the GraphQL search endpoint.

fehimornek commented 1 year ago

How can this be achieved? @AntoinePaix

AntoinePaix commented 1 year ago

@edilgin pip3 install --upgrade git+https://github.com/JustAnotherArchivist/snscrape.git

ojansan commented 1 year ago

@AntoinePaix Thank you, it already works locally, but how can I do this on GCP? I tried adding snscrape==0.6.2.20230321.dev13+g786815d to my requirements.txt but got an error while building. Thanks.
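A note on the requirements.txt question above: development versions like 0.6.2.20230321.dev13+g786815d are not published on PyPI, so a plain `==` pin cannot resolve. pip does, however, accept a direct Git reference in requirements.txt. A minimal sketch (the exact line depends on whether your build environment has Git and network access):

```
snscrape @ git+https://github.com/JustAnotherArchivist/snscrape.git
```

To get reproducible builds, you can pin a specific commit by appending `@<commit-hash>` to the repository URL.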

fehimornek commented 1 year ago

@AntoinePaix If you look at the "Describe the bug" section, you'll see I already said I tried that solution.

DanRev111212 commented 1 year ago

@edilgin pip3 install --upgrade git+https://github.com/JustAnotherArchivist/snscrape.git

Thank you so much. It's now working locally. My life is saved. I just hope it keeps working for the next 6 months.

AntoinePaix commented 1 year ago

@edilgin Sorry, I didn't pay attention to your first message.

Remove and reinstall the package. If you are using a virtual environment, delete it and make a new one.
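The advice above can be sketched as a command sequence; the virtual-environment path here is just an example, and the install step needs network access:

```shell
# Create a brand-new virtual environment (example location via mktemp).
VENV_DIR="$(mktemp -d)/snscrape-env"
python3 -m venv "$VENV_DIR"
. "$VENV_DIR/bin/activate"

# Install the latest development code straight from the repository.
# This step requires network access; offline it only prints a message.
pip install --upgrade "git+https://github.com/JustAnotherArchivist/snscrape.git" \
  || echo "install failed (network required)"
```

Deleting and recreating the environment guarantees no stale copy of the old snscrape release is left on the import path.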

JustAnotherArchivist commented 1 year ago

#884

ehsong commented 1 year ago

Was anyone able to scrape Twitter successfully using TwitterSearchScraper? @edilgin @DanRev111212 @ojansan @VictorHan97 @AntoinePaix I also get the 403 error and timeout errors after updating the package from git.