JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0
4.47k stars 708 forks

Twitter searches fail with `blocked (403)` #846

Closed SapiensSuwan closed 1 year ago

SapiensSuwan commented 1 year ago

OS: macOS Ventura 13.3.1

Python Version: Python 3.8.3

snscrape Version: snscrape 0.6.2.20230320

snscrape module: snscrape.modules.twitter -> TwitterSearchScraper

Bug log:

Error retrieving https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&include_ext_has_nft_avatar=1&include_ext_is_blue_verified=1&include_ext_verified_type=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_ext_limited_action_results=false&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_ext_collab_control=true&include_ext_views=true&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&include_ext_sensitive_media_warning=true&include_ext_trusted_friends_metadata=true&send_error_codes=true&simple_quoted_tweet=true&q=bitcoin+since%3A2021-07-01+until%3A2022-12-31&tweet_search_mode=live&count=20&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&include_ext_edit_control=true&ext=mediaStats%2ChighlightedLabel%2ChasNftAvatar%2CvoiceInfo%2Cenrichments%2CsuperFollowMetadata%2CunmentionInfo%2CeditControl%2Ccollab_control%2Cvibe: blocked (403)
4 requests to https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&include_ext_has_nft_avatar=1&include_ext_is_blue_verified=1&include_ext_verified_type=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_ext_limited_action_results=false&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_ext_collab_control=true&include_ext_views=true&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&include_ext_sensitive_media_warning=true&include_ext_trusted_friends_metadata=true&send_error_codes=true&simple_quoted_tweet=true&q=bitcoin+since%3A2021-07-01+until%3A2022-12-31&tweet_search_mode=live&count=20&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&include_ext_edit_control=true&ext=mediaStats%2ChighlightedLabel%2ChasNftAvatar%2CvoiceInfo%2Cenrichments%2CsuperFollowMetadata%2CunmentionInfo%2CeditControl%2Ccollab_control%2Cvibe failed, giving up.
Errors: blocked (403), blocked (403), blocked (403), blocked (403)
Traceback (most recent call last):
  File "bitcoin21-22-01.py", line 11, in <module>
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper('bitcoin since:2021-07-01 until:2022-12-31').get_items()):
  File "/Users/opt/anaconda3/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 1661, in get_items
    for obj in self._iter_api_data('https://api.twitter.com/2/search/adaptive.json', _TwitterAPIType.V2, params, paginationParams, cursor = self._cursor):
  File "/Users/opt/anaconda3/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 761, in _iter_api_data
    obj = self._get_api_data(endpoint, apiType, reqParams)
  File "/Users/opt/anaconda3/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 727, in _get_api_data
    r = self._get(endpoint, params = params, headers = self._apiHeaders, responseOkCallback = self._check_api_response)
  File "/Users/opt/anaconda3/lib/python3.8/site-packages/snscrape/base.py", line 251, in _get
    return self._request('GET', *args, **kwargs)
  File "/Users/opt/anaconda3/lib/python3.8/site-packages/snscrape/base.py", line 247, in _request
    raise ScraperException(msg)
snscrape.base.ScraperException: 4 requests to https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&include_ext_has_nft_avatar=1&include_ext_is_blue_verified=1&include_ext_verified_type=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_ext_limited_action_results=false&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_ext_collab_control=true&include_ext_views=true&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&include_ext_sensitive_media_warning=true&include_ext_trusted_friends_metadata=true&send_error_codes=true&simple_quoted_tweet=true&q=bitcoin+since%3A2021-07-01+until%3A2022-12-31&tweet_search_mode=live&count=20&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&include_ext_edit_control=true&ext=mediaStats%2ChighlightedLabel%2ChasNftAvatar%2CvoiceInfo%2Cenrichments%2CsuperFollowMetadata%2CunmentionInfo%2CeditControl%2Ccollab_control%2Cvibe failed, giving up.

Original Code:

import snscrape.modules.twitter as sntwitter
import pandas as pd

# Setting variables to be used below
# maxTweets = 5000

# Creating list to append tweet data to
tweets_list1 = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('bitcoin since:2021-07-01 until:2022-12-31').get_items()):
    # if i>maxTweets:
    #     break

    tweets_list1.append([tweet.date, tweet.id, tweet.rawContent, tweet.user.username])

    if (i + 1) % 5000 == 0:

        # Creating a dataframe from the tweets list above
        tweet_df1 = pd.DataFrame(tweets_list1, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])
        print(tweet_df1)

        # Export dataframe into a csv; integer division keeps filenames clean
        # (bitcoin-1.csv, bitcoin-2.csv, ...) rather than bitcoin-1.0.csv
        tweet_df1.to_csv(f'bitcoin-{(i + 1) // 5000}.csv', sep=',', index=False)
        tweets_list1 = []

# Note: any final batch of fewer than 5000 tweets is never exported; write
# tweets_list1 out once more after the loop if those rows are needed.

Note: The code was working a couple of days ago, and now it gives me this error again. Searching for "bitcoin" as a keyword on Twitter itself works fine. Apologies in advance if this is a duplicate report; I haven't seen a 403 error reported here.

samtin0x commented 1 year ago

@Hesko123 Sure, to make it easier you can create a search query at nitter.net/search and then copy the URL. For example, a search for "test" since 2023-04-19 results in: https://nitter.net/search?f=tweets&q=test&since=2023-04-19&until=&near= Now just remove the empty parameters (&until=&near=) or keep them; it doesn't matter.
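For reference, a URL like the one above can also be assembled programmatically. This is a small sketch; the parameter names (`f`, `q`, `since`, `until`, `near`) are taken from the example URL, not from any documented Nitter API:

```python
from urllib.parse import urlencode

def nitter_search_url(query, since="", until="", near=""):
    """Build a Nitter search URL like the one copied from nitter.net/search."""
    params = {"f": "tweets", "q": query, "since": since, "until": until, "near": near}
    # Drop empty values, mirroring the manual cleanup described above.
    params = {k: v for k, v in params.items() if v}
    return "https://nitter.net/search?" + urlencode(params)

print(nitter_search_url("test", since="2023-04-19"))
# https://nitter.net/search?f=tweets&q=test&since=2023-04-19
```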

@terrapop That seems to be a limitation of the new GraphQL search endpoint. Nitter does however support lists normally: https://nitter.net/i/lists/1621518339981070336 You can apply a similar change as above to src/routes/list.nim (here it would be resp timeline.toJson on line 43 instead of respList, along with the imports/exports and dumpHook proc)

@zedeus Does Nitter / GraphQL endpoints support filters by timestamp? I want to get the tweets for the last 10 minutes only.

Thanks!

Edit: possible using since:2021-01-06_13:00:00_UTC and until:2021-01-06_14:00:00_UTC
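The timestamp operators mentioned in the edit can be composed into a query string directly, since the scraper passes the query through to the search backend unchanged. A small sketch for the "last 10 minutes" case; the `since:`/`until:` operand format is the one from the edit above, not from official documentation:

```python
from datetime import datetime, timedelta, timezone

def utc_op(dt):
    # Format a datetime as a since:/until: operand, e.g. 2021-01-06_13:00:00_UTC
    return dt.strftime("%Y-%m-%d_%H:%M:%S_UTC")

now = datetime.now(timezone.utc)
start = now - timedelta(minutes=10)
query = f"bitcoin since:{utc_op(start)} until:{utc_op(now)}"
# query can then be passed to a search scraper, e.g. TwitterSearchScraper(query)
```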

Hesko123 commented 1 year ago

@zedeus How can we use the twitter graphql? which one does nitter use to get twitter graphql endpoint.

zedeus commented 1 year ago

Which what? Several endpoints are used, you can see a comprehensive list of available endpoints here https://github.com/fa0311/TwitterInternalAPIDocument/blob/master/docs/markdown/GraphQL.md

trevorhobenshield commented 1 year ago

Interesting. Can anyone provide an alternative, possibly? (Please remove this if it's not allowed in the thread.) I need this for my upcoming research thesis.

https://github.com/trevorhobenshield/twitter-api-client#search

Working search. Just needs email, username, and password for auth.

trevorhobenshield commented 1 year ago

@Hesko123 nothing lasts forever.. @myt-christine, @anastasita308 I'm currently working on a Python implementation of Nitter's GraphQL approach; I don't have it yet.

Let me know if someone is working on this

https://github.com/trevorhobenshield/twitter-api-client#search

nerra0pos commented 1 year ago

Interesting. Can anyone provide an alternative, possibly? (Please remove this if it's not allowed in the thread.) I need this for my upcoming research thesis.

https://github.com/trevorhobenshield/twitter-api-client#search

Working search. Just needs email, username, and password for auth.

Unfortunately, this is not viable for even half-extensive scrape jobs. It will get your account locked or suspended fast.

Factual0367 commented 1 year ago

Are there any solutions being explored at the moment? Or any temp fixes anyone discovered?

I managed to get it to work by using selenium-wire to login to Twitter, conducting a search, getting the headers for the relevant network request and passing it to _TwitterAPIScraper. Not a pretty solution, but it works for now.

Wouze commented 1 year ago

Are there any solutions being explored at the moment? Or any temp fixes anyone discovered?

I managed to get it to work by using selenium-wire to login to Twitter, conducting a search, getting the headers for the relevant network request and passing it to _TwitterAPIScraper. Not a pretty solution, but it works for now.

Could you please share code? Thanks

Factual0367 commented 1 year ago

Are there any solutions being explored at the moment? Or any temp fixes anyone discovered?

I managed to get it to work by using selenium-wire to login to Twitter, conducting a search, getting the headers for the relevant network request and passing it to _TwitterAPIScraper. Not a pretty solution, but it works for now.

Could you please share code? Thanks

Sure. Here is the modified Twitter module; you can just add it to the modules, rebuild the package, and install it. It needs selenium and selenium-wire to work. It was hacked together very quickly, so don't have a heart attack when you see it :) And it does not work with Selenium in headless mode.

twitter.zip
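For anyone curious about the shape of that workaround, here is a rough, hypothetical sketch of the idea described above (function and variable names are illustrative and not taken from the attached module): log in via selenium-wire, trigger a search, then copy the auth headers from the captured adaptive.json request.

```python
# Illustrative sketch only; not the code from twitter.zip.
# from seleniumwire import webdriver   # pip install selenium-wire

AUTH_HEADER_NAMES = ("authorization", "cookie", "x-csrf-token", "x-guest-token")

def extract_api_headers(captured_requests, url_fragment="adaptive.json"):
    """Return auth-related headers from the first captured request whose URL
    contains url_fragment, or None if no matching request was seen."""
    for req in captured_requests:
        if url_fragment in req.url:
            return {k: v for k, v in req.headers.items()
                    if k.lower() in AUTH_HEADER_NAMES}
    return None

# Usage (requires a real browser; reportedly fails in headless mode):
# driver = webdriver.Chrome()
# ... log in to Twitter and run a search so the request gets captured ...
# headers = extract_api_headers(driver.requests)
# scraper = sntwitter.TwitterSearchScraper('bitcoin')
# scraper._apiHeaders.update(headers)   # hand the captured headers to snscrape
```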

zaferk19 commented 1 year ago

Are there any solutions being explored at the moment? Or any temp fixes anyone discovered?

I managed to get it to work by using selenium-wire to login to Twitter, conducting a search, getting the headers for the relevant network request and passing it to _TwitterAPIScraper. Not a pretty solution, but it works for now.

Could you please share code? Thanks

Sure. Here is the modified Twitter module, you can just add it to the modules and rebuild the package and install it. It needs selenium and selenium-wire to work. It is hacked together very quickly so do not have a heart attack when you see it :) And it does not work with selenium headless.

twitter.zip

God bless you, brother!

ghost commented 1 year ago

Gentlemen, it has been a privilege and an honor.

nerra0pos commented 1 year ago

So, search is dead, or is this "graphql" approach a way forward?

Factual0367 commented 1 year ago

So, search is dead, or is this "graphql" a way forward?

Still works if you add the auth headers to the request. See my comment above.

JustAnotherArchivist commented 1 year ago

I will be implementing the GraphQL endpoint (thanks @zedeus!). I don't have an ETA for this yet, but Soon™.

juanluidos commented 1 year ago

I will be implementing the GraphQL endpoint (thanks @zedeus!). I don't have an ETA for this yet, but Soon™.

Nice, Thanks for the hard work!

Hesko123 commented 1 year ago

I will be implementing the GraphQL endpoint (thanks @zedeus!). I don't have an ETA for this yet, but Soon™.

Thank you for your amazing work.

The problem is still that they can basically lock us out of that endpoint too. This is obviously a cat-and-mouse game, but these folks are clearly monitoring this thread for their next move, I'm pretty sure.

Is there any possibility in the future of forking Twitter's API, or rather, of not depending on them at all?

zorrobiwan commented 1 year ago

They probably read us here 🙄😬👀

nv78 commented 1 year ago

Thanks so much!

Gasolgasol commented 1 year ago

Whoever is working to keep snscrape alive, God bless you, May the Force be with you

ccaballes commented 1 year ago

Please let us know if it's working now. It's a great help to my thesis (I will definitely mention you, @JustAnotherArchivist!!)

Isid28 commented 1 year ago

what's the update?

AliHmaou commented 1 year ago

I just had time to teach myself this package over a few days, and I learned a lot! It looks like Elon just ended the era of massive, freely available human data for model training. We can expect other platforms to follow and limit free scraping opportunities, especially if he wins his case in the courts.

Germardies commented 1 year ago

Has the problem been solved? If so, are there any instructions on what to do about the problem? In other words, a proper manual with the relevant commands.

nerra0pos commented 1 year ago

Soon™

Thank you! Is Soon™ before 4/29? Because that is the day when Twitter will sunset all old API plans.

frenchdevartist commented 1 year ago

Would it be so difficult to connect to Twitter before scraping it?

Wouze commented 1 year ago

Would it be so difficult to connect to Twitter before scraping it?

You mean authentication by logging in? That would add complexity, and it is not on the snscrape roadmap; see #270.

AhmetcanFR commented 1 year ago

How much scraping can I do with a Twitter account to get tweets?

polinaice commented 1 year ago

I will be implementing the GraphQL endpoint (thanks @zedeus!). I don't have an ETA for this yet, but Soon™.

Nice, Thanks for the hard work!

I don't understand what ETA means. Can anyone explain?

ErVijayRaghuwanshi commented 1 year ago

I will be implementing the GraphQL endpoint (thanks @zedeus!). I don't have an ETA for this yet, but Soon™.

Nice, Thanks for the hard work!

I don't understand what ETA means. Can anyone explain?

Estimated time of arrival (ETA)

JustAnotherArchivist commented 1 year ago

@terrapop I want to do so this weekend, but no guarantees.

AntoinePaix commented 1 year ago

I wrote some scraping code on my side, and the GraphQL API is easier to scrape than the previous advanced search API.

There are also some additional benefits: bookmarks are included, rate limits are higher, and response times are faster.

nerra0pos commented 1 year ago

I wrote some scraping code on my side and the graphql API is easier to scrape than the previous advanced search API.

There are also some additional features: bookmarks are included, rate-limiting is higher and response time is faster.

Can you also scrape results from Twitter lists with the graphql approach? And filter down with former search parameters?

AhmetcanFR commented 1 year ago

I wrote some scraping code on my side and the graphql API is easier to scrape than the previous advanced search API.

There are also some additional features: bookmarks are included, rate-limiting is higher and response time is faster.

Could you share the code, also is the graphql API gonna stay in forever?

AntoinePaix commented 1 year ago

@terrapop I haven't tested all the filters yet. Some filters like native_videos don't work.

@AhmetcanFR I don't know if this API will be available forever, and no, for some reason I can't post my code. But I know @JustAnotherArchivist will adapt his code successfully.

nerra0pos commented 1 year ago

@terrapop I've not tested yet all filters. Some filters like native_videos don't work.

Understood. But do you know whether filtering by list plus some other standard parameters works with "GraphQL search", i.e. `list:XXXXXXX -filter:retweets` as in the old query syntax?

AntoinePaix commented 1 year ago

@terrapop the `list:` operator does not work :-(

edsu commented 1 year ago

I tried updating the token that snscrape has been using for Twitter, to be one from my active session and that didn't seem to be enough to fix twitter-search. So I guess Twitter must be looking for a valid cookie and/or x-csrf-token now too?

https://github.com/JustAnotherArchivist/snscrape/blob/3dd9c28e31b8babeb2a187fbae994d9717ded168/snscrape/modules/twitter.py#L44
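A minimal sketch of what supplying those credentials would look like, assuming the values are copied from a logged-in browser session (dev tools, Network tab). The header names follow what the Twitter web client sends; the helper function itself is hypothetical, and the placeholder values are obviously not real:

```python
# Sketch: besides the bearer token, supply a session cookie and CSRF token.
def build_auth_headers(bearer_token, csrf_token, cookie_header):
    # The ct0 cookie is expected to carry the same value as x-csrf-token.
    assert f"ct0={csrf_token}" in cookie_header, "ct0 cookie must match the CSRF token"
    return {
        "Authorization": f"Bearer {bearer_token}",
        "x-csrf-token": csrf_token,
        "Cookie": cookie_header,
    }

headers = build_auth_headers("AAAA...", "0123abcd", "auth_token=...; ct0=0123abcd")
# headers could then be merged into the scraper's request headers.
```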

AhmetcanFR commented 1 year ago

@terrapop I've not tested yet all filters. Some filters like native_videos don't work.

@AhmetcanFR I don't know if this API will be available forever and no, for some reason I can't post my code. But I know @JustAnotherArchivist will adapt his code successfully.

Could I at least know which endpoint you use and how you use it?

mox8 commented 1 year ago

@terrapop I've not tested yet all filters. Some filters like native_videos don't work. @AhmetcanFR I don't know if this API will be available forever and no, for some reason I can't post my code. But I know @JustAnotherArchivist will adapt his code successfully.

could i atleast get to know what endpoint you use and how you use it?

there are a few graphql endpoints mentioned in the repo: https://github.com/JustAnotherArchivist/snscrape/blob/3dd9c28e31b8babeb2a187fbae994d9717ded168/snscrape/modules/twitter.py#L1693 https://github.com/JustAnotherArchivist/snscrape/blob/3dd9c28e31b8babeb2a187fbae994d9717ded168/snscrape/modules/twitter.py#L1696 https://github.com/JustAnotherArchivist/snscrape/blob/3dd9c28e31b8babeb2a187fbae994d9717ded168/snscrape/modules/twitter.py#L1939 https://github.com/JustAnotherArchivist/snscrape/blob/3dd9c28e31b8babeb2a187fbae994d9717ded168/snscrape/modules/twitter.py#L2028

Hope that helps.

edsu commented 1 year ago

It does look like Nitter might be using a GraphQL endpoint for search?

https://github.com/zedeus/nitter/blob/master/src/consts.nim#L21

And it's still working: https://nitter.net/search?f=tweets&q=justanotherarchivist&since=&until=&near=

Wouze commented 1 year ago

It does look like Nitter might be using a GraphQL endpoint for search?

https://github.com/zedeus/nitter/blob/master/src/consts.nim#L21

And it's still working: https://nitter.net/search?f=tweets&q=justanotherarchivist&since=&until=&near=

They are, in fact. The author of Nitter commented in this issue and helped people here understand how he is doing it :)

erikcas commented 1 year ago

I tried updating the token that snscrape has been using for Twitter, to be one from my active session and that didn't seem to be enough to fix twitter-search. So I guess Twitter must be looking for a valid cookie and/or x-csrf-token now too?

https://github.com/JustAnotherArchivist/snscrape/blob/3dd9c28e31b8babeb2a187fbae994d9717ded168/snscrape/modules/twitter.py#L44

Yes, a user's bearer token, cookie, and CSRF token were needed to get it up and running. I'm using that as a quick-and-dirty workaround for now, and doing some reading on GraphQL.

mox8 commented 1 year ago

@erikcas Did you have a chance to test its requests per second and stability? I'm fairly sure Twitter has some kind of rate limit for authorized users (violating which leads to a ban).

Nothing more than a guess, but it has to be tested, for sure.

erikcas commented 1 year ago

I tried updating the token that snscrape has been using for Twitter, to be one from my active session and that didn't seem to be enough to fix twitter-search. So I guess Twitter must be looking for a valid cookie and/or x-csrf-token now too? https://github.com/JustAnotherArchivist/snscrape/blob/3dd9c28e31b8babeb2a187fbae994d9717ded168/snscrape/modules/twitter.py#L44

Yes, a users bearer token, cookie and csrf were needed to get and it up and running. Using that as a quick and dirty workaround for now. Doing some reading on the graphql now.

Would you share how you have done it? I'm not as savvy to implement it myself. Cheers

@erikcas Did you have a chance to test its requests per second and stability? I'm fairly sure Twitter has some kind of rate limit for authorized users (violating which leads to a ban).

Nothing more than a guess, but it has to be tested, for sure.

It seems to be heavily rate-limited. That is why I will not share it; I don't have plans to deal with possible bans on a Sunday.

Wouze commented 1 year ago

Is there any hope coming soon?

frenchdevartist commented 1 year ago

Hi, in fact the main problem is that searches via the Twitter API are very limited. For example, it is impossible to filter by likes, views, or location.

hamzaerors commented 1 year ago

@JustAnotherArchivist 4 requests to https://api.twitter.com/2/search/adaptive.json?
I hope it will be resolved soon :(

musicwil commented 1 year ago

Good morning,

I don’t think it’s a matter of resolving an issue. I think Twitter is looking for ways to increase its revenue stream. But then, I may be wrong. I guess time will tell.

DrSocket commented 1 year ago

Are there any solutions being explored at the moment? Or any temp fixes anyone discovered?

I managed to get it to work by using selenium-wire to login to Twitter, conducting a search, getting the headers for the relevant network request and passing it to _TwitterAPIScraper. Not a pretty solution, but it works for now.

Could you please share code? Thanks

Sure. Here is the modified Twitter module, you can just add it to the modules and rebuild the package and install it. It needs selenium and selenium-wire to work. It is hacked together very quickly so do not have a heart attack when you see it :) And it does not work with selenium headless.

twitter.zip

@onurhanak Hi, I'm trying to implement it but getting a 403 error. Did you actually get results with that? I'm not sure if I'm missing something, because I didn't copy-paste and made a few modifications. You also import snscrape.utils, which is not in the official repo, plus many other imports, but the only difference in the files is some things in the _make_card function, which doesn't seem like it should change anything. I have an add_headers function that works and returns the object with headers to self._apiHeaders. Did you modify something else to make it work? Do you have a fork you could link to, to give a better idea? Thanks

doveppp commented 1 year ago

@DrSocket You can clone this repo and try again. I added the following code and it worked well.