JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0
4.38k stars 702 forks

Twitter searches fail with `blocked (403)` #846

Closed SapiensSuwan closed 1 year ago

SapiensSuwan commented 1 year ago

OS: macOS Ventura 13.3.1

Python Version: Python 3.8.3

snscrape Version: snscrape 0.6.2.20230320

snscrape module: snscrape.modules.twitter -> TwitterSearchScraper

Bug log:

Error retrieving https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&include_ext_has_nft_avatar=1&include_ext_is_blue_verified=1&include_ext_verified_type=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_ext_limited_action_results=false&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_ext_collab_control=true&include_ext_views=true&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&include_ext_sensitive_media_warning=true&include_ext_trusted_friends_metadata=true&send_error_codes=true&simple_quoted_tweet=true&q=bitcoin+since%3A2021-07-01+until%3A2022-12-31&tweet_search_mode=live&count=20&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&include_ext_edit_control=true&ext=mediaStats%2ChighlightedLabel%2ChasNftAvatar%2CvoiceInfo%2Cenrichments%2CsuperFollowMetadata%2CunmentionInfo%2CeditControl%2Ccollab_control%2Cvibe: blocked (403)
4 requests to https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&include_ext_has_nft_avatar=1&include_ext_is_blue_verified=1&include_ext_verified_type=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_ext_limited_action_results=false&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_ext_collab_control=true&include_ext_views=true&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&include_ext_sensitive_media_warning=true&include_ext_trusted_friends_metadata=true&send_error_codes=true&simple_quoted_tweet=true&q=bitcoin+since%3A2021-07-01+until%3A2022-12-31&tweet_search_mode=live&count=20&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&include_ext_edit_control=true&ext=mediaStats%2ChighlightedLabel%2ChasNftAvatar%2CvoiceInfo%2Cenrichments%2CsuperFollowMetadata%2CunmentionInfo%2CeditControl%2Ccollab_control%2Cvibe failed, giving up.
Errors: blocked (403), blocked (403), blocked (403), blocked (403)
Traceback (most recent call last):
  File "bitcoin21-22-01.py", line 11, in <module>
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper('bitcoin since:2021-07-01 until:2022-12-31').get_items()):
  File "/Users/opt/anaconda3/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 1661, in get_items
    for obj in self._iter_api_data('https://api.twitter.com/2/search/adaptive.json', _TwitterAPIType.V2, params, paginationParams, cursor = self._cursor):
  File "/Users/opt/anaconda3/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 761, in _iter_api_data
    obj = self._get_api_data(endpoint, apiType, reqParams)
  File "/Users/opt/anaconda3/lib/python3.8/site-packages/snscrape/modules/twitter.py", line 727, in _get_api_data
    r = self._get(endpoint, params = params, headers = self._apiHeaders, responseOkCallback = self._check_api_response)
  File "/Users/opt/anaconda3/lib/python3.8/site-packages/snscrape/base.py", line 251, in _get
    return self._request('GET', *args, **kwargs)
  File "/Users/opt/anaconda3/lib/python3.8/site-packages/snscrape/base.py", line 247, in _request
    raise ScraperException(msg)
snscrape.base.ScraperException: 4 requests to https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&include_ext_has_nft_avatar=1&include_ext_is_blue_verified=1&include_ext_verified_type=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_ext_limited_action_results=false&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_ext_collab_control=true&include_ext_views=true&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&include_ext_sensitive_media_warning=true&include_ext_trusted_friends_metadata=true&send_error_codes=true&simple_quoted_tweet=true&q=bitcoin+since%3A2021-07-01+until%3A2022-12-31&tweet_search_mode=live&count=20&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&include_ext_edit_control=true&ext=mediaStats%2ChighlightedLabel%2ChasNftAvatar%2CvoiceInfo%2Cenrichments%2CsuperFollowMetadata%2CunmentionInfo%2CeditControl%2Ccollab_control%2Cvibe failed, giving up.

Original Code:

import snscrape.modules.twitter as sntwitter
import pandas as pd

# Setting variables to be used below
# maxTweets = 5000

# Creating list to append tweet data to
tweets_list1 = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('bitcoin since:2021-07-01 until:2022-12-31').get_items()):
    # if i>maxTweets:
    #     break

    tweets_list1.append([tweet.date, tweet.id, tweet.rawContent, tweet.user.username])

    if (i + 1) % 5000 == 0:

        # Creating a dataframe from the tweets list above
        tweet_df1 = pd.DataFrame(tweets_list1, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])
        print(tweet_df1)

        # Export dataframe into a csv; integer division keeps the filename
        # suffix at 1, 2, ... rather than 1.0, 2.0, ...
        tweet_df1.to_csv(f'bitcoin-{(i+1)//5000}.csv', sep=',', index=False)
        tweets_list1 = []

Note: The code was working until a couple of days ago, and now it gives me this error. Searching for "bitcoin" as a keyword on Twitter itself still works fine. Apologies in advance if this duplicates an existing report; I haven't seen this 403 error here before.

JustAnotherArchivist commented 1 year ago

Yeah, the exact error is indeed new, but this is effectively #834. However, the 403 affects all searches, not only the 'latest' one, so I'll leave it open anyway.

emmanuelva commented 1 year ago

The search page on Twitter seems to be gone.

nicolegorton commented 1 year ago

I too am having the same error on two machines as of ~1 hour ago. Code worked perfectly 3 hours ago. Any ideas?

JustAnotherArchivist commented 1 year ago

There is no need to repeat 'me too!!1!'. Yes, everyone is affected at the moment.

Wouze commented 1 year ago

The search page on Twitter seems to be gone.

Yes, and I've seen tweets from Elon about how Twitter data is getting scraped at scale; I don't know if there is any relation: https://twitter.com/elonmusk/status/1648793947806871556?s=20

Wouze commented 1 year ago

So basically Twitter search is no longer available except to logged-in users. Is there a way of doing the same searches while logged in, maybe by providing tokens/cookies/email and password?

JustAnotherArchivist commented 1 year ago

@Wouze #270

AlekseiShkurin commented 1 year ago

Are there any solutions being explored at the moment, or any temporary fixes anyone has discovered?

cunninghamx commented 1 year ago

F in chat for Twitter search; it now redirects to login. @JustAnotherArchivist, I hope we as a community find another way in, but thank you so much for the hard work you have put into this.

Fa5g commented 1 year ago

I think it would be a really good idea to make snscrape work once the user logs in, and to extend it with auto-posting and follow/unfollow features while respecting Twitter's limits. That way the tool would not be forgotten.

TheTechRobo commented 1 year ago

@Fa5g As several people have mentioned in this thread, snscrape can't and won't support authentication, although you are welcome to make a fork of it: #270

dengkefeng commented 1 year ago

There is no need to repeat 'me too!!1!'. Yes, everyone is affected at the moment.

Could snscrape support scraping the page below, which is still publicly open? Thanks! https://twitter.com/elonmusk

JustAnotherArchivist commented 1 year ago

@dengkefeng Already supported by the twitter-profile scraper.

Hesko123 commented 1 year ago

It seems that it's now coming to an end. Fck Elon Musk, ngl, he wrecked my journey.

https://nitter.net/search?f=tweets&q=testing&since=&until=&near=

Nitter seems to be working for searches! (Maybe hope for a future workaround.)

dengkefeng commented 1 year ago

@dengkefeng Already supported by the twitter-profile scraper.

The profile page also includes the user's latest tweets; does the twitter-profile scraper support getting these tweets? Thanks! I will also check the code.

dengkefeng commented 1 year ago

@dengkefeng Already supported by the twitter-profile scraper.

The profile page also includes the user's latest tweets; does the twitter-profile scraper support getting these tweets?

Supported and it is working.
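
For anyone who wants a concrete example, a minimal sketch along the lines of what worked for me (the profile name is just an illustration; this only works while profile pages remain publicly accessible):

import snscrape.modules.twitter as sntwitter

# Iterate over the latest tweets shown on a public profile page.
for i, tweet in enumerate(sntwitter.TwitterProfileScraper('elonmusk').get_items()):
    print(tweet.date, tweet.id, tweet.user.username, tweet.rawContent[:80])
    if i >= 9:  # stop after 10 tweets
        break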

ccaballes commented 1 year ago

Are we able to scrape now?

ccaballes commented 1 year ago

Interesting. Can anyone suggest an alternative (please remove this if not allowed in the thread)? I need this for my upcoming research thesis.

iGrumpyMonk commented 1 year ago

Are we able to scrape now?

You can only search tweets if you log in to Twitter. So snscrape no longer works for Twitter search unless it adds authentication, which is not on the roadmap.

Are there any alternatives? I'm using snscrape in my graduation project and I have 3 weeks left, so I need a functional alternative now.

We are scraping tweets by keyword, counting tweets per keyword, and then performing sentiment analysis with NLTK.

Isid28 commented 1 year ago

I'm also looking for an alternative for my thesis; if anyone knows of one, please help us out :-)

katesanalyst commented 1 year ago

I'm also looking for an alternative for my thesis; if anyone knows of one, please help us out :-)

Use nitter.net; you have to redo your code, but it surely is an alternative.

ccaballes commented 1 year ago

I'm also looking for an alternative for my thesis; if anyone knows of one, please help us out :-)

Use nitter.net; you have to redo your code, but it surely is an alternative.

Can you shed some light on this? How do we scrape from there?

katesanalyst commented 1 year ago

Can you shed some light on this? How do we scrape from there?

Use the requests_html or bs4 libraries with Python. This code extracts Twitter data via Nitter, since per Twitter's policy direct extraction/scraping from their source is denied, whereas some alternatives remain available.

import sys
import os
from datetime import datetime

import pandas as pd
from requests_html import HTMLSession

sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')

def tExtract(TwtpageUrl):
    # Extract the tweets on one Nitter timeline page.
    responseExtract = session.get(TwtpageUrl)
    for idx, twx in enumerate(responseExtract.html.find('div.tweet-body')):
        # This data is not important:
        # if twx.find('div.icon-container'):
        #     pinned = twx.find('div.icon-container', first=True).text

        date = twx.find('span.tweet-date a', first=True).text

        tImageUrl = ''  # default so the key exists even for tweets without images
        try:
            if twx.find('img'):
                tImage = twx.find('img')[1].attrs['src']
                tImageUrl = f"https://nitter.c-r-t.tk{tImage}"
        except (IndexError, KeyError):
            pass

        tContent = twx.find('div.tweet-content.media-body', first=True).text.strip()

        twts = {'Date': date, 'ImgUrl': tImageUrl, 'Content': str(tContent),
                'Replied': twx.find('span.tweet-stat')[0].text,
                'Retweet': twx.find('span.tweet-stat')[1].text,
                'Quote': twx.find('span.tweet-stat')[2].text,
                'Like': twx.find('span.tweet-stat')[3].text}

        twtData.append(twts)

def baseExt(url):
    # Extract the profile card (name, bio, counts) from the user page.
    response = session.get(url)
    try:
        location = response.html.find('div.profile-location span')[2].text
        xb = response.html.find('div.profile-bio p', first=True).text
    except (IndexError, AttributeError):
        location = 'NA'
        xb = 'NA'

    profile = {'profileName': response.html.find('a.profile-card-fullname', first=True).text,
               'userName': response.html.find('a.profile-card-username', first=True).text,
               'bio': xb,
               'Location': location,
               'doJoined': response.html.find('div.profile-joindate', first=True).text,
               'Tweets #': response.html.find('li.posts span')[1].text,
               'Following': response.html.find('li.following span')[1].text,
               'Followers': response.html.find('li.followers span')[1].text,
               'Likes': response.html.find('li.likes span')[1].text}

    profileTweeter.append(profile)
    print(profile)

def pageCheck(url, pge):
    # Follow the "show more" link recursively to walk all timeline pages.
    response = session.get(url)
    urlCollection.append(response.url)
    tExtract(response.url)
    pge += 1

    try:
        checkUrl = ''.join(response.html.find('div[class="show-more"] a', first=True).absolute_links)
        if checkUrl:
            twtpageCheck = f"https://nitter.c-r-t.tk/{username}?{checkUrl.split('?')[-1]}"
            print(f"Total Data: {len(twtData)} Page: {pge} URL: {twtpageCheck}")
            pageCheck(twtpageCheck, pge)
    except AttributeError:
        # No "show more" link on the last page.
        print('Done!!!')

def main():
    baseExt(url)
    pageCheck(url, pge)

    print(f"\nTotal no. of pages: {len(urlCollection)} approximately...")
    # tExtract(urlCollection)  # testing purposes

    if len(twtData) != 0:
        profileDetails = pd.DataFrame(profileTweeter)
        tweetDetails = pd.DataFrame(twtData)
        fxname = f'tweetData_{username}_{datetime.now():%Y-%m-%d %H-%M-%S}.xlsx'
        with pd.ExcelWriter(fxname, engine='xlsxwriter') as writer:
            profileDetails.to_excel(writer, sheet_name='TweeterData', index=False)
            tweetDetails.to_excel(writer, sheet_name=f'Tweets{username}', index=False)
        print('Downloaded successfully.')
        print(f'\nFile downloaded and available:\n{os.path.join(os.getcwd(), fxname)}')

if __name__ == '__main__':
    un = ['katesanalyst']
    username = un[0]  # was un[1], which raises IndexError on a one-element list
    url = f'https://nitter.c-r-t.tk/{username}'

    session = HTMLSession()
    urlCollection = []
    twtData = []
    profileTweeter = []
    pge = 0
    main()
katesanalyst commented 1 year ago

Can you shed some light on this? How do we scrape from there?

(Same code as in my previous comment above.)

It's just a plain base and can be adapted as needed.

iGrumpyMonk commented 1 year ago

Can you shed some light on this? How do we scrape from there?

Use the requests_html or bs4 libraries with Python... (full code quoted in the comments above)

Thanks man, I really, really appreciate it. You might have just saved my ass.

polinaice commented 1 year ago

How do I modify this code? I don't understand what to do, please help. I need data about the daily number of tweets with certain hashtags.

import snscrape.modules.twitter as sntwitter
import pandas as pd
from tqdm import tqdm

# hashtags = ["#bitcoin", "#ethereum"]
hashtags = ["#bitcoin"]

query = " OR ".join(hashtags)

params_query = " since:2022-05-01 until:2022-06-01 lang:en"

len(query + params_query)  # notebook-style check of the combined query length

# Creating list to append daily counts to
tweets_list2 = []
current_day = 0
current_date = 0
count = 0
# Using TwitterSearchScraper to scrape data and append tweets to list
for i,tweet in enumerate(sntwitter.TwitterSearchScraper(query + params_query).get_items()):
    if current_day!=tweet.date.day:
#         tweets_list2.append({"date":current_date, "number":count})
        tweets_list2.append([current_date, count, hashtags[0]])
        count=1
        current_day=tweet.date.day
        current_date=tweet.date
    else:
        count+=1
    if i%1000 == 0:
        print(i)
        print(current_day)
        print(tweet.date.day)
        print(tweet.date)
    # if i==100:
    #      break
# tweets_list2.append({"date":current_date, "number":count})
tweets_list2.append([current_date, count, hashtags[0]])
# Creating a dataframe from the tweets list above
# tweets_df2 = pd.DataFrame(tweets_list2, columns=['Datetime', 'Tweet Id', 'Text', 'Username'])

tweets_list2 = tweets_list2[1:]  # drop the placeholder row appended on the first day change

tweets_df = pd.DataFrame(tweets_list2, columns=['Datetime', 'Number','Hashtags'])

tweets_df.to_csv("./sample_data/"+ params_query.strip().replace(" ", "_").replace(":", "-") + ".csv")

del tweets_df
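
For reference, a more compact sketch of the same daily-count idea using a date-keyed Counter (it will still fail with the 403 above until search access is restored; the query string and output filename are just illustrations):

import snscrape.modules.twitter as sntwitter
import pandas as pd
from collections import Counter

query = "#bitcoin since:2022-05-01 until:2022-06-01 lang:en"

# Count tweets per calendar day; tweet.date is a timezone-aware datetime.
daily_counts = Counter()
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    daily_counts[tweet.date.date()] += 1

df = pd.DataFrame(sorted(daily_counts.items()), columns=["Datetime", "Number"])
df["Hashtags"] = "#bitcoin"
df.to_csv("daily_counts.csv", index=False)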
nerra0pos commented 1 year ago

It seems that it's now coming to an end. Fck Elon Musk, ngl, he wrecked my journey. https://nitter.net/search?f=tweets&q=testing&since=&until=&near= Nitter seems to be working for searches! (Maybe hope for a future workaround.)

Is this the Nitter commit that kept search working? zedeus/nitter@5dd85c6

Seems to switch from the search endpoint to the GraphQL one.

Interesting.

anastasita308 commented 1 year ago

I tried to do Nitter, but I get a connection error every time? Not sure if I am doing something wrong.

anastasita308 commented 1 year ago

I tried to do Nitter, but I get a connection error every time? Not sure if I am doing something wrong.

Basically, that's my code, which is based on what @katesanalyst kindly provided above. But every time, my code stops at the request response and I get a connection timeout error. I would be extremely grateful if someone could help, because I have coursework due Monday for which I REALLY need Twitter data.

from requests_html import HTMLSession
from urllib.parse import quote
import pandas as pd
import os

# Module-level so tExtract() can see them; in the original they were local to
# main(), which raises a NameError before any network problem can even appear.
session = HTMLSession()
twtData = []

def tExtract(url):
    response = session.get(url, timeout=30)

    for twx in response.html.find('div.tweet'):
        username = twx.find('div.username')[0].text
        date = twx.find('span.tweet-date a', first=True).text
        text = twx.find('div.tweet-content')[0].text.strip()

        twts = {'Username': username, 'Date': date, 'Text': text}
        twtData.append(twts)

def main():
    query = '(Plastic OR Environment OR Pollution OR Packaging OR Waste OR Climate OR Sustainability) (@Unilever) until:2019-10-05 since:2019-09-29'
    # URL-encode the query; raw spaces, parentheses, and colons are not valid in a URL.
    url = f'https://nitter.net/search?f=tweets&q={quote(query)}'

    tExtract(url)
    tweetDetails = pd.DataFrame(twtData)
    fxname = 'unilever1.xlsx'
    # Excel sheet names are capped at 31 characters and forbid ':', so don't embed the query.
    tweetDetails.to_excel(fxname, sheet_name='Tweets', index=False)
    print('Done')
    print(f'{os.path.join(os.getcwd(), fxname)}')

if __name__ == '__main__':
    main()
samtin0x commented 1 year ago

Twitter has banned unauthenticated access to the search endpoint. Solution is to use their GraphQL endpoint - this PR seems to solve the issue.

myt-christine commented 1 year ago

Twitter has banned unauthenticated access to the search endpoint. Solution is to use their GraphQL endpoint - this PR seems to solve the issue.

Hi, can I connect with you on this?

anastasita308 commented 1 year ago

Twitter has banned unauthenticated access to the search endpoint. Solution is to use their GraphQL endpoint - this PR seems to solve the issue.

Thanks! Sorry to ask, but would you have examples of using it? I am not sure how to install and use it.

Hesko123 commented 1 year ago

Twitter has banned unauthenticated access to the search endpoint. Solution is to use their GraphQL endpoint - this PR seems to solve the issue.

Yeah, I figured out that Nitter could still display the latest tweets. Then I saw this commit, and it was pretty smart.

I am pretty scared they will find a way to block GraphQL calls. Is that possible for them, or can we start calming down about Elon's tyranny?

Hesko123 commented 1 year ago

It seems that it's now coming to an end. Fck Elon Musk, ngl, he wrecked my journey. https://nitter.net/search?f=tweets&q=testing&since=&until=&near= Nitter seems to be working for searches! (Maybe hope for a future workaround.)

Is this the Nitter commit that kept search working? zedeus/nitter@5dd85c6

Seems to switch from the search endpoint to the GraphQL one.

Sorry, I didn't see your message. Indeed, yes. However, I am a bit concerned that they might also block GraphQL calls, since we'd still be using Twitter's v1.1 API in this case (I hope I am wrong?).

They just need to close it and we are completely out again...

samtin0x commented 1 year ago

@Hesko123 Nothing lasts forever... @myt-christine, @anastasita308: I'm currently working on a Python implementation of Nitter's GraphQL approach; I don't have it yet.

Let me know if someone else is working on this.

Hesko123 commented 1 year ago

@Hesko123 Nothing lasts forever... @myt-christine, @anastasita308: I'm currently working on a Python implementation of Nitter's GraphQL approach; I don't have it yet. Let me know if someone else is working on this.

By the way, I guess using their GraphQL endpoint could bring rate-limit issues?

nicolegorton commented 1 year ago

If our use case does not require the search bar — for example if we’re going directly to a user page and pulling tweets from the last 12 hours — will snscrape still work?

zedeus commented 1 year ago

Please do not scrape nitter.net, just host your own instance. It's very easy.

Hesko123 commented 1 year ago

Please do not scrape nitter.net, just host your own instance. It's very easy.

Hey zedeus!

Would you mind explaining how we could use our own instance?

Also, is there any possibility for Twitter to stop people from using their GraphQL endpoint? That workaround seems to be working in your case with Nitter, but could it work for snscrape? And does using the GraphQL endpoint bring rate-limit issues with it?

zedeus commented 1 year ago

I've just merged the GraphQL update into master, so you can do it by following this simple guide: https://github.com/zedeus/nitter#installation

Twitter can probably make a breaking change, hard to predict. It's less of a workaround and more using a new endpoint that isn't officially used yet. GraphQL endpoints each have a rate limit of 500 requests per guest token, so if you manage it wisely you will basically never hit any rate limits.

nicolegorton commented 1 year ago

@dengkefeng Already supported by the twitter-profile scraper.

The profile page also includes the user's latest tweets; does the twitter-profile scraper support getting these tweets?

Supported and it is working.

Would you be able to provide an example here of what you’ve got working? Thank you!!

Hesko123 commented 1 year ago

I've just merged the GraphQL update into master, so you can do it by following this simple guide: https://github.com/zedeus/nitter#installation

Twitter can probably make a breaking change, hard to predict. It's less of a workaround and more using a new endpoint that isn't officially used yet. GraphQL endpoints each have a rate limit of 500 requests per guest token, so if you manage it wisely you will basically never hit any rate limits.

What would count as a request, for example? Is asking GraphQL for the followers list of a Twitter account counted as only one request, or does each answer (e.g., each account in the list) count as a request? I mean, if someone has 500 followers, would you already hit the rate limit?

zedeus commented 1 year ago

It's per request, not per answer; it's not actually a real GraphQL API. You can check the response header x-rate-limit-remaining.
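
To illustrate, a rough sketch of honoring that header around arbitrary requests (the helper name and back-off policy here are illustrative, not part of Nitter or Twitter):

import time
import requests

def get_with_rate_limit(session: requests.Session, url: str, **kwargs) -> requests.Response:
    # Issue a GET; if the per-token budget is exhausted, sleep until the
    # window resets (x-rate-limit-reset is epoch seconds on Twitter responses).
    r = session.get(url, **kwargs)
    remaining = r.headers.get("x-rate-limit-remaining")
    reset = r.headers.get("x-rate-limit-reset")
    if remaining is not None and int(remaining) == 0 and reset is not None:
        time.sleep(max(0, int(reset) - time.time()))
    return r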

Hesko123 commented 1 year ago

It's per request, not per answer; it's not actually a real GraphQL API. You can check the response header x-rate-limit-remaining.

Okay, so asking, for example, to display the latest tweets containing the keywords "artificial intelligence" would count as one request?

Thank you for your amazing work, by the way! If everything works out well, I'll consider supporting you; you deserve it. I hadn't heard about Nitter until today.

samtin0x commented 1 year ago

@zedeus Can you provide a sample request to search tweets? I think getGraphSearch is the right function to call, but we want just that and not the frontend pieces as well. It seems the website is rendered server-side (as in Django) and there are no XHR/fetch calls.

Write commented 1 year ago

If our use case does not require the search bar — for example if we’re going directly to a user page and pulling tweets from the last 12 hours — will snscrape still work?

You can easily check that; yes, indeed, this still works for now, with a command like this:

snscrape --max-results 10 --jsonl twitter-profile Thom_astro | jq '.content'

Unfortunately, it'll miss retweets, as those were never returned on this endpoint IIRC.

zedeus commented 1 year ago

@Hesko123 Indeed, just one "request" per actual request you make.

@samtin0x To turn Nitter into an API server instead of a front-end, you can make it return JSON instead of HTML.

diff --git a/src/routes/search.nim b/src/routes/search.nim
index 02c14e3..b2ceefc 100644
--- a/src/routes/search.nim
+++ b/src/routes/search.nim
@@ -11,6 +11,12 @@ include "../views/opensearch.nimf"

 export search

+import jsony, times
+export jsony
+
+proc dumpHook*(s: var string, v: DateTime) =
+  s.add $v.toTime().toUnix()
+
 proc createSearchRouter*(cfg: Config) =
   router search:
     get "/search/?":
@@ -37,8 +43,7 @@ proc createSearchRouter*(cfg: Config) =
         let
           tweets = await getGraphSearch(query, getCursor())
           rss = "/search/rss?" & genQueryUrl(query)
-        resp renderMain(renderTweetSearch(tweets, prefs, getPath()),
-                        request, cfg, prefs, title, rss=rss)
+        resp tweets.toJson
       else:
         resp Http404, showError("Invalid search", cfg)
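
A hedged sketch of calling the patched instance from Python (it assumes a self-hosted instance at localhost:8080; the JSON shape mirrors Nitter's internal timeline types and may change):

import requests

# Query a self-hosted Nitter instance patched as above to return JSON.
base = "http://localhost:8080"  # assumption: your own instance, not nitter.net
r = requests.get(f"{base}/search",
                 params={"f": "tweets", "q": "test", "since": "2023-04-19"})
r.raise_for_status()
data = r.json()  # inspect the structure before relying on any field names
print(str(data)[:200])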
Hesko123 commented 1 year ago

@Hesko123 Indeed, just one "request" per actual request you make.

@samtin0x To turn Nitter into an API server instead of a front-end, you can make it return JSON instead of HTML. (diff quoted above)

That's really awesome.

So I heard you're also bringing an unofficial API (it's on your roadmap). Would that reduce our chances of being stopped by Elon?

I'll join the Matrix channel when I am back home.

By the way, is there any way to use a "since" filter with Nitter, like if we only want tweets from yesterday? Can you share a sample that would let me get only the last 24 hours of tweets?

Thank you zed

nerra0pos commented 1 year ago

Unfortunately, Nitter does not support the list filter parameter in its search, i.e., list:1621518339981070336, to get only tweets from that Twitter list.

zedeus commented 1 year ago

@Hesko123 Sure. To make it easier, you can create a search query at nitter.net/search and then copy the URL. For example, that results in: https://nitter.net/search?f=tweets&q=test&since=2023-04-19&until=&near= Now just remove the empty query parameters (&until=&near=) or keep them; it doesn't matter.

@terrapop That seems to be a limitation of the new GraphQL search endpoint. Nitter does, however, support lists normally: https://nitter.net/i/lists/1621518339981070336 You can apply a similar change as above to src/routes/list.nim (here it would be resp timeline.toJson on line 43 instead of respList, along with the imports/exports and the dumpHook proc).
@terrapop That seems to be a limitation of the new GraphQL search endpoint. Nitter does however support lists normally: https://nitter.net/i/lists/1621518339981070336 You can apply a similar change as above to src/routes/list.nim (here it would be resp timeline.toJson on line 43 instead of respList, along with the imports/exports and dumpHook proc)