JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0
4.31k stars 698 forks

All Twitter scrapes are failing: `blocked (404)` #996

Open JustAnotherArchivist opened 1 year ago

JustAnotherArchivist commented 1 year ago

With the exception of twitter-trends, all Twitter scrapes are failing since sometime in the past hour. This is likely connected to Twitter as a whole getting locked behind a login wall since earlier today. There is no known workaround at this time, and it's not known whether this will be fixable.

JustAnotherArchivist commented 1 year ago

No, my previous comment still applies.

The thing is that I don't want to have to (read: don't have time to) change everything again in two days when Elon has another one of his brilliant ideas, so I'm waiting for the dust to settle down a bit.

prasunshrestha commented 1 year ago

Yes, it is a different endpoint which only returns the single requested tweet, no replies or the replied-to tweet.

I will be happy if only I could get the content of the tweet. Please correct me if I am wrong, but the only way possible I see at the moment (without authentication) is through the embedded tweets. As a result, if only I could get the post IDs, I can get the content. Is there a way possible at the moment?

ihabpalamino commented 1 year ago

No, my previous comment still applies.

The thing is that I don't want to have to (read: don't have time to) change everything again in two days when Elon has another one of his brilliant ideas, so I'm waiting for the dust to settle down a bit.

Okay, thank you, brother. I hope it won't take a long time; I really need this for my project.

ghost commented 1 year ago

Yes, it is a different endpoint which only returns the single requested tweet, no replies or the replied-to tweet.

Ah, indeed, I didn't notice there's no replies. As for replied-to tweet, I see there's in_reply_to_status_id_str field to get replied-to tweet, quoted_status_id_str to get quoted tweet, and conversation_id_str to get conversation root tweet (not sure), so it may be solved with another request, I suppose, if needed. Yet, well, it might stay this way, or it might not. We can only observe for now.
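
For anyone who wants to follow those links, here is a minimal sketch of pulling the three related IDs out of an already-decoded response. It assumes the fields sit at the top level of the JSON object, which this thread does not confirm; the helper name `related_tweet_ids` and the label keys are made up for illustration.

```python
# Field names are the ones mentioned in the comment above. Their position
# (top level of the decoded JSON) is an assumption, not confirmed here.
RELATED_ID_FIELDS = {
    "replied_to": "in_reply_to_status_id_str",
    "quoted": "quoted_status_id_str",
    "conversation_root": "conversation_id_str",
}


def related_tweet_ids(tweet: dict) -> dict:
    """Return the related tweet IDs that are present in a decoded payload."""
    found = {}
    for label, field in RELATED_ID_FIELDS.items():
        value = tweet.get(field)
        if value:  # skip fields that are missing, None, or empty
            found[label] = str(value)
    return found
```

Each returned ID would then need its own request to the single-tweet endpoint, so this only helps when a handful of extra lookups is acceptable.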

nickchen120235 commented 1 year ago

https://github.com/zedeus/nitter/issues/919#issuecomment-1620918391

Maybe this could provide some help?

Disclaimer: This may be broken by the Tesla guy anytime, so proceed with caution

manuelhb14 commented 1 year ago

https://twitter.com/TitterDaily/status/1676624363787894784?s=20

👀

eminekahveci commented 1 year ago

While I was collecting data for my thesis, my Twitter developer account was suddenly closed. I have little time left and desperately need my Twitter data. I couldn't get the scraper to run here; is that my fault?

PVasyl commented 1 year ago

https://twitter.com/TitterDaily/status/1676624363787894784?s=20

👀

No, it does not work, unfortunately

mukaddim98 commented 1 year ago

Hello,

This may or may not help. Here's a route to access Tweets without logging in (contains further iframe to platform.twitter.com):

https://cdn.embedly.com/widgets/media.html?type=text%2Fhtml&key=a19fcc184b9711e1b4764040d3dc5c07&schema=twitter&url=https://twitter.com/elonmusk/status/1674865731136020505

Would combining this with a pre-existing list of Tweets allow data scraping to continue? Alternatively users could build the tweet list using google search, e.g. for Tesla tweets: "site:twitter.com/tesla/status" or via another cached list (e.g. Waybackmachine - https://web.archive.org/web//https://twitter.com/tesla/status)

If I'm off the mark, I apologise but thought I'd pass this on, on the off chance it may help at least as a temporary measure.

Just a note to @JustAnotherArchivist - thank you for the hard work you have put into this library - it is very much appreciated

Ben

URL: https://cdn.syndication.twimg.com/tweet-result

CODE:

import requests

url = "https://cdn.syndication.twimg.com/tweet-result"

querystring = {"id":"1652193613223436289","lang":"en"}

payload = ""
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/114.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Origin": "https://platform.twitter.com",
    "Connection": "keep-alive",
    "Referer": "https://platform.twitter.com/",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "cross-site",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "TE": "trailers"
}

response = requests.request("GET", url, data=payload, headers=headers, params=querystring)

print(response.text)

Generated by Insomnia

Hi, thanks for the code. This works amazingly. However, instead of scraping a single tweet using the tweet id, I want to get multiple tweets based on a search string, with start and end dates, and a tweet limit of how many tweets it should scrape. What changes do I have to make to the query string? I searched online, but could not find documentation on this. I made an assignment for the students at a university and that depended on the snscrape library. Any guidance is appreciated.
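
On the "pre-existing list of Tweets" idea quoted above: the single-tweet endpoint needs IDs, so one possible first step is to collect IDs from status URLs gathered elsewhere (search results, cached pages, exported bookmarks). The sketch below only does that extraction. The regular expression and the function name `extract_tweet_ids` are illustrative assumptions, and this does not provide search by keyword or date.

```python
import re

# Matches URLs such as https://twitter.com/<user>/status/<digits>.
# The optional "www." / "mobile." prefixes are an assumption about which
# URL variants may show up in collected text.
STATUS_URL_RE = re.compile(
    r"https?://(?:www\.|mobile\.)?twitter\.com/\w+/status/(\d+)"
)


def extract_tweet_ids(text: str) -> list[str]:
    """Return unique tweet IDs found in text, in order of first appearance."""
    seen = set()
    ids = []
    for match in STATUS_URL_RE.finditer(text):
        tweet_id = match.group(1)
        if tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids
```

Each ID could then be passed as the `id` value in the `querystring` of the snippet above, one request per tweet.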

DanRev111212 commented 12 months ago

Any update?

ihabpalamino commented 12 months ago

Hello, This may or may not help. Here's a route to access Tweets without logging in (contains further iframe to platform.twitter.com): https://cdn.embedly.com/widgets/media.html?type=text%2Fhtml&key=a19fcc184b9711e1b4764040d3dc5c07&schema=twitter&url=https://twitter.com/elonmusk/status/1674865731136020505 Would combining this with a pre-existing list of Tweets allow data scraping to continue? Alternatively users could build the tweet list using google search, e.g. for Tesla tweets: "site:twitter.com/tesla/status" or via another cached list (e.g. Waybackmachine - https://web.archive.org/web//https://twitter.com/tesla/status) If I'm off the mark, I apologise but thought I'd pass this on, on the off chance it may help at least as a temporary measure. Just a note to @JustAnotherArchivist - thank you for the hard work you have put into this library - it is very much appreciated Ben

URL: https://cdn.syndication.twimg.com/tweet-result CODE:

import requests

url = "https://cdn.syndication.twimg.com/tweet-result"

querystring = {"id":"1652193613223436289","lang":"en"}

payload = ""
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/114.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Origin": "https://platform.twitter.com",
    "Connection": "keep-alive",
    "Referer": "https://platform.twitter.com/",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "cross-site",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "TE": "trailers"
}

response = requests.request("GET", url, data=payload, headers=headers, params=querystring)

print(response.text)

Generated by Insomnia

Hi, thanks for the code. This works amazingly. However, instead of scraping a single tweet using the tweet id, I want to get multiple tweets based on a search string, with start and end dates, and a tweet limit of how many tweets it should scrape. What changes do I have to make to the query string? I searched online, but could not find documentation on this. I made an assignment for the students at a university and that depended on the snscrape library. Any guidance is appreciated.

Yes please, same question. Also, is it possible to scrape within a date range?

sabeehsaeed commented 12 months ago

Hello, This may or may not help. Here's a route to access Tweets without logging in (contains further iframe to platform.twitter.com): https://cdn.embedly.com/widgets/media.html?type=text%2Fhtml&key=a19fcc184b9711e1b4764040d3dc5c07&schema=twitter&url=https://twitter.com/elonmusk/status/1674865731136020505 Would combining this with a pre-existing list of Tweets allow data scraping to continue? Alternatively users could build the tweet list using google search, e.g. for Tesla tweets: "site:twitter.com/tesla/status" or via another cached list (e.g. Waybackmachine - https://web.archive.org/web//https://twitter.com/tesla/status) If I'm off the mark, I apologise but thought I'd pass this on, on the off chance it may help at least as a temporary measure. Just a note to @JustAnotherArchivist - thank you for the hard work you have put into this library - it is very much appreciated Ben

URL: https://cdn.syndication.twimg.com/tweet-result CODE:

import requests

url = "https://cdn.syndication.twimg.com/tweet-result"

querystring = {"id":"1652193613223436289","lang":"en"}

payload = ""
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/114.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Origin": "https://platform.twitter.com",
    "Connection": "keep-alive",
    "Referer": "https://platform.twitter.com/",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "cross-site",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "TE": "trailers"
}

response = requests.request("GET", url, data=payload, headers=headers, params=querystring)

print(response.text)

Generated by Insomnia

Hi, thanks for the code. This works amazingly. However, instead of scraping a single tweet using the tweet id, I want to get multiple tweets based on a search string, with start and end dates, and a tweet limit of how many tweets it should scrape. What changes do I have to make to the query string? I searched online, but could not find documentation on this. I made an assignment for the students at a university and that depended on the snscrape library. Any guidance is appreciated.

Yes please, same question. Also, is it possible to scrape within a date range?

AFAIK

  1. As of now, Twitter allows accessing only the tweet content (not the associated replies) via the TweetResultByRestId GraphQL API without logging in. This API requires you to have the tweet ID beforehand.
  2. To access the "search by query" endpoint, you would either have to log in before running any scraping task or use the official Twitter API (the free tier does not allow read requests; the $100 basic tier allows reading tweets with a monthly cap of 10,000).
  3. Logging in and querying the search/usertweet endpoints may result in your account getting banned. They are also rate limited at 50 requests per 15 minutes per endpoint.

I'm not going to refer to any other tools that support authentication, but you'll find some online that build on the snscrape and twint models to add user authentication and GQL endpoint querying.

NOTE: Anybody attempting to do so should research and understand the risks and liabilities associated with scraping with an authenticated user.
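
To put the rate limit in point 3 into numbers: 50 requests per 15 minutes works out to at most one request every 18 seconds per endpoint if requests are spread evenly. A minimal pacing sketch follows. The names `min_interval_seconds` and `Pacer` are made up for illustration, the 50-per-15-minutes figure is taken from the comment above and not verified, and pacing does nothing about the account-ban risk.

```python
import time


def min_interval_seconds(max_requests: int, window_seconds: float) -> float:
    """Smallest even spacing that stays within max_requests per window."""
    if max_requests <= 0:
        raise ValueError("max_requests must be positive")
    return window_seconds / max_requests


class Pacer:
    """Blocks so that consecutive calls to wait() are at least `interval` apart."""

    def __init__(self, interval: float, clock=time.monotonic, sleep=time.sleep):
        self._interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been made yet

    def wait(self) -> float:
        """Sleep if needed; return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self._interval
        return delay
```

Usage would be one `Pacer(min_interval_seconds(50, 15 * 60))` per endpoint, with `wait()` called before each request.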

drcarademono commented 12 months ago

Hello, This may or may not help. Here's a route to access Tweets without logging in (contains further iframe to platform.twitter.com): https://cdn.embedly.com/widgets/media.html?type=text%2Fhtml&key=a19fcc184b9711e1b4764040d3dc5c07&schema=twitter&url=https://twitter.com/elonmusk/status/1674865731136020505 Would combining this with a pre-existing list of Tweets allow data scraping to continue? Alternatively users could build the tweet list using google search, e.g. for Tesla tweets: "site:twitter.com/tesla/status" or via another cached list (e.g. Waybackmachine - https://web.archive.org/web//https://twitter.com/tesla/status) If I'm off the mark, I apologise but thought I'd pass this on, on the off chance it may help at least as a temporary measure. Just a note to @JustAnotherArchivist - thank you for the hard work you have put into this library - it is very much appreciated Ben

URL: https://cdn.syndication.twimg.com/tweet-result

CODE:

import requests

url = "https://cdn.syndication.twimg.com/tweet-result"

querystring = {"id":"1652193613223436289","lang":"en"}

payload = ""
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/114.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Origin": "https://platform.twitter.com",
    "Connection": "keep-alive",
    "Referer": "https://platform.twitter.com/",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "cross-site",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "TE": "trailers"
}

response = requests.request("GET", url, data=payload, headers=headers, params=querystring)

print(response.text)

Generated by Insomnia

When I use this code on a retweet, it ONLY returns the original tweeter's username in the JSON. Is there any way to return the retweeter's username?

For example, putting this Tweet ID into the code (1499332226177286144, a retweet by AaronBell4NUL) returns information about the original tweet (1499056854688841732, a tweet by NewsNBC). AaronBell4NUL is not reported in the JSON that's returned, even though I entered his Tweet ID. However, I'd like to be able to enter the retweet's Tweet ID and return AaronBell4NUL.

locfinessemonster commented 12 months ago

Here is what I have gathered about what snscrape twitter scraping may look like in the future, if anyone could confirm, deny, or add details that would be awesome.

There is some optimism that it is possible for twitter users to be queried by screen name and that their syndication feeds can be captured. These syndication feeds would consist of up to 20 tweets per user (no retweets or replies) and wouldn't be subject to rate limiting? These updates are pending until the dust on the Twitter changes settles a bit.

itsnotmethistime commented 12 months ago

https://github.com/zedeus/nitter/pull/927/commits

Nitter has updated endpoints, could be useful for people to mess around with.

I'm curious if this is going to just be a never ending cycle of cat and mouse.

Current state of twitter sucks.

ihabpalamino commented 11 months ago

Hello guys, hello @JustAnotherArchivist, any update or news about the issue?

4yR1l1k commented 11 months ago

https://github.com/zedeus/nitter/pull/927/commits

Nitter has updated endpoints, could be useful for people to mess around with.

I'm curious if this is going to just be a never ending cycle of cat and mouse.

Current state of twitter sucks.

Yet keyword scraping is not possible, as far as I understand.

ihabpalamino commented 11 months ago

Hello guys, hello @JustAnotherArchivist, any update or news about the issue?

waiting for an answer guys

aditya-bansal-7 commented 11 months ago

When will the API endpoints be free of the lockdown? Any idea?

ares-osint commented 11 months ago

Nitter keyword scraping is possible now. I just tried.

JustAnotherArchivist commented 11 months ago

Yeah, I saw, it got implemented in https://github.com/zedeus/nitter/commit/67203a431d9242e2f0d5fbdbfa86ba55f0a4cc54.

JustAnotherArchivist commented 11 months ago

Looks like things have been fairly stable for a few days now. I'm not sure yet when I'll have time to implement the necessary changes, possibly on the weekend.

ihabpalamino commented 11 months ago

Thanks for your work, @JustAnotherArchivist; you are really saving our lives.

MrCube21 commented 11 months ago

Is there a way to donate for your work, @JustAnotherArchivist?

anasmcl commented 11 months ago

It seems that Nitter keyword search shows results for the last 10 days only. Maybe that is a limit of the new endpoint? I hope it will be lifted.

https://github.com/zedeus/nitter/issues/938

Unixcision commented 11 months ago

Looks like things have been fairly stable for a few days now. I'm not sure yet when I'll have time to implement the necessary changes, possibly on the weekend.

Hey @JustAnotherArchivist, any update on this? Thanks!

DanRev111212 commented 11 months ago

I literally have my final year project depending on this module. Please save us @JustAnotherArchivist

ghost commented 11 months ago

If for some reason you absolutely need to use snscrape to get a single tweet, and you don't mind getting it by tweet ID purely in code, you can do it like this:

from snscrape.base import ScraperException
from snscrape.modules.twitter import (
    Tweet,
    _TwitterAPIType,
    TwitterTweetScraper,
    TwitterTweetScraperMode,
)

def get_items(self):
    variables = {
        "tweetId": str(self._tweetId),
        "includePromotedContent": True,
        "withCommunity": True,
        "withVoice": True,
        # !!! these fields may be deprecated
        # "with_rux_injections": False,
        # "withQuickPromoteEligibilityTweetFields": True,
        # "withBirdwatchNotes": False,
        # "withV2Timeline": True,
    }
    features = {
        "responsive_web_graphql_exclude_directive_enabled": True,
        "verified_phone_label_enabled": False,
        "creator_subscriptions_tweet_preview_api_enabled": False,
        "responsive_web_graphql_timeline_navigation_enabled": True,
        "responsive_web_graphql_skip_user_profile_image_extensions_enabled": False,
        "tweetypie_unmention_optimization_enabled": True,
        "responsive_web_edit_tweet_api_enabled": True,
        "graphql_is_translatable_rweb_tweet_is_translatable_enabled": True,
        "view_counts_everywhere_api_enabled": True,
        "longform_notetweets_consumption_enabled": True,
        "tweet_awards_web_tipping_enabled": False,
        "freedom_of_speech_not_reach_fetch_enabled": True,
        "standardized_nudges_misinfo": True,
        "tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled": False,
        "longform_notetweets_rich_text_read_enabled": True,
        "longform_notetweets_inline_media_enabled": False,
        "responsive_web_enhance_cards_enabled": False,
        "responsive_web_twitter_article_tweet_consumption_enabled": False,  # new?
        "responsive_web_media_download_video_enabled": True,  # new?
        # !!! these fields may be deprecated
        # "rweb_lists_timeline_redesign_enabled": False,
        # "vibe_api_enabled": True,
        # "interactive_text_enabled": True,
        # "blue_business_profile_image_shape_enabled": True,
        # "responsive_web_text_conversations_enabled": False,
    }
    fieldToggles = {
        "withArticleRichContentState": True,
        "withAuxiliaryUserLabels": True,
    }
    params = {
        "variables": variables,
        "features": features,
        "fieldToggles": fieldToggles,  # seems optional
    }
    url = "https://twitter.com/i/api/graphql/3HC_X_wzxnMmUBRIn3MWpQ/TweetResultByRestId"
    if self._mode is TwitterTweetScraperMode.SINGLE:
        obj = self._get_api_data(url, _TwitterAPIType.GRAPHQL, params=params)
        if not obj["data"]["tweetResult"]:
            return
        yield self._graphql_timeline_tweet_item_result_to_tweet(
            obj["data"]["tweetResult"]["result"], tweetId=self._tweetId
        )

# replace this method
TwitterTweetScraper.get_items = get_items

def get_using_snscrape(tweet_id: int) -> Tweet | None:
    print("Sending API request...")
    try:
        for tweet in TwitterTweetScraper(tweet_id).get_items():
            print("Response: %r." % tweet)
            return tweet
        print("No response from public API.")
    except ScraperException:
        print("Scraping failed.")

Just pass the tweet ID to the get_using_snscrape function, and it will return a Tweet instance if there's anything to return, or None otherwise. Obviously, it is not the best way to do it, but at least it works. You can also adapt the results of https://github.com/JustAnotherArchivist/snscrape/issues/996#issuecomment-1615937362 to your needs.

ihabpalamino commented 11 months ago

There is no way to get tweets by username and date range (since/until), is there?

TobiasGarbelini commented 11 months ago

Hello, I'm still having this problem (CRITICAL:snscrape.base:Errors: blocked (404)), do you have any solution?

MerleLiuKun commented 11 months ago

It seems that Twitter has once again discontinued access to these APIs, such as UserByRestId, UserTweets, and others.

leockl commented 11 months ago

@JustAnotherArchivist if you implement snscrape using the GraphQL API like Nitter did, will snscrape also run into the following issue with keyword search?

https://github.com/JustAnotherArchivist/snscrape/issues/996#issuecomment-1635674208

leockl commented 11 months ago

Does anyone know if there is a forked snscrape library that uses Twitter login, or something similar, that works for keyword search and can scrape historical data?

JustAnotherArchivist commented 11 months ago

@leockl Almost certainly, yes.

ihabpalamino commented 11 months ago

@leockl Almost certainly, yes.

Hello @JustAnotherArchivist, is a fix for the blocked (404) error going to be implemented soon?

mouh2020 commented 11 months ago

Is there a library which uses twitter login to scrape?

nerra0pos commented 11 months ago

Update: It seems profiles and the tweets on profiles are available to the public without login again.

@JustAnotherArchivist have you seen that? Do the endpoints work again?

tatta123 commented 11 months ago

Update: It seems profiles and the tweets on profiles are available to the public without login again.

@JustAnotherArchivist have you seen that? Do the endpoints work again?

FR????

lf-11 commented 11 months ago

Thanks for spotting! Yes you can view a profile again but it is very strange. When I google an account and try to access it, not working. When I click on a tweet shown in Google results and from there click on the account, the account's tweets are shown. But strangely, only tweets from before 2023 and not in chronological order. I hardly think this is a "stable" feature of twitter but perhaps a sign that they are currently tweaking some things regarding how one can view a profile without being signed in.

yeahjack commented 11 months ago

Thanks for spotting! Yes you can view a profile again but it is very strange. When I google an account and try to access it, not working. When I click on a tweet shown in Google results and from there click on the account, the account's tweets are shown. But strangely, only tweets from before 2023 and not in chronological order. I hardly think this is a "stable" feature of twitter but perhaps a sign that they are currently tweaking some things regarding how one can view a profile without being signed in.

I do think it is related to UserAgent. Twitter Inc. doesn't want to lose search results from Google since it could damage its influence. In this case, maybe snscrape could change UA to google to bypass its restrictions.

nerra0pos commented 11 months ago

Thanks for spotting! Yes you can view a profile again but it is very strange. When I google an account and try to access it, not working. When I click on a tweet shown in Google results and from there click on the account, the account's tweets are shown. But strangely, only tweets from before 2023 and not in chronological order. I hardly think this is a "stable" feature of twitter but perhaps a sign that they are currently tweaking some things regarding how one can view a profile without being signed in.

Ah yes you're right. If you first go to a single tweet, then traverse to a profile it will work. But also the Tweets shown seem to be "top" tweets or something, not the latest. But certainly worth a look at the Twitter API endpoints to see what this is all about.

JustAnotherArchivist commented 11 months ago

Your User Agent doesn't change like that. But it isn't based on the Referer header either (which would have behaviour like you described, different results from direct navigation vs from web search results). Rather, one of the API endpoints for profile timelines is accessible again, but the page URL still isn't. So if you already have the Twitter website open and click on a profile name, it just triggers that API request and works, but if you open a profile page directly (or refresh it after such an API load), you get the login wall.

Interesting development, yes. There are some complications with implementing this (since snscrape sometimes has to load the profile page to fetch a token), but that can be worked around. The results are very poor though, and the 'Replies' tab (which the twitter-profile scraper is/was using) as well as tweet threads (twitter-tweet with scroll or recurse mode) are still inaccessible.

Also, I've been too busy with life, so I haven't had time to implement any of the changes yet, and I can't currently give any ETA either.

nerra0pos commented 11 months ago

Syndication for Twitter profile works again with the latest tweets:

https://syndication.twitter.com/srv/timeline-profile/screen-name/nypost?showReplies=true

leockl commented 11 months ago

@nerra0pos is there a Python library which works to scrape this Syndication for Twitter site?

zaferk19 commented 11 months ago

Hello, does the profile scraper still work?

ihabpalamino commented 11 months ago

Hello, does the profile scraper still work?

I don't think so, but you may want to wait for someone to confirm my answer.

ihabpalamino commented 11 months ago

Hello @JustAnotherArchivist, is there still no solution or implementation for this error? Are there any updates?

jessienab commented 11 months ago

Syndication for Twitter profile works again with the latest tweets:

https://syndication.twitter.com/srv/timeline-profile/screen-name/nypost?showReplies=true

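Notes about this:

  • Returned data in the HTML, following full load by the Javascript, is application/json, within <script id="__NEXT_DATA__" type="application/json">{}</script>
  • Only loads about 20 tweets total.

While useless for pulling old data (without crawling through that awful, obfuscated Javascript), new data could be pulled occasionally and consecutively for sources such as newspapers, publishers, etc; no login is needed, nor any login flow issue.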

locfinessemonster commented 11 months ago

Does anyone know if the syndication streams are subject to rate limiting?

slavivanov commented 11 months ago

Syndication for Twitter profile works again with the latest tweets: https://syndication.twitter.com/srv/timeline-profile/screen-name/nypost?showReplies=true

Notes about this:

  • Returned data in the HTML, following full load by the Javascript, is application/json, within <script id="__NEXT_DATA__" type="application/json">{}</script>
  • Only loads about 20 tweets total.

While useless for pulling old data (without crawling through that awful, obfuscated Javascript), new data could be pulled occasionally and consecutively for sources such as newspapers, publishers, etc; no login is needed, nor any login flow issue.

Sometimes, the tweets in syndication are trimmed. Do you know of a working endpoint or page (without login) to get the full tweet given id and the username?
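
Regarding the `__NEXT_DATA__` note quoted above, here is a minimal sketch of pulling that JSON blob out of an already-downloaded syndication page using only the standard library. The class and function names are made up for illustration, and nothing is assumed about the structure inside the JSON, since the thread does not describe it.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">...</script>."""

    def __init__(self):
        super().__init__()
        self._in_target = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._in_target = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_target = False

    def handle_data(self, data):
        # Script bodies arrive here as raw text, possibly in several chunks.
        if self._in_target:
            self._chunks.append(data)

    @property
    def payload(self) -> str:
        return "".join(self._chunks)


def extract_next_data(html: str):
    """Return the decoded __NEXT_DATA__ JSON, or None if the tag is absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    raw = parser.payload
    return json.loads(raw) if raw.strip() else None
```

This only covers extraction; whether the tweets inside are trimmed is a property of the data the page returns, which this sketch does not change.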