JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

All Twitter scrapes are failing: `blocked (404)` #996

Open JustAnotherArchivist opened 1 year ago

JustAnotherArchivist commented 1 year ago

With the exception of twitter-trends, all Twitter scrapes are failing since sometime in the past hour. This is likely connected to Twitter as a whole getting locked behind a login wall since earlier today. There is no known workaround at this time, and it's not known whether this will be fixable.

yeahjack commented 1 year ago

So sad :-( My research project relies heavily on this library, and I want to pay tribute to your effort in maintaining it.

viktorzen commented 1 year ago

Twitter disabled their public website today (2023-06-30) and now requires users to log in; Twitter was public prior to this date. Would it be possible to automate the login as well, by providing a username and password to snscrape, i.e. logging in to Twitter and simulating a logged-in session before calling the GraphQL API?

yeahjack commented 1 year ago

I do not think the developer would do this, as he said that auth would never be added as a feature: see #270. Let's see what solution our great developers come up with; I hope it will not take long.

enzoferey commented 1 year ago

Before using this library, I had started scraping manually with Puppeteer, and I had automated the sign-in part (even through 2FA). The issue is that if you sign in frequently within a short period of time, you get blocked by Twitter and cannot sign in again for a certain amount of time. So I'm not sure what the ideal setup would be in this case...

midnightmagic commented 1 year ago

If this comment is off-topic, please consider deleting it. Uh. It was mentioning Twitter failing in this regard, not you. btw.

midnightmagic commented 1 year ago

Please consider deleting my prior off-topic comment.

Don't nuke this one as off-topic: A Twitter employee says it's temporary:

https://twitter.com/AqueelMiq/status/1674843555486134272 "this is a temporary restriction, we will re-enable logged out twitter access in the near future"

Wouze commented 1 year ago

Elon talked about it too 💀 https://twitter.com/elonmusk/status/1674942336583757825

akanachuu commented 1 year ago

Can I use my personal OAuth key with snscrape's Twitter scraper?

khorg0sh commented 1 year ago

Elon talked about it too 💀 https://twitter.com/elonmusk/status/1674942336583757825

Musk referred to EXTREME scraping, indicating that scrapers may no longer be functional post changes. Let's see how it is done.

akanachuu commented 1 year ago

Can I edit the "twitter.py" module with my own bearer key or even an OAuth login key (locally, on the computer where I installed the snscrape module), since that would only change my local snscrape module? Thanks.

Benniepie commented 1 year ago

Hello,

This may or may not help. Here's a route to access Tweets without logging in (contains further iframe to platform.twitter.com): https://cdn.embedly.com/widgets/media.html?type=text%2Fhtml&key=a19fcc184b9711e1b4764040d3dc5c07&schema=twitter&url=https://twitter.com/elonmusk/status/1674865731136020505

Would combining this with a pre-existing list of Tweets allow data scraping to continue? Alternatively, users could build the tweet list using Google search, e.g. for Tesla tweets: "site:twitter.com/tesla/status", or via another cached list (e.g. the Wayback Machine - https://web.archive.org/web/*/https://twitter.com/tesla/status*)

If I'm off the mark, I apologise but thought I'd pass this on, on the off chance it may help at least as a temporary measure.

Just a note to @JustAnotherArchivist - thank you for the hard work you have put into this library - it is very much appreciated

Ben
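
As a rough illustration of the "cached list" idea, here is a minimal sketch that asks the Wayback Machine CDX search API for archived status URLs of one account and pulls the tweet IDs out of them. It uses only the standard library; the function names are made up for this example, and how complete the archive is for any given account is not something this thread establishes.

```python
import json
import re
import urllib.parse
import urllib.request

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
STATUS_RE = re.compile(r"twitter\.com/([^/?#]+)/status/(\d+)", re.IGNORECASE)


def extract_tweet_id(url, screen_name):
    """Return the numeric tweet ID from a status URL of screen_name, else None."""
    match = STATUS_RE.search(url)
    if match and match.group(1).lower() == screen_name.lower():
        return match.group(2)
    return None


def archived_tweet_ids(screen_name, limit=100):
    """Query the CDX API for archived status URLs and return sorted tweet IDs."""
    query = urllib.parse.urlencode({
        "url": f"twitter.com/{screen_name}/status/*",  # trailing * = prefix match
        "output": "json",
        "fl": "original",       # only return the originally captured URL
        "collapse": "urlkey",   # one row per distinct URL
        "limit": str(limit),
    })
    with urllib.request.urlopen(f"{CDX_ENDPOINT}?{query}", timeout=30) as resp:
        rows = json.load(resp)
    # With output=json the first row is the header row; skip it.
    ids = {extract_tweet_id(row[0], screen_name) for row in rows[1:]}
    return sorted((i for i in ids if i), key=int)


# Example call (performs a network request):
# print(archived_tweet_ids("tesla", limit=50))
```

The resulting IDs could then be fed one by one into whatever single-tweet route still works.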

arfathyahiya commented 1 year ago

URL: https://cdn.syndication.twimg.com/tweet-result

CODE:

import requests

# Syndication endpoint used by embedded tweets; it returns a single tweet.
url = "https://cdn.syndication.twimg.com/tweet-result"

# "id" is the numeric tweet ID to look up.
querystring = {"id": "1652193613223436289", "lang": "en"}

# Browser-like headers, as captured from an embedded-tweet request.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/114.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    # "br" is left out: requests only decodes Brotli if a Brotli package is installed.
    "Accept-Encoding": "gzip, deflate",
    "Origin": "https://platform.twitter.com",
    "Connection": "keep-alive",
    "Referer": "https://platform.twitter.com/",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "cross-site",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "TE": "trailers"
}

# A GET request needs no body; the timeout keeps the call from hanging forever.
response = requests.get(url, headers=headers, params=querystring, timeout=30)

print(response.text)

Generated by Insomnia

yeahjack commented 1 year ago

This seems to be working, the problem might be the rate limit and stability, more tests are needed.
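
Since rate limits are the open concern here, one way to probe them gently is to wrap the request in a retry loop with exponential backoff. The sketch below is only an assumption about how one might do that: it uses the standard library instead of requests so it is self-contained, the helper names are invented, and which status codes the endpoint actually returns when throttling has not been confirmed in this thread.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

TWEET_RESULT_URL = "https://cdn.syndication.twimg.com/tweet-result"


def http_get(url):
    """Plain GET returning the response body as text."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=30) as resp:
        return resp.read().decode("utf-8")


def fetch_tweet(tweet_id, get=http_get, retries=4, base_delay=2.0, sleep=time.sleep):
    """Fetch one tweet as a dict, backing off exponentially on HTTP 429 and 5xx."""
    query = urllib.parse.urlencode({"id": tweet_id, "lang": "en"})
    url = f"{TWEET_RESULT_URL}?{query}"
    for attempt in range(retries + 1):
        try:
            return json.loads(get(url))
        except urllib.error.HTTPError as err:
            # Assumption: throttling shows up as 429 or a 5xx status.
            retryable = err.code == 429 or 500 <= err.code < 600
            if not retryable or attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...


# Example call (performs a network request):
# print(fetch_tweet("1652193613223436289"))
```

The `get` and `sleep` parameters exist so the retry logic can be exercised without touching the network.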

dadiaz1424 commented 1 year ago

It does not let you see all the accounts a user follows either. Is there a solution for that? Can anyone help me?

Write commented 1 year ago

https://twitter.com/elonmusk/status/1675187969420828672

😂

@elonmusk To address extreme levels of data scraping & system manipulation, we’ve applied the following temporary limits:

  • Verified accounts are limited to reading 6000 posts/day
  • Unverified accounts to 600 posts/day
  • New unverified accounts to 300/day

Fa5g commented 1 year ago

My IP was banned even though I was using a proxy that changes the IP dynamically. What options do we have now?

MazenTayseer commented 1 year ago

@JustAnotherArchivist Will the scrapers be working again anytime soon? Also, I want to thank you for your hard work on these scrapers.

Fa5g commented 1 year ago

Scraping seems to be still possible, check this:

https://rss-bridge.org/bridge01/?action=display&bridge=TwitterBridge&context=By+username&u=elonmusk&format=html

https://rss-bridge.org/bridge01/?action=display&bridge=TwitterBridge&context=By+username&u=elonmusk&format=json

By https://github.com/RSS-Bridge/rss-bridge

Write commented 1 year ago

Scraping seems to be still possible, check this:

https://rss-bridge.org/bridge01/?action=display&bridge=TwitterBridge&context=By+username&u=elonmusk&format=html

https://rss-bridge.org/bridge01/?action=display&bridge=TwitterBridge&context=By+username&u=elonmusk&format=json

By https://github.com/RSS-Bridge/rss-bridge

While cool, it uses API v1, so you can't get long tweets.

MrCube21 commented 1 year ago

hi guys im new to github and coding but maybe this is helpful

https://twitter.com/iam4x/status/1675194767854956546?s=20

Write commented 1 year ago

hi guys im new to github and coding but maybe this is helpful

https://twitter.com/iam4x/status/1675194767854956546?s=20

This hasn't worked for a long time.

MahmuudNabil commented 1 year ago

What about using Selenium first to log in, and then using sntwitter to get tweets? The question is how to link the Selenium session with sntwitter.

erikcas commented 1 year ago

hi guys im new to github and coding but maybe this is helpful https://twitter.com/iam4x/status/1675194767854956546?s=20

This doesn't work since a long time ago.

lol, this seemed to be working. Nah, never mind: it was fun for a few minutes, but it messes up the rest of the features, so no after all.

ohhdemgirls commented 1 year ago

what about using Selenium first to make a login after that use Sntwitter to get tweets? the question here is how can link between Selenium session with Sntwitter?

The beauty of snscrape is that it doesn't require authentication, if we're going to have to start using login/auth and tools like Selenium then it should be spun off into another project and not snscrape. Also using any form of auth gives twitter another way to ban mass collection which is the use case for many users of snscrape.

PanMiko commented 1 year ago

Hi! :)) It works great! Is there perhaps any way to scrape repost and comment data as well? I need to map how tweets spread for my master's thesis, but what companies like Twitter and Reddit have been doing with their APIs lately is terrible...

saad-15art commented 1 year ago

Hi! :)) It works great! Is there perhaps any way to scrape repost and comment data as well? I need a mapping of twitt spread for my master thesis, but what companies are doing lately with their API (like Twitter or Reddit) is terrible....

You are describing my situation right now. I need the comments for the same purpose, so please let me know when you find a solution. My submission is due in September.

IrtzaShahan commented 1 year ago

what about using Selenium first to make a login after that use Sntwitter to get tweets? the question here is how can link between Selenium session with Sntwitter?

The beauty of snscrape is that it doesn't require authentication, if we're going to have to start using login/auth and tools like Selenium then it should be spun off into another project and not snscrape. Also using any form of auth gives twitter another way to ban mass collection which is the use case for many users of snscrape.

So you would rather have it completely stop working for all other use cases as well?

TheTechRobo commented 1 year ago

@IrtzaShahan #270

nerra0pos commented 1 year ago

It would be great if snscrape added a new function, something like TwitterProfileScraperSyn, that grabs the tweet data from the still publicly available syndication profile feeds. The syndication feed shows 20 tweets, which is good enough for many applications.

Miandari commented 1 year ago

Insomnia

Great!

Is there any other param I can put in querystring besides the tweet id? I want to get tweets for specific users, but I can't find which params I should use.

JustAnotherArchivist commented 1 year ago
ohhdemgirls commented 1 year ago

what about using Selenium first to make a login after that use Sntwitter to get tweets? the question here is how can link between Selenium session with Sntwitter?

The beauty of snscrape is that it doesn't require authentication, if we're going to have to start using login/auth and tools like Selenium then it should be spun off into another project and not snscrape. Also using any form of auth gives twitter another way to ban mass collection which is the use case for many users of snscrape.

So you would rather have it completely stop working for all other use cases as well?

Yes (for twitter), and I expressed why and so has JustAnotherArchivist#issuecomment-1616774736 / #270

TianzhuQin commented 1 year ago

May I ask how we can currently get a specific user's tweets between a start time and an end time? I'm really in a hurry and have no clue...

Also, this one seems to have no params for the screen name. Are there other URLs? https://cdn.syndication.twimg.com/tweet-result

Thank you for all your help, and much praise to the author @JustAnotherArchivist

ChowSings commented 1 year ago

Broken by Musk

ihabpalamino commented 1 year ago

I hope a solution is found soon. I really need this library for my final-year project; otherwise I could fail...

pleblira commented 1 year ago

Does anyone know if someone's working on a snscrape fork that implements login/auth for Twitter?

Really appreciate your work, JustAnotherArchivist, thank you for all you do. Hoping Elon pulls back some of the restriction and we can have snscrape working as original! Best wishes

codilau commented 1 year ago

@pleblira this library uses the SNScrape classes for User and Tweet and supports auth https://github.com/vladkens/twscrape

nbrahmani commented 1 year ago

What's the URL to see the user profile? Sorry if it's a dumb question, but I could not find any reference on the net.

nickchen120235 commented 1 year ago

@nbrahmani You can try

https://syndication.twitter.com/srv/timeline-profile/screen-name/[username]

For example, https://syndication.twitter.com/srv/timeline-profile/screen-name/elonmusk

User info will be stored inside the <script id="__NEXT_DATA__"> tag. The tag itself is server-side rendered, so you can use requests with BeautifulSoup (assuming you are using Python) to extract the data you need. You can get the user profile and up to 20 of the most recent tweets from that user.

Unfortunately this endpoint has been dead for the last 16 or 17 hours.
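
For anyone who wants to try this if the endpoint comes back, here is a minimal sketch of the extraction step. It swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages; the class and function names are invented for this example, and it only parses HTML you have already downloaded.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html):
    """Return the parsed JSON from the __NEXT_DATA__ script tag, or None."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks)
    return json.loads(raw) if raw else None


# Example, given page HTML fetched from the timeline-profile URL above:
# data = extract_next_data(page_html)
```

With BeautifulSoup the same step would be a lookup of the script tag by its id followed by `json.loads` on its contents.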

nbrahmani commented 1 year ago

@nickchen120235

I tried this, but I need the twitter blue status of a user, and this does not return that.

nickchen120235 commented 1 year ago

@nickchen120235

I tried this, but I need the twitter blue status of a user, and this does not return that.

There's a boolean is_blue_verified or something similar in the user key iirc. Maybe that's what you need?

nbrahmani commented 1 year ago

@nickchen120235 I tried this, but I need the twitter blue status of a user, and this does not return that.

There's a boolean is_blue_verified or something similar in the user key iirc. Maybe that's what you need?

As far as I can see, it does not have that boolean. I get the following output:

{"props":{"pageProps":{"contextProvider":{"features":{},"scribeData":{"client_version":null,"dnt":false,"widget_id":"embed-0","widget_origin":"","widget_frame":"","widget_partner":"","widget_site_screen_name":"","widget_site_user_id":"","widget_creator_screen_name":"","widget_creator_user_id":"","widget_iframe_version":"bb06567:1687853948269","widget_data_source":"screen-name:elonmusk","session_id":""},"messengerContext":{"embedId":"embed-0"},"hasResults":true,"lang":"en","theme":"light"},"lang":"en","maxHeight":null,"showHeader":true,"hideBorder":false,"hideFooter":false,"hideScrollBar":false,"transparent":false,"timeline":{"entries":[]},"headerProps":{"screenName":"elonmusk"}},"__N_SSP":true},"page":"/timeline-profile/screen-name/[screenName]","query":{"screenName":"elonmusk"},"buildId":"vn5fUacsNpP-nIkFRlFf6","assetPrefix":"https://platform.twitter.com","isFallback":false,"gssp":true,"customServer":true}

nickchen120235 commented 1 year ago

@nbrahmani Sorry for the confusion 😓

As I mentioned earlier this endpoint is dead, so it's no longer outputting the correct response.

If it were working, the info you need would be in the user key in one of the entries.
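
A small sketch of how one might check for this condition in code: the path `props.pageProps.timeline.entries` is taken from the output posted above, while the exact nesting of the `user` key inside a live entry is not shown anywhere in this thread, so the search below simply walks the whole structure. The function names are made up for illustration.

```python
def timeline_entries(next_data):
    """Return the timeline entries list, or [] if the path is missing."""
    return (
        next_data.get("props", {})
        .get("pageProps", {})
        .get("timeline", {})
        .get("entries", [])
    )


def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


# Example, given `data` parsed from the __NEXT_DATA__ script tag:
# entries = timeline_entries(data)
# if not entries:
#     print("endpoint returned no entries (dead or empty timeline)")
# else:
#     users = list(find_key(entries, "user"))
```

An empty `entries` list, as in the output above, means there is no `user` object to inspect at all.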

ihabpalamino commented 12 months ago

hello guys hello @JustAnotherArchivist any update about the issue?

nickchen120235 commented 12 months ago

AFAIK

  1. The login wall is still there.
  2. Single embedded tweet works, but embedded timeline doesn't. (You can try at https://publish.twitter.com)
  3. Authentication won't be implemented anyway.

prasunshrestha commented 12 months ago

Hello, This may or may not help. Here's a route to access Tweets without logging in (contains further iframe to platform.twitter.com): https://cdn.embedly.com/widgets/media.html?type=text%2Fhtml&key=a19fcc184b9711e1b4764040d3dc5c07&schema=twitter&url=https://twitter.com/elonmusk/status/1674865731136020505 Would combining this with a pre-existing list of Tweets allow data scraping to continue? Alternatively users could build the tweet list using google search, e.g. for Tesla tweets: "site:twitter.com/tesla/status" or via another cached list (e.g. Waybackmachine - https://web.archive.org/web//https://twitter.com/tesla/status) If I'm off the mark, I apologise but thought I'd pass this on, on the off chance it may help at least as a temporary measure. Just a note to @JustAnotherArchivist - thank you for the hard work you have put into this library - it is very much appreciated Ben

URL: https://cdn.syndication.twimg.com/tweet-result

CODE:

import requests

url = "https://cdn.syndication.twimg.com/tweet-result"

querystring = {"id":"1652193613223436289","lang":"en"}

payload = ""
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/114.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Origin": "https://platform.twitter.com",
    "Connection": "keep-alive",
    "Referer": "https://platform.twitter.com/",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "cross-site",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "TE": "trailers"
}

response = requests.request("GET", url, data=payload, headers=headers, params=querystring)

print(response.text)

Generated by Insomnia

Can we get the IDs of the post generated by a specific profile? If a single embedded tweet is working, a for-loop through all the IDs will work in the interim. Thank you!
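
The for-loop part could look roughly like the sketch below, assuming a list of IDs is already available (how to obtain that list is the open question). `fetch_one` is a hypothetical callable supplied by the caller, for example a function wrapping the requests snippet quoted above; the pause between requests is a guess, not a known safe rate.

```python
import time


def fetch_many(tweet_ids, fetch_one, delay=1.0, sleep=time.sleep):
    """Call fetch_one(tweet_id) for each ID; return (results, failed).

    results maps tweet ID -> whatever fetch_one returned,
    failed maps tweet ID -> a short description of the error.
    """
    results, failed = {}, {}
    for index, tweet_id in enumerate(tweet_ids):
        if index:
            sleep(delay)  # pause between consecutive requests
        try:
            results[tweet_id] = fetch_one(tweet_id)
        except Exception as err:  # keep going; one bad ID should not stop the run
            failed[tweet_id] = repr(err)
    return results, failed


# Example with a caller-supplied fetch function:
# results, failed = fetch_many(["1652193613223436289"], my_fetch_function)
```

Collecting failures separately makes it possible to retry only those IDs later.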

ihabpalamino commented 12 months ago

what is this code for ? @prasunshrestha

ghost commented 12 months ago

As of now it seems to be possible to view public tweets without logging in. Wayback Machine can save tweet pages again. Current snscrape scraping methods still return 404, so it's likely that API endpoints or something else has been changed.

Can't confirm anything more than that for now.

JustAnotherArchivist commented 12 months ago

Yes, it is a different endpoint which only returns the single requested tweet, no replies or the replied-to tweet.

ihabpalamino commented 12 months ago

Yes, it is a different endpoint which only returns the single requested tweet, no replies or the replied-to tweet.

Is it already implemented? If yes, which version should I update to?