JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0
4.4k stars 703 forks

Twitter searches in 'top' order fail #647

Closed JustAnotherArchivist closed 1 year ago

JustAnotherArchivist commented 1 year ago

Since sometime today, 'top' searches fail entirely. The API endpoint returns 404.

brihuang99 commented 1 year ago

This is strange... 'top=True' is returning results on my desktop, but isn't working on my Raspberry Pi. Not sure what's going on, lol.

JustAnotherArchivist commented 1 year ago

Interesting. I tested from several machines all over the globe, and it failed in the same way everywhere.

ckosmic commented 1 year ago

Weirdly enough, running the request including all cookies, headers, and params from the browser through Python's requests.get returns a 404 with a "Sorry, that page does not exist" message, but running it through Node.js axios yields a 200 with the expected response. I don't know what differences there are between the two, but that seems to be the deciding factor.

JustAnotherArchivist commented 1 year ago

@brihuang99 What snscrape version do you have on your desktop?

Ramizworking commented 1 year ago

Interesting. I tested from several machines all over the globe, and it failed in the same way everywhere.

What about just using the Twitter search bar with the since and until commands?

Okay, you won't get them in chronological order, but at least you get the tweets from the last day or the last 7 days.

JustAnotherArchivist commented 1 year ago

@Ramizworking This issue is about snscrape, not Twitter's web interface. I'm aware that they didn't remove the 'top' search, but snscrape can't emulate it currently.

brihuang99 commented 1 year ago

@brihuang99 What snscrape version do you have on your desktop?

I'm running snscrape 0.4.3.20220106. It's the same as what I have on my Raspberry Pi, not sure why only my desktop is working

Ramizworking commented 1 year ago

@Ramizworking This issue is about snscrape, not Twitter's web interface. I'm aware that they didn't remove the 'top' search, but snscrape can't emulate it currently.

[image] It's back!

but it says "your account is not able to perform this action"

JustAnotherArchivist commented 1 year ago

@brihuang99 What are your Python and OpenSSL versions?

brihuang99 commented 1 year ago

@brihuang99 What are your Python and OpenSSL versions?

Python 3.10.6 OpenSSL 3.0.2 15 Mar 2022 (Library: OpenSSL 3.0.2 15 Mar 2022)
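
For anyone else reporting versions here: Python is linked against a specific OpenSSL build, which is not necessarily the same one the `openssl` command on the PATH reports. A minimal sketch for printing what the interpreter itself uses (the helper name `environment_report` is made up for this example):

```python
import platform
import ssl


def environment_report():
    """Return the interpreter and linked TLS library versions as a dict."""
    return {
        "python": platform.python_version(),
        # Version string of the OpenSSL library the ssl module was loaded with
        "openssl": ssl.OPENSSL_VERSION,
        # The same version as a tuple of five integers
        "openssl_version_info": ssl.OPENSSL_VERSION_INFO,
    }


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Running this on both the working and the failing machine should show whether they really use the same OpenSSL version.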

Ramizworking commented 1 year ago

@Ramizworking This issue is about snscrape, not Twitter's web interface. I'm aware that they didn't remove the 'top' search, but snscrape can't emulate it currently.

Unfortunately, I refreshed the page and yeah, they removed it again. Oh okay! I thought it was due to their recent update. I don't know if you've seen it, but they replaced "Home" and "Latest" with "For you" and "Following".

crscheid commented 1 year ago

I'm seeing the "from:" type searches fail here. Did the /2/search/adaptive.json path get removed at Twitter? I don't see it in the documentation anywhere. Seeing a 404 failure on that request here.

Retrieved https://api.twitter.com/2/search/adaptive.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&send_error_codes=true&simple_quoted_tweets=true&q=from%3AAdyen&tweet_search_mode=live&count=100&query_source=spelling_expansion_revert_click&pc=1&spelling_corrections=1&ext=mediaStats%2ChighlightedLabel: 404

JustAnotherArchivist commented 1 year ago

@crscheid Yes, the 404 is what this issue's about, although you're not using the 'top' order.
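
One way to check which search mode a logged request used is to decode its query string. The sketch below uses a shortened copy of the logged URL (only parameters that appear in the log are kept) and a hypothetical helper `search_params`; as far as this thread suggests, `tweet_search_mode=live` is what distinguishes the non-'top' order.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened copy of the logged URL; every parameter here appears in the log.
url = (
    "https://api.twitter.com/2/search/adaptive.json"
    "?tweet_mode=extended&q=from%3AAdyen&tweet_search_mode=live&count=100"
)


def search_params(request_url):
    """Return the decoded query parameters of a request URL as a flat dict."""
    query = urlsplit(request_url).query
    # parse_qs returns a list per key; each key occurs once here, so keep the first value
    return {key: values[0] for key, values in parse_qs(query).items()}


params = search_params(url)
print(params["q"])
print(params["tweet_search_mode"])
```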

crscheid commented 1 year ago

I believe I added top=True after the query field in my call with same result. I’ll check again when I get in front of the code again.

Either way looks like the issue is well known here. I’ll follow the issue for any updates. Thanks!

JoeCello commented 1 year ago

marked as off topic

Ajaxclopidia commented 1 year ago

I realize the issue is regarding the twitter-search module, but for people who are having issues pulling tweets from a user's account: I've been able to use the CLI tool to pull back tweets from a user's account via the 'twitter-profile' submodule. Like many others, I find that the search module fails, so any kind of text search does not work and returns the same errors as for everyone else. However,

snscrape --jsonl --progress "twitter-profile" "some_user_account"

pulls the tweets from an account in reverse chronological order. I get a return of 3243 tweets. Looking at the web, though, that user has about 3900 tweets. I'm aware that the v2 API returns 3200 tweets from a user's profile, so I'm not sure if I added any value here. The particular account goes back to 2015, and the earliest tweets I got were from 2017.

I usually use snscrape just to grab the IDs and then use the tweet IDs with the API to get more detailed info. For instance, the 'impression_count' is returned in the public metrics now, so that is returned where applicable.

Def a bummer the text search isn't working, but for people trying to get tweets from a user account this seems to be working atm.

how can I run this in python or jupyter notebook
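
One possible way to run that command from Python or a notebook is to call the CLI as a subprocess and parse its JSONL output. This is only a sketch: it assumes the `snscrape` executable is installed and on the PATH, and the helper names (`build_command`, `parse_jsonl`, `scrape_profile`) are made up for this example. The `--progress` flag is left out because only standard output is parsed here.

```python
import json
import subprocess


def build_command(username):
    """Build the argument list for the CLI call quoted above (without --progress)."""
    return ["snscrape", "--jsonl", "twitter-profile", username]


def parse_jsonl(text):
    """Turn JSONL output (one JSON object per line) into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def scrape_profile(username):
    """Run snscrape as a subprocess and return the parsed tweets.

    Assumes the snscrape executable is installed and on PATH.
    """
    result = subprocess.run(
        build_command(username), capture_output=True, text=True, check=True
    )
    return parse_jsonl(result.stdout)
```

Calling `scrape_profile("some_user_account")` in a script or a notebook cell would then return a list of dicts, one per tweet.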

laymonage commented 1 year ago

I don't use snscrape, but I encountered the same problem and I think I've figured out the issue: Twitter's API dropped support for HTTP/1.1. You need to use HTTP/2 or up, but requests (which snscrape uses) doesn't support it. You can switch to httpx instead, which supports HTTP/2 with requests-like interface.

In my code, I just had to change import requests to import httpx and requests.get to httpx.get. After that, my API calls started working again.

JustAnotherArchivist commented 1 year ago

@laymonage They definitely still accept HTTP/1.1. Instead, it has to do with TLS fingerprinting. Cf. @AntoinePaix's comment in #648. httpx probably uses different ciphers by default.
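
To illustrate why the cipher configuration can matter: the list of cipher suites a client offers is part of what a server sees during the TLS handshake, so two clients with different lists can look different to it. The sketch below uses only the standard `ssl` module to compare the default list with an overridden one. The cipher string `ECDHE+AESGCM` is an arbitrary example and not the one from #648 or from snscrape, and `make_context` and `cipher_names` are hypothetical helper names.

```python
import ssl

# Illustrative cipher string only; not the one snscrape or #648 uses.
EXAMPLE_CIPHERS = "ECDHE+AESGCM"


def make_context(ciphers=None):
    """Create a default client TLS context, optionally overriding the cipher list."""
    context = ssl.create_default_context()
    if ciphers is not None:
        # set_ciphers() cannot disable TLS 1.3 suites; it changes the older ones
        context.set_ciphers(ciphers)
    return context


def cipher_names(context):
    """List the names of the cipher suites enabled on a context."""
    return [cipher["name"] for cipher in context.get_ciphers()]


default_names = cipher_names(make_context())
restricted_names = cipher_names(make_context(EXAMPLE_CIPHERS))
print(len(default_names), "ciphers by default")
print(len(restricted_names), "ciphers after the override")
```

Getting such a context into requests would need a custom transport adapter, which is left out here.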

nv78 commented 1 year ago

Is there any update on #634, or my issue request? What happened?

Mr-Freewan commented 1 year ago

It works for me (with top=True). Linux, Windows. Machines in different parts of the world.

JustAnotherArchivist commented 1 year ago

@Mr-Freewan Which OpenSSL version?

Mr-Freewan commented 1 year ago

@Mr-Freewan Which OpenSSL version?

Old =) 1.1.1f

JustAnotherArchivist commented 1 year ago

I was also testing with an old 1.1.1 earlier without success. It likely depends on the exact cipher selection, which has changed several times even within the 1.1.1 series I think.

JustAnotherArchivist commented 1 year ago

I have an implementation of @AntoinePaix's suggestion that works on my system, but I want to test different combinations of Python and OpenSSL to make sure it's reliable before committing. Soon™...

Mr-Freewan commented 1 year ago

waiting for this =)

nv78 commented 1 year ago

Me too, is there any update on #650

JustAnotherArchivist commented 1 year ago

The OpenSSL testing is more complicated than I expected. I've made decent progress, but it'll have to wait a bit longer. The good news is that this can all be reused for the test suite in the future.

Neo6163 commented 1 year ago

@JustAnotherArchivist, We are all counting on you. And thanks for all the support you have provided everyone here. Kudos!!

msamancioglu commented 1 year ago

Hi, because Twitter removed the "Top" tab, some params/settings of the requests became invalid. Removing the following settings worked for me (in twitter.py):

  1. remove ('f': 'live') from baseUrl:

super().__init__(baseUrl = 'https://twitter.com/search?src=typed_query&' + urllib.parse.urlencode({'f': 'live', 'lang': 'en', 'q': query, 'sr

  2. remove ('tweet_search_mode': 'live') from paginationParams

What do you think @JustAnotherArchivist ?
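
To make the first change concrete, here is a small sketch of how the search URL differs with and without the 'f': 'live' parameter. It only reuses the parameters visible in the line quoted above (the rest of that line is cut off), and `build_search_url` is a made-up helper, not a function from snscrape. Whether dropping the parameter actually avoids the 404 is not confirmed in this thread.

```python
import urllib.parse


def build_search_url(query, live=True):
    """Build a Twitter web search URL, optionally including 'f': 'live'."""
    params = {"lang": "en", "q": query}
    if live:
        # Put 'f' first to mirror the order in the quoted line
        params = {"f": "live", **params}
    return "https://twitter.com/search?src=typed_query&" + urllib.parse.urlencode(params)


print(build_search_url("COVID since:2021-05-01"))
print(build_search_url("COVID since:2021-05-01", live=False))
```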

Neo6163 commented 1 year ago

@msamancioglu, where should I specify these parameters in the following query? for i,tweet in enumerate(sntwitter.TwitterSearchScraper('COVID since:2021-05-01 until:2021-05-31').get_items()):

kareemrasheed89 commented 1 year ago

It works for me (with top=True). Linux, Windows. Machines in different parts of the world.

How did it work?

kareemrasheed89 commented 1 year ago

@msamancioglu, where should I specify these parameters in the following query? for i,tweet in enumerate(sntwitter.TwitterSearchScraper('COVID since:2021-05-01 until:2021-05-31').get_items()):

Did you get a solution yet?

SomKen commented 1 year ago

The OpenSSL testing is more complicated than I expected. I've made decent progress, but it'll have to wait a bit longer. The good news is that this can all be reused for the test suite in the future.

Any way we can help / give back? I'll contact my local Congress person even!

JustAnotherArchivist commented 1 year ago

@SomKen Good luck with getting them to understand what 'OpenSSL' or 'test suite' even mean. ;-)

Ramizworking commented 1 year ago

@SomKen Good luck with getting them to understand what 'OpenSSL' or 'test suite' even mean. ;-)

Boss, hope you're doing well! How are the tests going so far?

Have a nice day

samigonza3 commented 1 year ago

I'm also waiting for a solution. Thanks!!

dadiaz14 commented 1 year ago

Good afternoon, boss. Is there any solution yet?

rezzamohammad commented 1 year ago

Hey guys, I haven't changed my code in the last two days, apart from retrying with the top=True argument, which still doesn't work. But snscrape itself is working again for now, and I don't understand why: did @JustAnotherArchivist push the latest changes, or did Elon Musk (Twitter CEO) roll back the decision? I haven't tried it on my other machine yet.

However, the raw JSON data looks different from before: I can't format it at https://jsongrid.com/json-formatter for readability because it reports errors such as "bad string".

But awesome work on the whole progress!

Here's the code:

import snscrape.modules.twitter as sntwitter
import datetime
import json

# Search window starts yesterday (YYYY-MM-DD)
date_start = str(datetime.datetime.today().date() - datetime.timedelta(days=1))
hashtags = "solana"

# get_items() yields tweets lazily
scraper = sntwitter.TwitterSearchScraper(f"#{hashtags} since:{date_start}").get_items()

for i_scraping, tweet_data in enumerate(scraper):
    # Round-trip through JSON to get a plain dict
    tweet = json.loads(tweet_data.json())

    print(tweet)

    # Stop after the first six tweets (indices 0 to 5)
    if i_scraping >= 5:
        break

Output json.

{'_type': 'snscrape.modules.twitter.Tweet', 'url': 'https://twitter.com/NFTTalkShow/status/1613640500153815040', 'date': '2023-01-12T20:54:09+00:00', 'rawContent': 'What are the top nft marketplaces and what blockchains do they cater to?\n\n#Web3 #cryptocurrency #nft #nftart #NFTCommunity #ethereum #solana #tezos #4u #opensea #rarible #magiceden https://t.co/wfddUS3Lmb', 'renderedContent': 'What are the top nft marketplaces and what blockchains do they cater to?\n\n#Web3 #cryptocurrency #nft #nftart #NFTCommunity #ethereum #solana #tezos #4u #opensea #rarible #magiceden https://t.co/wfddUS3Lmb', 'id': 1613640500153815040, 'user': {'_type': 'snscrape.modules.twitter.User', 'username': 'NFTTalkShow', 'id': 1428513977634463746, 'displayname': 'NFT Talk Show 🎙️Podcast', 'rawDescription': 'Ranked Top 5 Podcasts for Web3, NFTs and Crypto. Available on Apple, Spotify & more. Mint our community token - https://t.co/ewIvM4aQC7', 'renderedDescription': 'Ranked Top 5 Podcasts for Web3, NFTs and Crypto. Available on Apple, Spotify & more. 
Mint our community token - app.manifold.xyz/c/nfttalkshow', 'descriptionLinks': [{'_type': 'snscrape.modules.twitter.TextLink', 'text': 'app.manifold.xyz/c/nfttalkshow', 'url': 'https://app.manifold.xyz/c/nfttalkshow', 'tcourl': 'https://t.co/ewIvM4aQC7', 'indices': [112, 135]}], 'verified': False, 'created': '2021-08-20T00:27:36+00:00', 'followersCount': 438, 'friendsCount': 47, 'statusesCount': 542, 'favouritesCount': 165, 'listedCount': 7, 'mediaCount': 36, 'location': 'Metaverse', 'protected': False, 'link': {'_type': 'snscrape.modules.twitter.TextLink', 'text': 'nfttalkshow.com', 'url': 'http://nfttalkshow.com', 'tcourl': 'https://t.co/0C0CwxaG4a', 'indices': [0, 23]}, 'profileImageUrl': 'https://pbs.twimg.com/profile_images/1609408253460647937/eeBueATk_normal.jpg', 'profileBannerUrl': 'https://pbs.twimg.com/profile_banners/1428513977634463746/1672548154', 'label': None, 'url': 'https://twitter.com/NFTTalkShow'}, 'replyCount': 0, 'retweetCount': 0, 'likeCount': 0, 'quoteCount': 0, 'conversationId': 1613640500153815040, 'lang': 'en', 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'sourceUrl': 'http://twitter.com/download/iphone', 'sourceLabel': 'Twitter for iPhone', 'links': None, 'media': [{'_type': 'snscrape.modules.twitter.Video', 'thumbnailUrl': 'https://pbs.twimg.com/ext_tw_video_thumb/1613640338580844545/pu/img/QrEklq5Xx6Zn4V78.jpg', 'variants': [{'_type': 'snscrape.modules.twitter.VideoVariant', 'contentType': 'video/mp4', 'url': 'https://video.twimg.com/ext_tw_video/1613640338580844545/pu/vid/480x852/LHyklatMYB91E_g-.mp4?tag=12', 'bitrate': 950000}, {'_type': 'snscrape.modules.twitter.VideoVariant', 'contentType': 'video/mp4', 'url': 'https://video.twimg.com/ext_tw_video/1613640338580844545/pu/vid/320x568/scm1d3eJ-crkPppU.mp4?tag=12', 'bitrate': 632000}, {'_type': 'snscrape.modules.twitter.VideoVariant', 'contentType': 'video/mp4', 'url': 
'https://video.twimg.com/ext_tw_video/1613640338580844545/pu/vid/720x1280/3vMWEXgMttsBa9dV.mp4?tag=12', 'bitrate': 2176000}, {'_type': 'snscrape.modules.twitter.VideoVariant', 'contentType': 'application/x-mpegURL', 'url': 'https://video.twimg.com/ext_tw_video/1613640338580844545/pu/pl/X7HLC2ovv3Lq73Bx.m3u8?tag=12&container=fmp4', 'bitrate': None}], 'duration': 46.433, 'views': 0}], 'retweetedTweet': None, 'quotedTweet': None, 'inReplyToTweetId': None, 'inReplyToUser': None, 'mentionedUsers': None, 'coordinates': None, 'place': None, 'hashtags': ['Web3', 'cryptocurrency', 'nft', 'nftart', 'NFTCommunity', 'ethereum', 'solana', 'tezos', '4u', 'opensea', 'rarible', 'magiceden'], 'cashtags': None, 'card': None}
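
A likely reason for the "bad string" error: the output above is the printed form of a Python dict (single quotes, None, False), which is not valid JSON. Serializing with json.dumps gives text a JSON formatter accepts. A minimal sketch, using a tiny made-up dict in place of a real tweet:

```python
import json

# Hypothetical stand-in for a scraped tweet; real objects have many more fields.
tweet = {"id": 1613640500153815040, "verified": False, "label": None}

as_python_repr = str(tweet)  # what print(tweet) shows: single quotes, False, None
as_json = json.dumps(tweet)  # valid JSON: double quotes, false, null


def is_valid_json(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(as_python_repr))
print(is_valid_json(as_json))
```

In the loop above, printing tweet_data.json() directly, or json.dumps(tweet, indent=2), should give output that can be pasted into a formatter.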

SomKen commented 1 year ago

https://twitter.com/somken/status/1613658821163094017/photo/1

Pretty sure Twitter fixed something... I haven't touched anything

Ramizworking commented 1 year ago

https://twitter.com/somken/status/1613658821163094017/photo/1

Pretty sure Twitter fixed something... I haven't touched anything

Still not working on my side, sir!

JustAnotherArchivist commented 1 year ago

Can't reproduce that. Without the TLS cipher override, it still fails the same way as before. If you are talking about the actual official API, snscrape doesn't use that.

Ramizworking commented 1 year ago

Can't reproduce that. Without the TLS cipher override, it still fails the same way as before. If you are talking about the actual official API, snscrape doesn't use that.

Yeah, not really related to the actual problem. How's it going on our side, boss?

SomKen commented 1 year ago

Can't reproduce that. Without the TLS cipher override, it still fails the same way as before. If you are talking about the actual official API, snscrape doesn't use that.

I'm using "3.0.2-0ubuntu1.7" in an Ubuntu 22.04 container. I'm also using an old version from pip, "snscrape==0.4.3.20220106", but everything is back to normal for me. I'm seeing the same number of tweets coming in as I did before.

I'm still not sure why I had a huge spike in tweets coming in before the API died, but I haven't looked into the tweets returned yet.

[image]

ckosmic commented 1 year ago

Same here. Copying the full request's headers, cookies, params and running through httpx gives me a normal response with a guest session.

JustAnotherArchivist commented 1 year ago

Right, so it's the OpenSSL version/cipher config thing already mentioned early on here, most likely.

Idan-Garay commented 1 year ago

I use Google Colab to scrape Twitter and it works on my end. @brihuang99 also commented:

I'm running snscrape 0.4.3.20220106. It's the same as what I have on my Raspberry Pi, not sure why only my desktop is working

Maybe it has something to do with the Python version?

I'm using these versions and it currently works on my end: Python 3.8.16, snscrape-0.4.3.20220106

[image]

nv78 commented 1 year ago

weird - this still does not work for me

SomKen commented 1 year ago

root@c39d9e04385c:/# pip freeze
beautifulsoup4==4.11.1
certifi==2022.12.7
charset-normalizer==2.1.1
DateTime==4.7
dbus-python==1.2.18
idna==3.4
lxml==4.9.2
pika==1.3.0
PyGObject==3.42.1
PySocks==1.7.1
pytz==2022.7
PyYAML==6.0
requests==2.28.1
snscrape==0.4.3.20220106
soupsieve==2.3.2.post1
timedelta==2020.12.3
tzdata==2022.7
urllib3==1.26.13
zope.interface==5.5.2
root@c39d9e04385c:/# python3 --version
Python 3.10.6

If this helps.

JustAnotherArchivist commented 1 year ago

Testing has finally finished. This was a pain to get running. Looks like there aren't many people testing Python against different versions of OpenSSL...

Without extra measures, it's all about the OpenSSL version; the Python version does not matter, and I expect none of the Python packages do either.

  • Works: OpenSSL 3.0.7 and 1.1.0l
  • Does not work: OpenSSL 1.1.1q and 1.0.2u

My patch works with all tested Python (3.8.16, 3.9.16, 3.10.9, and 3.11.1) and OpenSSL (above) versions, excluding incompatible combinations of course (e.g. Python 3.10+ requires OpenSSL 1.1.1+).

Committing shortly.

PCJ-001 commented 1 year ago

Testing has finally finished. This was a pain to get running. Looks like there aren't many people testing Python against different versions of OpenSSL...

Without extra measures, it's all about the OpenSSL version; the Python version does not matter, and I expect none of the Python packages do either.

  • Works: OpenSSL 3.0.7 and 1.1.0l
  • Does not work: OpenSSL 1.1.1q and 1.0.2u

My patch works with all tested Python (3.8.16, 3.9.16, 3.10.9, and 3.11.1) and OpenSSL (above) versions, excluding incompatible combinations of course (e.g. Python 3.10+ requires OpenSSL 1.1.1+).

Committing shortly.

Still not working