JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0
4.47k stars, 708 forks

Unshorten Twitter links automatically #439

Closed TheQuinbox closed 2 years ago

TheQuinbox commented 2 years ago

Oftentimes, when I scrape tweets, it's to find an old link that would still work. However, because of Twitter's stupid URL shortening, this is nearly impossible. Is it possible to make snscrape automatically unshorten these URLs?

JustAnotherArchivist commented 2 years ago

I'm not entirely sure what exactly you're asking here.

snscrape already provides both the short and full URLs, specifically in the tcooutlinks and outlinks attributes of the Tweet object, respectively. On User objects, there's linkTcourl and linkUrl, respectively.

If you want to resolve a known t.co URL instead, you can just make a request to it with whatever tool you like; it's a simple HTTP 301 redirect, as long as you avoid using a browser-like user agent (with one, you get an annoying JavaScript thing instead, because of course you do). For example, with requests in Python: import requests; print(requests.get('https://t.co/3qGkhq3Y6I', allow_redirects=False).headers.get('Location'))
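The one-liner above can be wrapped into a small reusable helper. This is only a minimal sketch: the function name resolve_tco and the get and timeout parameters are hypothetical and not part of snscrape or requests. It assumes the default requests user agent, which is not browser-like, so t.co should answer with the plain redirect described above.

```python
def resolve_tco(short_url, get=None, timeout=10):
    """Return the redirect target of a t.co URL, or None if there is none.

    `get` is a hypothetical hook (not a requests feature) that lets you pass
    in a stub or a session's `get` method; it defaults to `requests.get`.
    """
    if get is None:
        # Imported lazily so the helper can be exercised with a stub
        # even where requests is not installed.
        import requests
        get = requests.get
    # allow_redirects=False stops requests from following the redirect,
    # so the Location header of the t.co response itself can be read.
    response = get(short_url, allow_redirects=False, timeout=timeout)
    return response.headers.get('Location')
```

Calling resolve_tco('https://t.co/3qGkhq3Y6I') performs the same request as the one-liner. A return value of None means the response carried no Location header, for example because the short link no longer exists.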

(If you want to find the tweets that contain a known t.co URL, you can search for it: snscrape --jsonl twitter-search 'https://t.co/3qGkhq3Y6I')

TheQuinbox commented 2 years ago

Hi, take the following JSON object:

{"_type": "snscrape.modules.twitter.Tweet", "url": "https://twitter.com/samtupy1/status/670691815473958912", "date": "2015-11-28T19:52:44+00:00", "content": "audio link: a fucking rant https://t.co/dkAyMk6gZe via @ButchullaWoman", "renderedContent": "audio link: a fucking rant audioboom.com/boos/3863344-a\u2026 via @ButchullaWoman", "id": 670691815473958912, "user": {"_type": "snscrape.modules.twitter.User", "username": "samtupy1", "id": 3818126897, "displayname": "Sam Tupy", "description": "Developer of the Survive the Wild audiogame as well as other small titles, self-taught sound designer and programmer with an interest in music.", "rawDescription": "Developer of the Survive the Wild audiogame as well as other small titles, self-taught sound designer and programmer with an interest in music.", "descriptionUrls": null, "verified": false, "created": "2015-09-29T21:12:43+00:00", "followersCount": 497, "friendsCount": 91, "statusesCount": 3823, "favouritesCount": 0, "listedCount": 9, "mediaCount": 98, "location": "Texas, USA", "protected": false, "linkUrl": "http://www.samtupy.com", "linkTcourl": "https://t.co/WvgSAW5aJ0", "profileImageUrl": "https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png", "profileBannerUrl": null, "label": null, "url": "https://twitter.com/samtupy1"}, "replyCount": 0, "retweetCount": 0, "likeCount": 0, "quoteCount": 0, "conversationId": 670691815473958912, "lang": "en", "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>", "sourceUrl": "http://twitter.com", "sourceLabel": "Twitter Web Client", "outlinks": ["https://audioboom.com/boos/3863344-a-fucking-rant?utm_campaign=detailpage&utm_content=retweet&utm_medium=social&utm_source=twitter"], "tcooutlinks": ["https://t.co/dkAyMk6gZe"], "media": null, "retweetedTweet": null, "quotedTweet": null, "inReplyToTweetId": null, "inReplyToUser": null, "mentionedUsers": [{"_type": "snscrape.modules.twitter.User", "username": "ButchullaWoman", "id": 16019116, "displayname": "DarkChocolateTigressAKAKatkat", "description": null, "rawDescription": null, "descriptionUrls": null, "verified": null, "created": null, "followersCount": null, "friendsCount": null, "statusesCount": null, "favouritesCount": null, "listedCount": null, "mediaCount": null, "location": null, "protected": null, "linkUrl": null, "linkTcourl": null, "profileImageUrl": null, "profileBannerUrl": null, "label": null, "url": "https://twitter.com/ButchullaWoman"}], "coordinates": null, "place": null, "hashtags": null, "cashtags": null}

I want to get the full link of audioboom.com/boos/3863344-a\u2026. The \u2026 is actually an ellipsis, meaning that's only a small part of the link.

JustAnotherArchivist commented 2 years ago

That's just the renderedContent, which is exactly what you'd see on the web interface, and long URLs are displayed like that there. The actual URL is in outlinks.
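If you want the tweet text with the full URLs in place, you can combine content, tcooutlinks, and outlinks from the JSONL output yourself. The following is a rough sketch, not something snscrape provides: the function name expand_links is hypothetical, the field names are taken from the JSON object posted above, and it assumes that tcooutlinks and outlinks list the same links in the same order. The record is trimmed to the three fields the sketch needs.

```python
import json


def expand_links(tweet):
    """Return the tweet text with each t.co link replaced by its full URL.

    Assumes `tcooutlinks` and `outlinks` are parallel lists (same order);
    either may be null in the JSON when the tweet has no links.
    """
    text = tweet["content"]
    shorts = tweet.get("tcooutlinks") or []
    fulls = tweet.get("outlinks") or []
    for short, full in zip(shorts, fulls):
        text = text.replace(short, full)
    return text


# One line of `snscrape --jsonl` output, trimmed to the relevant fields
# of the record posted in this thread.
line = (
    '{"content": "audio link: a fucking rant https://t.co/dkAyMk6gZe via @ButchullaWoman", '
    '"outlinks": ["https://audioboom.com/boos/3863344-a-fucking-rant'
    '?utm_campaign=detailpage&utm_content=retweet&utm_medium=social&utm_source=twitter"], '
    '"tcooutlinks": ["https://t.co/dkAyMk6gZe"]}'
)
tweet = json.loads(line)
print(expand_links(tweet))
```

This works on content rather than renderedContent because the latter contains the truncated display form of the URL, which cannot be mapped back to the full link reliably.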

TheQuinbox commented 2 years ago

Oh oh, I see. Sorry for the confusion.