Closed TheQuinbox closed 2 years ago
I'm not entirely sure what you're asking here.
snscrape already provides both the short and full URLs, specifically in the tcooutlinks and outlinks attributes of the Tweet object, respectively. On User objects, there's linkTcourl and linkUrl, respectively.
If you want to resolve a known t.co URL instead, you can just make a request to it with whatever tool you like; it's a simple HTTP 301 redirect, as long as you avoid using a browser-like user agent (with one, you get an annoying JavaScript thing instead, because of course you do). For example, with requests in Python: import requests; print(requests.get('https://t.co/3qGkhq3Y6I', allow_redirects=False).headers.get('Location'))
(If you want to find the tweets that contain a known t.co URL, you can search for it: snscrape --jsonl twitter-search 'https://t.co/3qGkhq3Y6I')
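As a rough sketch of the redirect lookup described above, here is a standard-library variant that does not need requests. The function name resolve_redirect, the timeout default, and the User-Agent string are my own choices for illustration, not anything snscrape or Twitter provides. It only reads the Location header of the first response and does not follow further redirects.

```python
import http.client
from urllib.parse import urlsplit

# Status codes that carry a Location header pointing at the target.
REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_redirect(url, timeout=10.0):
    """Return the Location header of the first redirect, or None.

    Hypothetical helper: sends one GET request and does not follow
    the redirect, so only a single hop is resolved.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # Deliberately non-browser-like user agent, since the plain
        # 301 is reportedly only served to such clients.
        conn.request("GET", path, headers={"User-Agent": "python-http-client"})
        response = conn.getresponse()
        if response.status in REDIRECT_CODES:
            return response.getheader("Location")
        return None
    finally:
        conn.close()
```

Calling resolve_redirect('https://t.co/3qGkhq3Y6I') should then return the full URL, assuming t.co still answers non-browser clients with a plain redirect.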
Hi, take the following JSON object:
{"_type": "snscrape.modules.twitter.Tweet", "url": "https://twitter.com/samtupy1/status/670691815473958912", "date": "2015-11-28T19:52:44+00:00", "content": "audio link: a fucking rant https://t.co/dkAyMk6gZe via @ButchullaWoman", "renderedContent": "audio link: a fucking rant audioboom.com/boos/3863344-a\u2026 via @ButchullaWoman", "id": 670691815473958912, "user": {"_type": "snscrape.modules.twitter.User", "username": "samtupy1", "id": 3818126897, "displayname": "Sam Tupy", "description": "Developer of the Survive the Wild audiogame as well as other small titles, self-taught sound designer and programmer with an interest in music.", "rawDescription": "Developer of the Survive the Wild audiogame as well as other small titles, self-taught sound designer and programmer with an interest in music.", "descriptionUrls": null, "verified": false, "created": "2015-09-29T21:12:43+00:00", "followersCount": 497, "friendsCount": 91, "statusesCount": 3823, "favouritesCount": 0, "listedCount": 9, "mediaCount": 98, "location": "Texas, USA", "protected": false, "linkUrl": "http://www.samtupy.com", "linkTcourl": "https://t.co/WvgSAW5aJ0", "profileImageUrl": "https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png", "profileBannerUrl": null, "label": null, "url": "https://twitter.com/samtupy1"}, "replyCount": 0, "retweetCount": 0, "likeCount": 0, "quoteCount": 0, "conversationId": 670691815473958912, "lang": "en", "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>", "sourceUrl": "http://twitter.com", "sourceLabel": "Twitter Web Client", "outlinks": ["https://audioboom.com/boos/3863344-a-fucking-rant?utm_campaign=detailpage&utm_content=retweet&utm_medium=social&utm_source=twitter"], "tcooutlinks": ["https://t.co/dkAyMk6gZe"], "media": null, "retweetedTweet": null, "quotedTweet": null, "inReplyToTweetId": null, "inReplyToUser": null, "mentionedUsers": [{"_type": "snscrape.modules.twitter.User", "username": "ButchullaWoman", "id": 
16019116, "displayname": "DarkChocolateTigressAKAKatkat", "description": null, "rawDescription": null, "descriptionUrls": null, "verified": null, "created": null, "followersCount": null, "friendsCount": null, "statusesCount": null, "favouritesCount": null, "listedCount": null, "mediaCount": null, "location": null, "protected": null, "linkUrl": null, "linkTcourl": null, "profileImageUrl": null, "profileBannerUrl": null, "label": null, "url": "https://twitter.com/ButchullaWoman"}], "coordinates": null, "place": null, "hashtags": null, "cashtags": null}
I want to get the full link of audioboom.com/boos/3863344-a\u2026. \u2026 is actually an ellipsis (meaning that's only a small part of the link).
That's just the renderedContent, which is exactly what you'd see on the web interface, and long URLs are displayed like that there. The actual URL is in outlinks.
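To make that concrete, here is a small sketch that reads one line of snscrape's --jsonl output and pairs each short link with its full counterpart. The helper name link_map is hypothetical, and the pairing assumes that tcooutlinks and outlinks list the same links in the same order, which holds for the record above but is an assumption in general. The sample line is that record trimmed to the two relevant fields.

```python
import json


def link_map(jsonl_line):
    """Map each t.co link in a tweet record to its full URL.

    Assumes tcooutlinks and outlinks are positionally aligned;
    either field may be null when the tweet has no links.
    """
    tweet = json.loads(jsonl_line)
    short_links = tweet.get("tcooutlinks") or []
    full_links = tweet.get("outlinks") or []
    return dict(zip(short_links, full_links))


# The record from this thread, trimmed to the two relevant fields.
sample_line = (
    '{"outlinks": ["https://audioboom.com/boos/3863344-a-fucking-rant'
    '?utm_campaign=detailpage&utm_content=retweet&utm_medium=social'
    '&utm_source=twitter"], "tcooutlinks": ["https://t.co/dkAyMk6gZe"]}'
)

for short, full in link_map(sample_line).items():
    print(short, "->", full)
```

Feeding it each line of a saved --jsonl file gives the unshortened URLs without any extra requests to t.co.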
Oh oh, I see. Sorry for the confusion.
Oftentimes, when I scrape tweets, it's to find an old link that nonetheless still works. However, because of Twitter's stupid URL shortening, this is nearly impossible. Is it possible to make snscrape auto-unshorten these URLs?