Closed ArchivingToolsForWBM closed 2 years ago
This can't be done directly with the snscrape CLI. There are various reasons for this, but mostly the --format syntax just can't cover all the cases of multiple images, retweets, quote tweets, cards, etc. You can certainly do it with a little script though. The information is all extracted and available in the Tweet objects (although the large images use name=large, not name=orig, as that's what Twitter uses for the 'full image' on the web interface and I never saw a difference to orig, but you can replace that in the wrapper obviously if you want). Documentation's still nonexistent, but there are code examples in many of the closed issues that should get you started.
You could also use the JSONL feature, no? 99% sure that jq can do that.
Yeah, probably, although I'm not familiar enough with jq to tell what incantations you'd need to use exactly. And yeah, might be preferable to a script.
Here's what I came up with. For each item in the media field, it extracts the previewUrl and fullUrl.
snscrape --jsonl blablabla | jq '.media[].previewUrl, .media[].fullUrl' | sed 's/"//g'
Keep in mind that the results will be out of order: it first extracts all the preview URLs and then all the full URLs, rather than extracting both for each item in turn. I left it that way because the fix for that looks a lot worse (https://stackoverflow.com/a/31418194/9654083)
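If the ordering matters, pairing the two URLs per media item is fairly short in Python. This is only a sketch: it assumes the JSONL lines are fed in as an iterable of strings and that media items carry the previewUrl and fullUrl keys used in the jq filter above. The function name media_url_pairs is made up for illustration, and items without both keys (presumably videos and GIFs) are skipped.

```python
import json


def media_url_pairs(lines):
    """Yield (previewUrl, fullUrl) for each media item, tweet by tweet."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        # "media" may be missing or null for tweets without attachments.
        for medium in tweet.get("media") or []:
            preview = medium.get("previewUrl")
            full = medium.get("fullUrl")
            if preview and full:
                yield preview, full
```

To use it in a pipe, loop over media_url_pairs(sys.stdin) and print each pair, so the preview and full URL of the same image end up next to each other.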
That wouldn't handle retweets and quoted tweets though. You need some kind of recursion for that. There are tweets that quote a tweet that quotes a tweet, for example. Quoting and retweeting can also be mixed in some ways.
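A rough sketch of that recursion over the JSONL output, in Python. It assumes nested tweets sit under retweetedTweet and quotedTweet keys in each JSON object; those key names are an assumption here, so check them against your own output. walk_tweets and full_urls are hypothetical helper names.

```python
import json


def walk_tweets(tweet):
    """Yield a tweet dict followed by every tweet nested inside it."""
    yield tweet
    # Assumed key names for nested tweets; verify against real output.
    for key in ("retweetedTweet", "quotedTweet"):
        nested = tweet.get(key)
        if nested:
            # Recurse: a quoted tweet can itself quote or retweet another.
            yield from walk_tweets(nested)


def full_urls(jsonl_lines):
    """Yield every fullUrl found in the tweets and their nested tweets."""
    for line in jsonl_lines:
        if not line.strip():
            continue
        for tweet in walk_tweets(json.loads(line)):
            for medium in tweet.get("media") or []:
                if medium.get("fullUrl"):
                    yield medium["fullUrl"]
```

Because walk_tweets calls itself, a tweet that quotes a tweet that quotes a tweet is handled to any depth, and mixed retweet/quote chains go through the same path.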
There are a few images, which I can't recall offhand, where orig is bigger than large. They must be ridiculously huge, like desktop-background huge.
I'm using Windows 10. I'm fine if it can't get the orig version, as all I need is the "base URL" (this part only: https://pbs.twimg.com/media/<base64_string>?format=png); I can then use Notepad++ or a script I already made to convert them to orig. It's okay if it can't format them; as long as it can get at least the URL, I can just have that in my list to save.
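For reference, that conversion step could look something like the following in Python. It is a sketch only, assuming the size is selected purely by the name query parameter, as in the URLs discussed here; with_size is a made-up helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_size(url, size="orig"):
    """Return the media URL with its name= parameter set to `size`."""
    parts = urlsplit(url)
    # Drop any existing name= parameter, keep the rest (e.g. format=png).
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "name"]
    query.append(("name", size))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

It works both on a bare "base URL" ending in ?format=png and on a URL that already has name=large, since any existing name value is replaced rather than appended to.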
As a side note, does this use Twitter's limited API? I'm worried that this will not obtain all available URLs (my term for tweets that are neither deleted nor private, i.e. not from a user protecting his/her tweets) if they are 7 or more days old.
This uses the undocumented, private API as used by the web interface. There are some limitations, but not many:
The twitter-profile scraper includes retweets, but retrieves only 3,200 tweets max.
That's all that's been noticed in the past while (not counting guest token ratelimiting, because that's been fixed).
If you find an example, please let me know. The URL is constructed by snscrape anyway, and I can easily change that of course if it's appropriate. The only reason I went with large is that this is what the Twitter web interface, which snscrape mimics to a degree, uses.
Testing: https://twitter.com/ANN07064061/status/1571210355791040512 From my pc, it's 4096x4096. It's a blank image with the word "TEST" written on it.
large: https://pbs.twimg.com/media/Fc4Pd5YXwAAK0IF?format=png&name=large 2048x2048 (1/2 the original)
orig: https://pbs.twimg.com/media/Fc4Pd5YXwAAK0IF?format=png&name=orig 4096x4096 (this is the exact file I posted)
Interesting, the web interface actually uses https://pbs.twimg.com/media/Fc4Pd5YXwAAK0IF?format=png&name=4096x4096 there for me, not orig.
The WBM gets the 4096 version as well: https://web.archive.org/web/20220917185443/https://twitter.com/ANN07064061/status/1571210355791040512
I really don't know how the web interface automatically chooses a resolution. Maybe it downsizes the preview when looking at the tweet on a mobile device (ignoring the thumbnail if you use the one on the top right below the search bar on desktop), but when you click on it to view the bigger resolution, that always picks the full resolution using the &name=<width>x<height> notation.
EDIT: Twitter uses something like a "responsive design": based on the view resolution (from what I've seen, it seems to only read the width of the screen), it selects which resolution to use.
In the inspect element devtools, the image URL flashes and is briefly highlighted when it changes. The image also briefly unloads, since the browser has to load a different version of it.
You can certainly do it with a little script though. The information is all extracted and available in the Tweet objects
So, what should I do to make it extract tweets, images, videos and GIFs? I'm not too familiar with how to modify it to do that. Heck, I only understand the gist of using the command prompt.
I'm afraid I don't have time to write/test/debug the Python script for you. You'd have to use one of the scrapers in snscrape.modules.twitter (depending on what you want to scrape exactly) and then post-process the Tweet objects to print the relevant URLs, with recursion for quoted tweets and retweets. As a starting point, something like this (untested etc.):
import snscrape.modules.twitter


def print_tweet(tweet):
    """Print the tweet URL and its photo URLs, recursing into nested tweets."""
    print(tweet.url)
    if tweet.media:
        for medium in tweet.media:
            if isinstance(medium, snscrape.modules.twitter.Photo):
                # Replace =large with =orig
                print(medium.fullUrl.rsplit('=', 1)[0] + '=orig')
            elif isinstance(medium, (snscrape.modules.twitter.Video, snscrape.modules.twitter.Gif)):
                # Do something with medium.variants, I guess
                pass
    # Retweets and quotes can nest, so recurse into them
    if tweet.retweetedTweet:
        print_tweet(tweet.retweetedTweet)
    if tweet.quotedTweet:
        print_tweet(tweet.quotedTweet)


scraper = snscrape.modules.twitter.TwitterUserScraper('textfiles')
for tweet in scraper.get_items():
    print_tweet(tweet)
Alternatively, you could do a similar thing with the JSONL output in your favourite language instead.
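For the video and GIF branch that the script above leaves open, one option is to pick a single MP4 variant per medium. The sketch below is hedged heavily: the Variant class is a stand-in written for this example, and its field names (contentType, url, bitrate) are assumptions about what the entries of medium.variants look like, not something confirmed in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    """Stand-in for an entry of medium.variants (field names are assumed)."""
    contentType: str
    url: str
    bitrate: Optional[int] = None


def best_variant_url(variants):
    """Return the URL of the highest-bitrate MP4 variant, or None."""
    mp4s = [v for v in variants if v.contentType == "video/mp4"]
    if not mp4s:
        return None
    # A missing bitrate (assumed possible, e.g. for GIFs) sorts lowest.
    return max(mp4s, key=lambda v: v.bitrate or 0).url
```

If the real objects expose the same attributes, best_variant_url(medium.variants) could replace the placeholder branch; otherwise adjust the attribute names to match.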
I regularly archive tweets to the Wayback Machine, which means I am collecting URLs of tweets, images, GIFs and videos (99% of the time these are videos posted on Twitter, not external videos).
When testing the software, it only output tweet URLs. While saving these URLs to the Wayback Machine may save the images too, the problem is that the image URLs in the tweet may point to a downsized version if the image is big enough (meaning the WBM saves only the downsized version). This means that I'm only saving tweets and potentially downsized images. Twitter DOES store the original, unmodified resolution, with its URL being
https://pbs.twimg.com/media/<base64_string>?format=png&name=orig
(note the orig at the end of the URL), along with the smaller resolutions generated for displaying a tweet on smaller screens. It is just that Twitter does not seem to mention this at all.
For other types of media, I know Twitter converts GIFs to MP4, but I'm not sure about other data conversions. Both videos and GIFs are MP4, btw.
I would like it to output the string like this
Note the space before the media content URL.