JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0
4.49k stars 712 forks

some questions #552

Closed ArchivingToolsForWBM closed 2 years ago

ArchivingToolsForWBM commented 2 years ago

I regularly archive tweets to the Wayback Machine, which means I collect URLs of tweets, images, GIFs and videos (99% of the time these are videos posted on Twitter itself, not external videos).

When I tested the software, it only output tweet URLs. Saving these URLs to the Wayback Machine may save the images too, but the image URLs on the tweet page may point to a downsized version if the image is big enough, meaning the WBM saves only the downsized version. So I'm only saving tweets and potentially downsized images. Twitter DOES store the original, unmodified resolution at https://pbs.twimg.com/media/<base64_string>?format=png&name=orig (note the orig at the end of the URL), along with the smaller resolutions generated for displaying a tweet on smaller screens. Twitter just does not seem to mention this anywhere.

For other types of media, I know Twitter converts GIFs to mp4, but I'm not sure about other conversions. Both videos and GIFs are served as mp4, by the way.

I would like it to output something like this:

https://twitter.com/username/status/123456789012345
 https://pbs.twimg.com/media/<base64_string>?format=png&name=orig
 https://video.twimg.com/tweet_video/<base64_string>.mp4

Note the space before the media content URL.

JustAnotherArchivist commented 2 years ago

This can't be done directly with the snscrape CLI. There are various reasons for this, but mostly the --format syntax just can't cover all the cases of multiple images, retweets, quote tweets, cards, etc. You can certainly do it with a little script though. The information is all extracted and available in the Tweet objects (although the large images use name=large, not name=orig, as that's what Twitter uses for the 'full image' on the web interface and I never saw a difference to orig, but you can replace that in the wrapper obviously if you want). Documentation's still nonexistent, but there are code examples in many of the closed issues that should get you started.

TheTechRobo commented 2 years ago

You could also use the JSONL feature, no? 99% sure that jq can do that.

JustAnotherArchivist commented 2 years ago

Yeah, probably, although I'm not familiar enough with jq to tell what incantations you'd need to use exactly. And yeah, might be preferable to a script.

TheTechRobo commented 2 years ago

Here's what I came up with. For each item in the media field, it extracts the previewUrl and fullUrl.

snscrape --jsonl blablabla | jq '.media[].previewUrl, .media[].fullUrl' | sed 's/"//g'

Keep in mind that the results will be out of order, as it first extracts all the preview URLs, then it extracts all the full URLs, rather than extracting both. This is because the solution to that looks a lot worse (https://stackoverflow.com/a/31418194/9654083)
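
If the ordering matters, a small Python filter over the JSONL output may be easier to read than the jq equivalent. The following is only a sketch: it assumes each JSONL line is an object with an optional `media` list whose items may carry `previewUrl` and `fullUrl` (the field names used in the jq command above), and `iter_paired_urls` is a hypothetical helper name, not part of snscrape.

```python
import json


def iter_paired_urls(lines):
    """Yield previewUrl then fullUrl for each media item, item by item."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        tweet = json.loads(line)
        # "media" may be missing or null for tweets without attachments.
        for medium in tweet.get("media") or []:
            for key in ("previewUrl", "fullUrl"):
                # Some media types may lack one of these keys; skip absent ones.
                url = medium.get(key)
                if url:
                    yield url
```

To use it on real output, one option is to save this in a script that loops over `iter_paired_urls(sys.stdin)` and prints each URL, then pipe the `snscrape --jsonl` output into that script.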

JustAnotherArchivist commented 2 years ago

That wouldn't handle retweets and quoted tweets though. You need some kind of recursion for that. There are tweets that quote a tweet that quotes a tweet, for example. Quoting and retweeting can also be mixed in some ways.

ArchivingToolsForWBM commented 2 years ago

There are a few images, which I can't recall right now, where orig is bigger than large; they must be ridiculously huge, like desktop-background huge.

I'm using Windows 10. I'm fine if it can't get the orig URL, as all I need is the "base URL" (this part only: https://pbs.twimg.com/media/<base64_string>?format=png); I can then use Notepad++ or a script I already made to convert them to orig. It's okay if it can't format them: as long as it gets at least the URL, I can add that to my list to save.

As a side note, does this use Twitter's limited API? I'm worried that it will not obtain all available URLs (my term for tweets that are neither deleted nor private, i.e. not from a user protecting their tweets) if they are 7 or more days old.
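
For reference, the conversion from a base or sized URL to the orig variant can be sketched in a few lines of Python. This is an illustration only: `with_size_name` is a hypothetical helper, `EXAMPLE_ID` is a placeholder rather than a real media ID, and the only assumption about Twitter's URLs is the `name` query parameter described in this thread.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_size_name(url, name="orig"):
    """Return `url` with its `name` query parameter set to `name`."""
    parts = urlsplit(url)
    # Drop any existing name=... so the result carries exactly one.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "name"]
    query.append(("name", name))
    return urlunsplit(parts._replace(query=urlencode(query)))


# EXAMPLE_ID is a placeholder, not a real media ID.
print(with_size_name("https://pbs.twimg.com/media/EXAMPLE_ID?format=png&name=large"))
```

The same call works on a base URL that has no `name` parameter yet, and a different size label can be passed as the second argument.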

TheTechRobo commented 2 years ago

This uses the undocumented, private API as used by the web interface. There are some limitations, but not many:

That's all that's been noticed in the past while (not counting guest token ratelimiting because that's been fixed)

JustAnotherArchivist commented 2 years ago

If you find an example, please let me know. The URL is constructed by snscrape anyway, and I can easily change that of course if it's appropriate. The only reason I went with large is that this is what the Twitter web interface, which snscrape mimics to a degree, uses.

ArchivingToolsForWBM commented 2 years ago

Testing: https://twitter.com/ANN07064061/status/1571210355791040512 From my pc, it's 4096x4096. It's a blank image with the word "TEST" written on it.

large: https://pbs.twimg.com/media/Fc4Pd5YXwAAK0IF?format=png&name=large 2048x2048 (1/2 the original)
orig: https://pbs.twimg.com/media/Fc4Pd5YXwAAK0IF?format=png&name=orig 4096x4096 (this is the exact file I posted)

JustAnotherArchivist commented 2 years ago

Interesting, the web interface actually uses https://pbs.twimg.com/media/Fc4Pd5YXwAAK0IF?format=png&name=4096x4096 there for me, not orig.

ArchivingToolsForWBM commented 2 years ago

The WBM also gets the 4096 version as well: https://web.archive.org/web/20220917185443/https://twitter.com/ANN07064061/status/1571210355791040512

I really don't know how the web interface chooses a resolution automatically. Maybe it downsizes the preview when you look at the tweet on a mobile device (ignoring the thumbnail shown at the top right below the search bar on desktop), but when you click on the image to view the bigger version, it always picks the full resolution using the &name=<width>x<height> notation.

EDIT: Twitter uses something like a "responsive design": based on the view resolution (from what I've seen, it only seems to read the width of the screen), it selects which resolution to use. [GIF: TW_DynamicRes]

Notice that the URL flashes and is briefly highlighted in the inspect-element devtools when it changes. The image also briefly unloads, since the browser has to load a different version of it.

ArchivingToolsForWBM commented 2 years ago

You can certainly do it with a little script though. The information is all extracted and available in the Tweet objects

So, what should I do to make it extract tweets, images, videos and GIFs? I'm not familiar enough with it to modify it to do that. Heck, I only understand the gist of using the command prompt.

JustAnotherArchivist commented 2 years ago

I'm afraid I don't have time to write/test/debug the Python script for you. You'd have to use one of the scrapers in snscrape.modules.twitter (depending on what you want to scrape exactly) and then post-process the Tweet objects to print the relevant URLs, with recursion for quoted tweets and retweets. As a starting point, something like this (untested etc.):

import snscrape.modules.twitter

def print_tweet(tweet):
    print(tweet.url)
    if tweet.media:
        for medium in tweet.media:
            if isinstance(medium, snscrape.modules.twitter.Photo):
                print(medium.fullUrl.rsplit('=', 1)[0] + '=orig')  # Replace =large with =orig
            elif isinstance(medium, (snscrape.modules.twitter.Video, snscrape.modules.twitter.Gif)):
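                # Assumption: each variant has .url and .bitrate (None for playlists),
                # so print the variant with the highest bitrate.
                if medium.variants:
                    best = max(medium.variants, key=lambda variant: variant.bitrate or 0)
                    print(best.url)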
    if tweet.retweetedTweet:
        print_tweet(tweet.retweetedTweet)
    if tweet.quotedTweet:
        print_tweet(tweet.quotedTweet)

scraper = snscrape.modules.twitter.TwitterUserScraper('textfiles')
for tweet in scraper.get_items():
    print_tweet(tweet)

Alternatively, you could do a similar thing with the JSONL output in your favourite language instead.
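
A rough Python sketch of that JSONL route, producing the output format requested at the top of the thread (media URLs indented by one space), might look like the following. It assumes, without having verified it against snscrape, that the JSON keys mirror the Tweet attributes used above (`url`, `media`, `fullUrl`, `variants`, `retweetedTweet`, `quotedTweet`) and that each variant has `url` and `bitrate` keys; `tweet_lines` and `jsonl_to_lines` are hypothetical helper names.

```python
import json


def tweet_lines(tweet):
    """Yield the tweet URL, then its media URLs indented by one space."""
    yield tweet["url"]
    for medium in tweet.get("media") or []:
        if medium.get("fullUrl"):
            # Photo: swap the size suffix (e.g. =large) for =orig.
            yield " " + medium["fullUrl"].rsplit("=", 1)[0] + "=orig"
        elif medium.get("variants"):
            # Video or GIF: take the variant with the highest bitrate.
            best = max(medium["variants"], key=lambda v: v.get("bitrate") or 0)
            yield " " + best["url"]
    # Retweets and quotes can nest, so recurse into both.
    for key in ("retweetedTweet", "quotedTweet"):
        if tweet.get(key):
            yield from tweet_lines(tweet[key])


def jsonl_to_lines(lines):
    """Turn an iterable of JSONL lines into output lines."""
    for line in lines:
        if line.strip():
            yield from tweet_lines(json.loads(line))
```

Feeding it `sys.stdin` and printing each yielded line would let it sit at the end of a `snscrape --jsonl` pipeline.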