bellingcat / auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).
https://pypi.org/project/auto-archiver/
MIT License

Archive non-video media (images and sound) #3

Closed: loganwilliams closed this issue 2 years ago

loganwilliams commented 3 years ago

Currently, auto-archiver relies on youtube-dl to download media, which only finds video sources. It would be a significant improvement to download images, and possibly audio as well.

jamesarnall commented 3 years ago

Hi! I saw your "help wanted" tag. What is the user story or expected behavior here? I'm guessing that links to images or sound files are added to the spreadsheet, and the script retrieves and archives those and updates the spreadsheet - just like how video files are currently handled?

loganwilliams commented 3 years ago

Hi @jamesarnall, thanks for your interest in contributing! I apologize that this issue is only a "stub" of an idea so far; design decisions like the expected behavior still need to be worked out. (And you are more than welcome to help us do so, if you would like.)

Currently, from a user's perspective, when a link is added that contains a video that can be fetched with youtube-dl, that video is extracted and uploaded to a Digital Ocean S3 space.

When a link contains, e.g., a Twitter post with an image, the entire linked URL is archived with the Wayback Machine's Save Page Now API. While in many cases this is sufficient, in some circumstances it can result in archiving only a low-resolution image file (if, for example, the full-resolution image is hidden behind a mouse click on the original page and this is not picked up by the Wayback Machine). It would be better if the images were extracted and saved at maximum resolution, perhaps in addition to saving the original URL in the Wayback Machine. There are some existing projects for downloading images from a gallery URL, e.g. https://github.com/mikf/gallery-dl/.
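As a rough illustration, gallery-dl can already do that extraction from the command line (the URL below is only a placeholder):

$ gallery-dl -d ./archive "https://twitter.com/some_user/status/1234567890"

Something similar, driven from the script, could run alongside the Wayback Machine snapshot rather than replacing it.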

Displaying the archived result to the user would require a 1:many relationship between spreadsheet rows and archived items, as also documented in issue #7.

This is my current thinking on this issue, but again I'm open to your thoughts and ideas on how to make this a good user experience.

kernelmethod commented 3 years ago

Hey! I'd also be interested in contributing to this, if I can be of any use.

Unless you're only interested in archiving from a small & pre-determined handful of sites, it'd be pretty challenging to try to archive multiple types of media on top of video, which is already difficult enough to reliably and flexibly do on its own. Do you have any thoughts about establishing a plugin system and/or an API to make it easy to build a new module for extracting e.g. an image from Twitter (or more generally $MEDIA from $SITE)? My sense here is that that'd be the best place to start from to make progress on this issue, but of course you may have had a different approach in mind.

loganwilliams commented 3 years ago

@kernelmethod I agree with that general approach!

Currently, there is a rather hacky set of nested logic (https://github.com/bellingcat/auto-archiver/blob/main/auto_archive.py#L404) that decides how each URL gets archived.

A more extensible system, where a list of handlers could be registered and the first one that knew how to process the target handled each URL, would be great.

You are correct that in the general case, archiving arbitrary media from arbitrary sites is hard. However, like I mentioned in my last comment, there is existing work we can build on, like gallery-dl. Gallery-dl could be its own handler, and that by itself would support dozens more websites.
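To make that concrete, a very rough sketch of the registration idea (all names here are hypothetical, not existing auto-archiver code) could look like this:

from typing import List

class Handler:
    """Hypothetical base class; each handler knows which URLs it can process."""

    def can_handle(self, url: str) -> bool:
        raise NotImplementedError

    def archive(self, url: str) -> None:
        raise NotImplementedError

# handlers are tried in registration order; the first one that claims the URL
# handles it, and anything unclaimed falls through to the current behaviour
HANDLERS: List[Handler] = []

def register(handler: Handler) -> None:
    HANDLERS.append(handler)

def archive_url(url: str) -> None:
    for handler in HANDLERS:
        if handler.can_handle(url):
            handler.archive(url)
            return
    # nothing claimed the URL: fall back to the existing Wayback Machine path
    ...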

kernelmethod commented 3 years ago

A more extensible system, where a list of handlers could be registered and the first one that knew how to process the target handled each URL, would be great.

I think so too. Both youtube-dl and gallery-dl have similar systems: users subclass InfoExtractor or SearchInfoExtractor for the former

https://github.com/ytdl-org/youtube-dl/blob/a8035827177d6b59aca03bd717acb6a9bdd75ada/youtube_dl/extractor/common.py#L87

https://github.com/ytdl-org/youtube-dl/blob/a8035827177d6b59aca03bd717acb6a9bdd75ada/youtube_dl/extractor/common.py#L3023

and one of a few different classes in gallery_dl/extractor/common.py for the latter, e.g. Extractor or GalleryExtractor:

https://github.com/mikf/gallery-dl/blob/5f1b13d1a588574494d3ad3e7d9c45d3d5963c36/gallery_dl/extractor/common.py#L25

https://github.com/mikf/gallery-dl/blob/5f1b13d1a588574494d3ad3e7d9c45d3d5963c36/gallery_dl/extractor/common.py#L421

It might be reasonable to sketch out a similar API that could then call down to one of these, or be extended to archive media from another site.
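As one example of "calling down" to an existing library, a handler in that style could wrap youtube-dl's embedding interface (the handler class itself is made up for illustration; YoutubeDL, its outtmpl option, and download() are youtube-dl's documented embedding API):

import os
import youtube_dl

class YoutubeDLHandler:
    """Hypothetical handler that delegates video extraction to youtube-dl."""

    def can_handle(self, url: str) -> bool:
        # crude placeholder; a real check might consult youtube-dl's own
        # extractor list rather than a hard-coded domain match
        return "youtube.com" in url or "youtu.be" in url

    def download(self, url: str, dest_dir: str) -> None:
        # write downloaded files into dest_dir, named by video id
        opts = {"outtmpl": os.path.join(dest_dir, "%(id)s.%(ext)s")}
        with youtube_dl.YoutubeDL(opts) as ydl:
            ydl.download([url])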

kernelmethod commented 3 years ago

Okay: any initial attempt to make a similar system for this project will certainly start out hopelessly naive, but as a very rough first pass, maybe something like the following would be reasonable?

from argparse import ArgumentParser, Namespace
from typing import IO, Iterator, Protocol, runtime_checkable

class ArchiverResult:
    # Would need to decide what goes into an ArchiverResult
    ...

@runtime_checkable
class AutoArchiverHandler(Protocol):
    # a domain, e.g. www.twitter.com, or protocol + domain, e.g. https://www.twitter.com
    root_url: str

    def __init__(self, args: Namespace) -> None:
        ...

    # Save media at some URL to one or more files / file-like abstractions 
    def download(self, url: str, pipes: Iterator[IO[bytes]]) -> Iterator[ArchiverResult]:
        ...

@runtime_checkable
class AutoArchiverArgsHandler(AutoArchiverHandler, Protocol):
    # Create an argument parser to parse sub-arguments for a given handler.
    # Arguments parsed by this ArgumentParser will then be passed in via
    # __init__ when the handler is instantiated.
    @classmethod
    def create_argparser(cls) -> ArgumentParser:
        ...

The idea here is that pipes would be a generator creating file objects that, upon calling write(), would upload the media to DO, update the Google Sheet, etc. When auto-archiver starts up, it looks through a list of registered handlers and dispatches based on the input URL and which handlers are able to deal with that URL.
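A minimal sketch of what those pipes could look like (the upload callback is a placeholder for the Digital Ocean / Google Sheets plumbing):

from typing import Callable, Iterator

class UploadPipe:
    """Minimal file-like object: write() hands the bytes straight to an upload callback."""

    def __init__(self, key: str, upload: Callable[[str, bytes], None]) -> None:
        self._key = key
        self._upload = upload

    def write(self, data: bytes) -> int:
        self._upload(self._key, data)
        return len(data)

def create_pipes(prefix: str, upload: Callable[[str, bytes], None]) -> Iterator[UploadPipe]:
    # mint a new pipe for each file the handler produces; the handler just
    # calls next(pipes) whenever it has another file to write out
    index = 0
    while True:
        yield UploadPipe(f"{prefix}/{index}", upload)
        index += 1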

So here would be an extremely hacky way of making this work with gallery-dl:

import gallery_dl
import sys
import tempfile
from argparse import ArgumentParser
from pathlib import Path
from typing import BinaryIO, Iterator

# ArchiverResult comes from the protocol sketch above

def create_gallerydl_handler(url: str):
    class GalleryDLHandler:
        root_url = url

        def __init__(self, args):
            ...

        def download(self, url: str, pipes: Iterator[BinaryIO]) -> Iterator[ArchiverResult]:
            with tempfile.TemporaryDirectory() as tmpdir:
                # hack: rebuild argv so gallery_dl.main() sees the URL, the
                # output directory, and any pass-through arguments
                sys.argv = [sys.argv[0]] + [url, "-d", tmpdir] + sys.argv[1:]
                gallery_dl.main()

                # With all of the files downloaded by gallery-dl in the temporary directory, we
                # can now loop over those files and write them to the pipes
                for path in filter(lambda path: path.is_file(), Path(tmpdir).glob("**/*")):
                    with open(path, "rb") as f:
                        data = f.read()
                        next(pipes).write(data)
                    yield ArchiverResult()

        @classmethod
        def create_argparser(cls) -> ArgumentParser:
            return gallery_dl.option.build_parser()

    return GalleryDLHandler

Basically, we'd do something like

$ python auto_archive.py $URL -- $GALLERY_DL_ARGS

and get gallery-dl to automatically download the files to a temporary directory; those files are then uploaded to DO and inserted into Google Sheets. If I took some time to understand how gallery-dl works internally a bit better, I might do something a bit smarter than calling gallery_dl.main(); in any case, though, handlers returned by create_gallerydl_handler would conform to the general AutoArchiverHandler protocol as well as to AutoArchiverArgsHandler.
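For instance, one possibly-smarter route, assuming gallery-dl's internal job API can be relied on (DownloadJob and config.set are internals rather than a stable public API, so their signatures would need checking against the installed version), would be to skip sys.argv entirely:

from pathlib import Path
from typing import List

from gallery_dl import config, job

def download_with_gallery_dl(url: str, dest: str) -> List[Path]:
    # point gallery-dl's output at our own directory instead of its default
    # (assumes the config.set(path, key, value) form used by recent versions)
    config.set(("extractor",), "base-directory", dest)
    job.DownloadJob(url).run()
    # return whatever files gallery-dl wrote, ready to be piped to storage
    return [p for p in Path(dest).glob("**/*") if p.is_file()]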

In any case, here's a quick script that puts these pieces together and could perhaps be integrated into auto-archiver. The two major pieces it's missing right now are:

  1. Logic for selecting which handler will be used (right now I've only created a single handler, handler_cls = create_gallerydl_handler("www.twitter.com"), which is automatically used to handle every input URL.)
  2. Logic for actually performing the upload to Digital Ocean / Google Sheets; right now there's a create_pipes function that just generates NamedTemporaryFiles that can be written to, but in practice you'd want to have a custom file-like object as mentioned earlier.
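On point 2, the actual upload side could be as simple as an S3-compatible put against the Digital Ocean Spaces endpoint, e.g. with boto3 (the endpoint, bucket, and environment variable names below are placeholders); something like this could then serve as the upload callback behind those custom file-like objects:

import os

import boto3

# placeholder configuration; in auto-archiver these would come from its config
SPACE_ENDPOINT = "https://fra1.digitaloceanspaces.com"
SPACE_BUCKET = "my-archive-bucket"

s3 = boto3.client(
    "s3",
    endpoint_url=SPACE_ENDPOINT,
    aws_access_key_id=os.environ["DO_SPACES_KEY"],
    aws_secret_access_key=os.environ["DO_SPACES_SECRET"],
)

def upload_to_space(key: str, data: bytes) -> None:
    # write the bytes under the given key in the Space
    s3.put_object(Bucket=SPACE_BUCKET, Key=key, Body=data)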

Example usage:

$ python3 example.py https://twitter.com/i/user/2976459548 -- --range "1-5"

example.py.txt