bellingcat / auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).
https://pypi.org/project/auto-archiver/
MIT License

When .html is in the path screenshot saves as .html #67

Closed: djhmateer closed this issue 1 year ago

djhmateer commented 2 years ago

http://brokenlinkcheckerchecker.com/pagea.html

The screenshot is a PNG, but it gets saved with a .html extension, e.g. wayback_pageb-2022-11-11t10-55-09-277235.html

A simple fix in base_archiver.py is to comment out one line near the bottom of this function:

    def _get_key_from_url(self, url, with_extension: str = None, append_datetime: bool = False):
        """
        Receives a URL and returns a slugified version of the URL path
        if a string is passed in @with_extension the slug is appended with it if there is no "." in the slug
        if @append_datetime is true, the key adds a timestamp after the URL slug and before the extension
        """
        url_path = urlparse(url).path
        path, ext = os.path.splitext(url_path)
        slug = slugify(path)
        if append_datetime:
            slug += "-" + slugify(datetime.datetime.utcnow().isoformat())
        if len(ext):
            slug += ext
        if with_extension is not None:
            # I have a URL with .html in the path, and want the screenshot to be .png
            # e.g. http://brokenlinkcheckerchecker.com/pageb.html
            # I am happy with .html.png as a file extension
            # Commented out the following line to fix this;
            # unsure as to why it is here
            # if "." not in slug:
            slug += with_extension
        return self.get_key(slug)

which then gives wayback_pageb-2022-11-11t10-55-09-277235.html.png
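To make the before/after behaviour concrete, here is a minimal standalone sketch of the key logic. It is not the project's code: `simple_slugify` is a hypothetical stand-in for the `slugify` call, `key_from_url` is a simplified free function without the timestamp or `self.get_key` steps, and the `require_no_dot` flag is only an illustration device for toggling the commented-out check.

```python
import os
import re
from urllib.parse import urlparse


def simple_slugify(text: str) -> str:
    """Hypothetical stand-in for slugify(): lowercase, runs of other characters become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def key_from_url(url: str, with_extension: str = None, require_no_dot: bool = True) -> str:
    """Simplified sketch; require_no_dot=True mimics the original '"." not in slug' check."""
    path, ext = os.path.splitext(urlparse(url).path)
    slug = simple_slugify(path)
    if ext:
        # The extension found in the URL path (e.g. ".html") is put back on the slug.
        slug += ext
    if with_extension is not None:
        # With the check on, a slug that already contains "." never receives with_extension.
        if not require_no_dot or "." not in slug:
            slug += with_extension
    return slug


url = "http://brokenlinkcheckerchecker.com/pageb.html"
print(key_from_url(url, with_extension=".png"))                        # pageb.html
print(key_from_url(url, with_extension=".png", require_no_dot=False))  # pageb.html.png
```

The sketch suggests why the check misfires here: the `.html` taken from the URL path is re-appended to the slug first, so the slug always contains a "." by the time `with_extension` is considered.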

Happy to open a PR if I haven't misunderstood anything here.

loganwilliams commented 1 year ago

This is no longer an issue, thank you for the report!