bellingcat / auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).
MIT License

When .html is in the path screenshot saves as .html #67

Closed (djhmateer closed this issue 1 year ago)

djhmateer commented 1 year ago

When the URL path contains .html, the screenshot (a PNG image) is saved with an .html extension, for example wayback_pageb-2022-11-11t10-55-09-277235.html

A simple fix: I commented out one line near the bottom of the function:

    def _get_key_from_url(self, url, with_extension: str = None, append_datetime: bool = False):
        """
        Receives a URL and returns a slugified version of the URL path
        if a string is passed in @with_extension the slug is appended with it if there is no "." in the slug
        if @append_datetime is true, the key adds a timestamp after the URL slug and before the extension
        """
        url_path = urlparse(url).path
        path, ext = os.path.splitext(url_path)
        slug = slugify(path)
        if append_datetime:
            slug += "-" + slugify(datetime.datetime.utcnow().isoformat())
        if len(ext):
            slug += ext
        if with_extension is not None:
            # I have a URL with .html in the path, and want the screenshot to be .png
            # eg
            # am happy with .html.png as a file extension
            # commented out the following line to fix
            # unsure as to why this is here
            # if "." not in slug:
            slug += with_extension
        return self.get_key(slug)

which then gives wayback_pageb-2022-11-11t10-55-09-277235.html.png
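
To make the difference easier to see in isolation, here is a minimal standalone sketch of the two behaviours. It is not the project's code: `simple_slugify` is a hypothetical stand-in for the real `slugify`, `key_from_url` drops the `self.get_key` call and the timestamp, the `skip_if_dotted` flag is invented purely to toggle the guard, and the example URL is made up.

```python
import os
import re
from typing import Optional
from urllib.parse import urlparse


def simple_slugify(text: str) -> str:
    # Hypothetical stand-in for the project's slugify():
    # lowercase, then collapse runs of non-alphanumerics into "-".
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def key_from_url(url: str, with_extension: Optional[str] = None,
                 skip_if_dotted: bool = True) -> str:
    # skip_if_dotted=True mimics the original guard (`if "." not in slug`);
    # skip_if_dotted=False mimics the function with that guard commented out.
    url_path = urlparse(url).path
    path, ext = os.path.splitext(url_path)
    slug = simple_slugify(path)
    if ext:
        slug += ext  # the extension from the URL path is kept, e.g. ".html"
    if with_extension is not None:
        if not (skip_if_dotted and "." in slug):
            slug += with_extension
    return slug


url = "https://example.com/wayback/pageb.html"  # made-up URL
print(key_from_url(url, ".png", skip_if_dotted=True))   # wayback-pageb.html
print(key_from_url(url, ".png", skip_if_dotted=False))  # wayback-pageb.html.png
```

With the guard in place, any URL whose path already ends in an extension keeps that extension and never receives the requested one, which matches the behaviour reported here. URLs without an extension in the path get `.png` either way.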

Happy to open a PR if I haven't misunderstood anything here.

loganwilliams commented 1 year ago

This is no longer an issue, thank you for the report!