davidfstr / Crystal-Web-Archiver

Downloads websites for long-term archival.
http://dafoster.net/projects/crystal-web-archiver

Can mark resource group as "do not download" #72

Closed: davidfstr closed this issue 9 months ago

davidfstr commented 2 years ago

While downloading http://animeworld.com/, I observed that some patterns of linked URLs accept a connection yet never respond. For example:

These URLs take a long time to attempt to download and then time out (after 10 seconds), slowing the download of the entire project.

It would be nice if there were a way to mark certain URL patterns (i.e. ResourceGroups) in a project as "do not download" so that no download is even attempted.
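To make the idea concrete, here is a minimal sketch of such a check. It is not Crystal's actual implementation: the names `DO_NOT_DOWNLOAD_PATTERNS` and `should_skip_download` are hypothetical, the example pattern is made up, and the patterns use Python's `fnmatch` wildcards rather than whatever syntax ResourceGroups use.

```python
from fnmatch import fnmatchcase
from typing import Sequence

# Hypothetical list of URL patterns marked "do not download".
# These use fnmatch wildcards ("*" matches any run of characters,
# including "/"), which is an assumption made for illustration only.
DO_NOT_DOWNLOAD_PATTERNS = [
    'http://ads.example.com/*',
]

def should_skip_download(
        url: str,
        patterns: Sequence[str] = DO_NOT_DOWNLOAD_PATTERNS) -> bool:
    """Return True if the URL matches any "do not download" pattern."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)
```

A downloader could consult a check like this before opening a connection, so that a matching URL never has a chance to time out. One caveat of using `fnmatch` here is that `?` and `[` are also wildcards, so patterns containing a literal query string would need escaping.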

Related screenshots:

[Screenshot: Screen Shot 2022-07-24 at 11 41 59 AM]
[Screenshot: Screen Shot 2022-07-24 at 11 46 33 AM]

davidfstr commented 1 year ago

Another example: The following domain seems to always respond with HTTP 403 Forbidden:

Here's some Python code, for pasting into Crystal's shell, that ignores this domain manually:

# Ignore links from a particular domain
if True:
    import crystal.model
    import crystal.util.urls

    is_unrewritable_url1 = crystal.util.urls.is_unrewritable_url

    def is_unrewritable_url2(url: str) -> bool:
        return (
            is_unrewritable_url1(url) or
            '.pximg.net/' in url or
            '.pixiv.net/' in url
        )

    crystal.util.urls.is_unrewritable_url1 = is_unrewritable_url1  # save old
    crystal.util.urls.is_unrewritable_url2 = is_unrewritable_url2  # save new
    crystal.util.urls.is_unrewritable_url = crystal.util.urls.is_unrewritable_url2  # set new
    #crystal.util.urls.is_unrewritable_url = crystal.util.urls.is_unrewritable_url1  # restore old

    # Update imported versions of this function (IMPORTANT!)
    crystal.model.is_unrewritable_url = crystal.util.urls.is_unrewritable_url
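
The substring checks above also match when the domain appears elsewhere in a URL, such as in a query string. A stricter variant could compare the parsed hostname instead. This is only a sketch: `IGNORED_DOMAINS` and `is_ignored_domain` are hypothetical names that are not part of Crystal, and unlike the substring check it also matches the bare domain with no subdomain.

```python
from urllib.parse import urlparse

# Hypothetical list of domains to ignore (taken from the workaround above)
IGNORED_DOMAINS = ('pximg.net', 'pixiv.net')

def is_ignored_domain(url: str) -> bool:
    """Return True if the URL's host is an ignored domain or a subdomain of one."""
    # hostname is None for relative or malformed URLs; treat that as no match
    hostname = urlparse(url).hostname or ''
    return any(
        hostname == domain or hostname.endswith('.' + domain)
        for domain in IGNORED_DOMAINS
    )
```

Such a function could replace the two `in url` lines inside `is_unrewritable_url2`, leaving the rest of the monkeypatch unchanged.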

davidfstr commented 9 months ago

Other situations that make this feature useful: