Skallwar / suckit

Suck the InTernet
Apache License 2.0
734 stars 38 forks

Ignore pages that have a 404 status code #82

Open tbillington opened 4 years ago

tbillington commented 4 years ago

Currently suckit saves pages even when the web server reports them as not found (HTTP 404). I think this is erroneous behaviour.

E.g. this page on my site, which returns a 404, was saved to disk.

Chrome dev tools: [screenshot: Screen Shot 2020-05-10 at 8 01 14 pm]

File explorer: [screenshot: Screen Shot 2020-05-10 at 8 01 22 pm]

Skallwar commented 4 years ago

We could keep a single 404 error page per website.

tbillington commented 4 years ago

As long as you're aware that's an opinionated choice :) Some sites have custom 404s per section of the site, some keep the original URL like in my screenshot, some redirect to a dedicated 404 URL, and some show a 404 page with a 200 response. Web crawling is messy!

Perhaps this could be a configuration option, but that's up to you :)
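To make the configuration idea concrete, here is a minimal sketch of what a status-based policy might look like. The names `NotFoundPolicy` and `should_save` are hypothetical and are not part of suckit; the sketch only assumes the crawler knows the numeric HTTP status of each response.

```rust
/// Hypothetical policy for responses with a 404 status (not an existing suckit option).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NotFoundPolicy {
    /// Save every response body regardless of status (the behaviour reported above).
    SaveAll,
    /// Do not write responses with a 404 status to disk.
    Skip404,
}

/// Decides whether a response with the given HTTP status should be written to disk.
pub fn should_save(status: u16, policy: NotFoundPolicy) -> bool {
    match policy {
        NotFoundPolicy::SaveAll => true,
        NotFoundPolicy::Skip404 => status != 404,
    }
}
```

Note that a check like this only covers servers that actually answer with a 404 status. A "not found" page served with a 200 response would still be saved.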

Skallwar commented 3 years ago

A good solution could be to hash the body of each 404 or 200 page. That way, if the page is specific to its URL it is saved as-is; if not, we could make a symbolic link to the generic one.
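A rough sketch of the hashing idea, under some assumptions: `PageIndex`, `Saved`, and `record` are hypothetical names, and the standard library's `DefaultHasher` stands in for whatever hash would really be used (it is not cryptographic, so a real implementation would probably want a stronger digest or a byte-for-byte comparison on a match). The sketch only detects duplicates; creating the symbolic link is left to the caller.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Outcome of recording a page body (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub enum Saved {
    /// First time this body is seen: save it as a regular file.
    Unique,
    /// Same body as an earlier page: the caller could symlink to that path instead.
    DuplicateOf(String),
}

/// Maps a hash of each page body to the first path that produced it.
#[derive(Default)]
pub struct PageIndex {
    seen: HashMap<u64, String>,
}

impl PageIndex {
    /// Records `body` under `path` and reports whether an identical body was seen before.
    pub fn record(&mut self, path: &str, body: &[u8]) -> Saved {
        let mut hasher = DefaultHasher::new();
        body.hash(&mut hasher);
        let digest = hasher.finish();

        if let Some(first) = self.seen.get(&digest) {
            return Saved::DuplicateOf(first.clone());
        }
        self.seen.insert(digest, path.to_string());
        Saved::Unique
    }
}
```

One limitation of this approach: error pages that embed the requested URL or a timestamp in their body hash differently on every request, so they would all be treated as unique.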

tbillington commented 3 years ago

Yea I think it's tricky. If it's legitimately just a bad link to a page that never existed, or an href that was relative when it shouldn't have been, you might hit an infinite loop (I've seen this in practice).
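The loop typically appears when an error page contains a relative link that resolves one level deeper each time it is followed, e.g. `/blog/blog/blog/...`. Every URL is new, so a plain visited set does not stop it. One possible guard, shown here purely as a heuristic sketch with the hypothetical name `has_repeated_segments`, is to reject paths in which the same segment repeats too many times in a row:

```rust
/// Returns true when any path segment occurs more than `max_repeats` times in a row.
/// Heuristic only: a legitimate URL with repeated segments would also be flagged.
pub fn has_repeated_segments(path: &str, max_repeats: usize) -> bool {
    let mut previous: Option<&str> = None;
    let mut run = 0usize;

    // Skip empty segments produced by leading, trailing, or doubled slashes.
    for segment in path.split('/').filter(|s| !s.is_empty()) {
        if Some(segment) == previous {
            run += 1;
        } else {
            previous = Some(segment);
            run = 1;
        }
        if run > max_repeats {
            return true;
        }
    }
    false
}
```

A cap on crawl depth would bound the same loop more bluntly. Either way the threshold is a judgment call.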

Skallwar commented 3 years ago

Hmm, OK. We have more serious issues and very little time at the moment; we will give this a try later.

tbillington commented 3 years ago

Yea no rush :)