404 error false positives

DanDiplo / Diplo.LinkChecker

Link Checker for Umbraco 7


404 error false positives #8

Open jerumschlasses opened 5 years ago

jerumschlasses commented 5 years ago

Hello, I'm trying to figure out why specific, external URLs are causing a 404 error when checked. Server timeout isn't a factor, and the links aren't really broken. Your documentation notes some servers might reject the Link Checker request. Please could you explain that a little more?

DanDiplo commented 5 years ago

Hi. Are the links actually returning a 404, or is it a different error code (such as 405 Method Not Allowed)? There can be a number of reasons for this.

The link checker makes what is called an HTTP HEAD (rather than a normal GET) request. A HEAD request just downloads the headers of the page and not the entire body, so it's much faster to check whether a page exists or not. However, some servers don't accept this request and so return a 405. Popular examples of this include Instagram and LinkedIn.
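To make the HEAD/GET difference concrete, here is a minimal Python sketch (not the Link Checker's own code, which is an Umbraco package) that sends either method and returns only the status code. The function name `request_status` and its parameters are made up for illustration.

```python
import http.client
from urllib.parse import urlsplit


def request_status(url: str, method: str = "HEAD", timeout: float = 10.0) -> int:
    """Send `method` (e.g. "HEAD" or "GET") to `url` and return the status code.

    Hypothetical helper for illustration. Connection errors are not handled
    here and will propagate to the caller.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        connection = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        connection = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        # Rebuild the request target: path plus query string, if any.
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        connection.request(method, target)
        response = connection.getresponse()
        # A HEAD response has no body; for GET this drains the body.
        response.read()
        return response.status
    finally:
        connection.close()
```

Calling `request_status(url, "HEAD")` and `request_status(url, "GET")` for the same URL shows whether a server treats the two methods differently.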

It's also possible that some servers reject the checker because it doesn't appear to be a normal browser - they may have measures to deter what they see as an unauthorised "crawler".
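As an illustration of what "not appearing to be a normal browser" can mean: an HTTP client usually identifies itself in the `User-Agent` header, and Python's `urllib`, for example, sends `Python-urllib/<version>` by default. The sketch below builds a HEAD request with a browser-like header instead. Which headers the Link Checker itself sends is not stated in this thread; the constant and function name are assumptions, and whether a different `User-Agent` helps depends entirely on the server.

```python
import urllib.request

# Example value only: any realistic browser User-Agent string could be used.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_check_request(url: str, user_agent: str = BROWSER_USER_AGENT) -> urllib.request.Request:
    """Build (but do not send) a HEAD request carrying a browser-like User-Agent.

    Hypothetical helper for illustration.
    """
    return urllib.request.Request(
        url,
        method="HEAD",
        headers={"User-Agent": user_agent, "Accept": "*/*"},
    )
```

The returned object can be passed to `urllib.request.urlopen` to actually send the request.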

You can configure it to ignore certain HTTP response codes, so if you get a lot of 405 codes you can tell it to ignore those and report only 404s instead.
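The idea behind ignoring response codes can be sketched as a simple filter over check results. This is only an illustration of the concept, not the package's actual configuration mechanism; `LinkResult` and `filter_reported` are hypothetical names.

```python
from typing import Iterable, List, NamedTuple


class LinkResult(NamedTuple):
    """One checked link: the URL and the HTTP status code it returned."""
    url: str
    status: int


def filter_reported(results: Iterable[LinkResult], ignored_codes: Iterable[int]) -> List[LinkResult]:
    """Keep only error results (status >= 400) whose code is not ignored."""
    ignored = set(ignored_codes)
    return [r for r in results if r.status >= 400 and r.status not in ignored]
```

For example, passing `ignored_codes=[403, 405]` would drop results blocked by a network filter or rejected for using HEAD, while still reporting 404s.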

jerumschlasses commented 5 years ago

Thank you for this explanation. Yes, the error code is 404. I'm filtering several other codes, because of our network's blocking of social media sites--lots of 403s. I love the error code filter feature.

Perhaps a good new feature would be an option for a normal GET request, knowing it takes much longer, applied only after the HEAD request and only to links that returned a 404.

DanDiplo commented 5 years ago

Yes, it's something I've thought about adding. Could you provide the URL of the pages that exist but return a 404, please? Just so I can see if I can figure out why it returns a 404 if it exists. Thanks.

jerumschlasses commented 5 years ago

Here's the URL: https://dev.virtualearth.net/REST/V1/Imagery/Map/Aerial/63.5%2C-159.869/3?mapSize=470,310&format=png&pushpin=60.1077,-149.4438;63;&key=AtmtlwuFfwjzB0QNo0bIMyEEWvtx9F4lJVDM21FIdB2JxoR_ep4-_uC5wTdie4o0

DanDiplo commented 5 years ago

Hi. So it looks like that server doesn't support HEAD requests. If you request the URL using GET the server finds it, but not when you use HEAD. Unfortunately it seems to be misconfigured, because it should be returning a 405 for this rather than a 404.

Next time I work on the tool I'll look at adding a fallback GET request for these "failures".
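A fallback of this kind could look roughly like the sketch below: try HEAD first, and retry with GET only when HEAD returns a code that may be a false failure. This is an assumption about how such a feature might work, not the tool's implementation. `check_with_fallback` is a hypothetical name, `send` stands for any function that sends a request with the given method and returns the status code, and the default retry codes (404 and 405) are taken from the cases discussed in this thread.

```python
from typing import Callable, Iterable, Tuple


def check_with_fallback(
    url: str,
    send: Callable[[str, str], int],
    retry_codes: Iterable[int] = (404, 405),
) -> Tuple[int, str]:
    """Check `url` with HEAD, retrying with GET if HEAD looks like a false failure.

    `send(url, method)` is a caller-supplied function returning a status code.
    Returns the final status code and the method that produced it.
    """
    status = send(url, "HEAD")
    if status in set(retry_codes):
        # The slower GET is only paid for links that failed the HEAD check.
        return send(url, "GET"), "GET"
    return status, "HEAD"
```

With this approach a server that answers HEAD with 404 but GET with 200 is reported as working, while a genuinely missing page still comes back as 404 from the GET as well.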

jerumschlasses commented 5 years ago

Thank you, Dan. I appreciate your digging into the issue.