aim42 / htmlSanityCheck

Standalone (batch- and command-line) and Gradle-plugin html sanity checker - detects missing images, dead links and cross-references, duplicate link targets (anchors) and the like.
Apache License 2.0

valid URL yields "unknown host with href" message #272

Closed gernotstarke closed 5 years ago

gernotstarke commented 5 years ago

As pointed out by @mernst, the URL "https://douglascayers.com/2015/05/30/how-to-set-custom-java-path-after-installing-jdk-8/" leads to an "unknown host with href" error, although it is syntactically valid AND functioning (HTTP response code 200).

Created a regression test in BrokenHttpLinksCheckerSpec.groovy.

gernotstarke commented 5 years ago

Java wraps a "sun.security.validator.ValidatorException" in the overly general "UnknownHost" exception.

This seems to be rooted in certain (often self-signed) certificates on the web server. Browsers and other clients can cope with that, but Java complains.

In this specific case the cert has been issued by LetsEncrypt.
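To make the wrapped exception visible when debugging, one option is to walk the cause chain of whatever the checker catches. The following is only a sketch: `RootCause` is a hypothetical helper, not existing htmlSanityCheck code, and the depth limit of 32 is an arbitrary assumption.

```java
/** Hypothetical helper (not part of htmlSanityCheck): finds the innermost cause. */
final class RootCause {

    private static final int MAX_DEPTH = 32; // arbitrary guard against cyclic cause chains

    private RootCause() { }

    /** Returns the deepest non-null cause of t, or t itself if it has no cause. */
    static Throwable of(Throwable t) {
        Throwable current = t;
        for (int depth = 0; depth < MAX_DEPTH; depth++) {
            Throwable cause = current.getCause();
            if (cause == null || cause == current) {
                break; // reached the innermost exception
            }
            current = cause;
        }
        return current;
    }
}
```

Logging `RootCause.of(e).getClass().getName()` next to the "unknown host" message would show whether the failure was really a certificate problem.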

I see two possible solutions:

  1. introduce a configurable "whitelist" of URLs that will be excluded from htmlSanityChecks
  2. deep-dive into Java HTTPS protocol handshaking and come up with a general solution.

Considering my current situation, I'll at most go for no. 1 - any help appreciated.
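A minimal sketch of what option 1 could look like, matching on host names. `UrlExclusionList` and its methods are hypothetical names for illustration only and do not describe an existing htmlSanityCheck configuration option.

```java
import java.net.URI;
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical exclusion list: links to these hosts would be skipped by the checker. */
final class UrlExclusionList {

    private final Set<String> excludedHosts = new HashSet<>();

    UrlExclusionList(Collection<String> hosts) {
        for (String host : hosts) {
            // host names are case-insensitive, so normalize once
            excludedHosts.add(host.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns true if the href points to an excluded host. */
    boolean isExcluded(String href) {
        try {
            String host = URI.create(href).getHost();
            return host != null && excludedHosts.contains(host.toLowerCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // syntactically invalid hrefs are never excluded; the regular checks should report them
            return false;
        }
    }
}
```

Matching on the host rather than the full URL keeps the configuration short, at the cost of skipping every link to that site.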

double16 commented 5 years ago

You could configure Java to not validate SSL certs. It's about 2-3 lines of code. I did it some years back, a Google search would probably yield results.

mernst commented 5 years ago

Either solution would be fine with me.

gernotstarke commented 5 years ago

Depending on the User-Agent property given to the HttpURLConnection, the results differ...

In particular, the https://douglascayers website always breaks the SSL connection when given "Mozilla/5.0" as User-Agent, but works fine with "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0".
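For reference, this is roughly how the User-Agent can be set on an HttpURLConnection before the request is sent. It is a sketch, not the checker's actual code: `UserAgentProbe`, `prepare` and the 5-second timeouts are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper: builds a connection with an explicit User-Agent header. */
final class UserAgentProbe {

    /** The longer User-Agent string that worked in the experiment above. */
    static final String FIREFOX_UA =
            "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0";

    private UserAgentProbe() { }

    /** Opens (but does not yet connect) a connection that will send the given User-Agent. */
    static HttpURLConnection prepare(String href, String userAgent) {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(href).toURL().openConnection();
            // request properties must be set before the connection is established
            connection.setRequestProperty("User-Agent", userAgent);
            connection.setConnectTimeout(5000); // assumed timeout, in milliseconds
            connection.setReadTimeout(5000);    // assumed timeout, in milliseconds
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `getResponseCode()` on the returned connection performs the actual request, so the two User-Agent values can be compared against the same URL.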

Still need some more time.

gernotstarke commented 5 years ago

(Hopefully) fixed by adding an insecure, always-trusting TrustManager class... that ignores the SSL certificates.
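For readers unfamiliar with the technique, an always-trusting TrustManager generally looks like the sketch below. This is not the code committed to htmlSanityCheck; the class name `TrustAllCertificates` and its helper methods are assumptions. It disables certificate validation entirely, so it is only acceptable for a link checker that merely tests reachability, never for transferring sensitive data.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

/** INSECURE sketch: accepts every certificate chain without any validation. */
final class TrustAllCertificates implements X509TrustManager {

    @Override
    public void checkClientTrusted(X509Certificate[] chain, String authType) {
        // intentionally empty: never rejects a client certificate
    }

    @Override
    public void checkServerTrusted(X509Certificate[] chain, String authType) {
        // intentionally empty: never rejects a server certificate
    }

    @Override
    public X509Certificate[] getAcceptedIssuers() {
        return new X509Certificate[0];
    }

    /** Builds a TLS context whose only trust manager is this always-trusting one. */
    static SSLContext insecureContext() {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] {new TrustAllCertificates()}, new SecureRandom());
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("could not create TLS context", e);
        }
    }

    /** Applies the insecure socket factory to one connection only, not JVM-wide. */
    static void applyTo(HttpsURLConnection connection) {
        connection.setSSLSocketFactory(insecureContext().getSocketFactory());
    }
}
```

Applying the socket factory per connection, rather than through `HttpsURLConnection.setDefaultSSLSocketFactory`, keeps the relaxed validation from leaking into other code running in the same JVM. Hostname verification is left untouched in this sketch.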

In my tests on the Randoop fork, all external links now test OK (@mernst - please verify).

mernst commented 5 years ago

This works! Thank you.