Open blue-jam opened 2 years ago
A new snapshot release was just made with a fix that now considers the "effective" top-level domain for a URL instead of just the last two parts of the domain. It is using the Public Suffix List as you suggested.
That being said, you will still get a warning/exception. The reason is that your public suffix is or.jp, so the effective top-level domain for your site is ipsj.or.jp (as you expected). That domain is not reachable (timeout) when trying to resolve HSTS with a HEAD request.
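To make the failure mode concrete, here is a minimal Java sketch of a probe of this kind: a HEAD request with a timeout, followed by a look at the Strict-Transport-Security response header. It is not the collector's HstsResolver code. The class and method names (HstsProbe, fetchHstsHeader, includesSubDomains) and the 5-second timeout are assumptions made for illustration, and it needs Java 11 or later for java.net.http.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper, not part of the collector's API.
class HstsProbe {

    /** Sends a HEAD request to https://<domain>/ and returns the HSTS header, if present. */
    static Optional<String> fetchHstsHeader(String domain, Duration timeout)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(timeout)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://" + domain + "/"))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .timeout(timeout)
                .build();
        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());
        return response.headers().firstValue("Strict-Transport-Security");
    }

    /** True when the header value carries the includeSubDomains directive. */
    static boolean includesSubDomains(String headerValue) {
        for (String directive : headerValue.split(";")) {
            if (directive.trim().equalsIgnoreCase("includeSubDomains")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        String domain = args.length > 0 ? args[0] : "ipsj.or.jp";
        try {
            Optional<String> hsts = fetchHstsHeader(domain, Duration.ofSeconds(5));
            if (hsts.isPresent()) {
                System.out.println("HSTS header: " + hsts.get()
                        + " (includeSubDomains=" + includesSubDomains(hsts.get()) + ")");
            } else {
                System.out.println("No HSTS header returned by " + domain);
            }
        } catch (IOException e) {
            // An unreachable domain (for example a connect timeout) ends up here.
            System.out.println("Could not resolve HSTS for " + domain + ": " + e);
        }
    }
}
```

When the domain does not answer, the request fails with an IOException, which is the kind of condition a crawler would report as a warning.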
To ensure only https URLs get crawled for your site, I can think of two options:
1. ipsj.or.jp.
2. Set disableHSTS to true on the GenericHttpFetcher and enforce https using the GenericURLNormalizer
Thank you very much for fixing it.
> That being said, you will still get a warning/exception. The reason is that your public suffix is or.jp, so the effective top-level domain for your site is ipsj.or.jp (as you expected). That domain is not reachable (timeout) when trying to resolve HSTS with a HEAD request.
Actually, the URL I shared was just an example which I randomly picked from sites I was familiar with. However, your suggestions to mitigate another error message are very helpful.
I'm looking forward to a new release with the fix.
Summary
HstsResolver doesn't handle country code second-level domains (e.g. co.jp) well: it emits a WARN log and fails to check HSTS support correctly.
Reproduction
Run a collector with start URL = https://www.ipsj.or.jp/english/index.html.
Actual behavior
HstsResolver tries to communicate with or.jp and emits a WARN message.
Expected behavior
HstsResolver tries to communicate with ipsj.or.jp.
Resources
Public Suffix List: a list of domain suffixes under which Internet users can (or historically could) directly register names (not just country-specific ones). It also provides information about Java libraries.
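The difference between the actual and expected behavior can be sketched in a few lines of Java. This is only an illustration, not the collector's implementation: the class EffectiveTld, its method names, and the tiny hard-coded suffix set are assumptions, and a real resolver would load the full Public Suffix List, including its wildcard and exception rules, which this sketch ignores.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical illustration; names and data are not from the collector.
class EffectiveTld {

    // Hand-picked excerpt standing in for the full Public Suffix List.
    private static final Set<String> PUBLIC_SUFFIXES =
            Set.of("jp", "or.jp", "co.jp", "com", "uk", "co.uk");

    /** Naive approach: keep only the last two labels of the host. */
    static String lastTwoLabels(String host) {
        String[] labels = host.split("\\.");
        if (labels.length < 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    /** Longest matching public suffix plus one more label to its left. */
    static Optional<String> registrableDomain(String host) {
        String normalized = host.toLowerCase(Locale.ROOT);
        if (PUBLIC_SUFFIXES.contains(normalized)) {
            return Optional.empty(); // the host is itself a public suffix
        }
        String[] labels = normalized.split("\\.");
        // i is the index where the candidate suffix starts; a smaller i
        // means a longer suffix, so the first match is the longest one.
        for (int i = 1; i < labels.length; i++) {
            String candidate =
                    String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (PUBLIC_SUFFIXES.contains(candidate)) {
                return Optional.of(labels[i - 1] + "." + candidate);
            }
        }
        return Optional.empty(); // no known suffix matched
    }

    public static void main(String[] args) {
        String host = "www.ipsj.or.jp";
        System.out.println(lastTwoLabels(host));                 // or.jp
        System.out.println(registrableDomain(host).orElse("?")); // ipsj.or.jp
    }
}
```

For production use, a maintained library is likely the safer choice. Guava, for example, exposes this lookup as InternetDomainName.from(host).topPrivateDomain().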