Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

HstsResolver mishandles country code second-level domains #785

Open blue-jam opened 2 years ago

blue-jam commented 2 years ago

Summary

HstsResolver does not handle country code second-level domains (e.g. co.jp) correctly: it emits a WARN log and fails to check HSTS support for the site.

Reproduction

Run a collector with start URL = https://www.ipsj.or.jp/english/index.html.

Actual behavior

HstsResolver tries to communicate with or.jp and emits a WARN message:

WARN HstsResolver - Attempt to verify if the site supports Strict-Transport-Security (HSTS) failed for domain "or.jp". We'll assumume HSTS is not supported for all URLs on that domain.
  java.net.UnknownHostException: co.jp: No address associated with hostname
  at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) ~[?:?]
  at java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929) ~[?:?]
  at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1519) ~[?:?]
  at java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848) ~[?:?]
  at java.net.InetAddress.getAllByName0(InetAddress.java:1509) ~[?:?]
  at java.net.InetAddress.getAllByName(InetAddress.java:1368) ~[?:?]
  at java.net.InetAddress.getAllByName(InetAddress.java:1302) ~[?:?]
  at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.13.jar!/:4.5.13]
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.13.jar!/:4.5.13]
  at com.norconex.collector.http.fetch.util.HstsResolver.lambda$resolveHstsSupport$1(HstsResolver.java:105) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at java.util.HashMap.computeIfAbsent(HashMap.java:1134) ~[?:?]
  at com.norconex.collector.http.fetch.util.HstsResolver.resolveHstsSupport(HstsResolver.java:100) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.http.fetch.util.HstsResolver.resolve(HstsResolver.java:77) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.http.fetch.impl.GenericHttpFetcher.fetch(GenericHttpFetcher.java:399) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.http.fetch.HttpFetchClient.fetch(HttpFetchClient.java:102) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider.getRobotsTxt(StandardRobotsTxtProvider.java:99) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.http.pipeline.importer.HttpImporterPipeline$DelayResolverStage.executeStage(HttpImporterPipeline.java:89) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91) ~[norconex-commons-lang-2.0.0.jar!/:2.0.0]
  at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:375) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
  at com.norconex.collector.core.crawler.Crawler.processNextQueuedCrawlData(Crawler.java:611) ~[norconex-collector-core-2.0.0.jar!/:2.0.0]
  at com.norconex.collector.core.crawler.Crawler.processNextReference(Crawler.java:556) ~[norconex-collector-core-2.0.0.jar!/:2.0.0]
  at com.norconex.collector.core.crawler.Crawler$ProcessReferencesRunnable.run(Crawler.java:923) ~[norconex-collector-core-2.0.0.jar!/:2.0.0]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
  at java.lang.Thread.run(Thread.java:829) ~[?:?]

Expected behavior

HstsResolver tries to communicate with ipsj.or.jp.

Resources

essiembre commented 2 years ago

A new snapshot release was just made with a fix that now considers the "effective" top-level domain for a URL instead of just the last two parts of the domain. It is using the Public Suffix List as you suggested.
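
To illustrate the difference, here is a minimal sketch contrasting the "last two parts" approach with a public-suffix-aware lookup. It is not the actual HstsResolver code: the class name, method names, and the four-entry suffix set are hypothetical stand-ins, and a real implementation would consult the full Public Suffix List.

```java
import java.util.Arrays;
import java.util.Set;

/** Illustrative only: not the Norconex HstsResolver implementation. */
final class DomainSketch {

    // Hypothetical, tiny stand-in for the Public Suffix List.
    private static final Set<String> PUBLIC_SUFFIXES =
            Set.of("jp", "or.jp", "co.jp", "com");

    /** Behavior described in this issue: keep only the last two labels. */
    static String lastTwoLabels(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    /** Longest matching public suffix plus one more label to its left. */
    static String registrableDomain(String host) {
        String[] labels = host.split("\\.");
        // Try the longest candidate suffix first, then shorter ones.
        for (int i = 0; i < labels.length; i++) {
            String candidate = String.join(".",
                    Arrays.copyOfRange(labels, i, labels.length));
            if (PUBLIC_SUFFIXES.contains(candidate)) {
                // i == 0 means the host itself is a public suffix.
                return i == 0 ? host : labels[i - 1] + "." + candidate;
            }
        }
        return host; // no known suffix: leave the host untouched
    }

    public static void main(String[] args) {
        System.out.println(lastTwoLabels("www.ipsj.or.jp"));
        System.out.println(registrableDomain("www.ipsj.or.jp"));
    }
}
```

With the host from this issue, `lastTwoLabels` yields or.jp (the domain in the WARN message), while `registrableDomain` yields ipsj.or.jp.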

That being said, you will still get a warning/exception. The reason is that your public suffix is or.jp, so the effective top-level domain for your site is ipsj.or.jp (as you expected). That domain is not reachable (timeout) when trying to resolve HSTS with a HEAD request.
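
For readers unfamiliar with the check, a rough sketch of an HSTS probe via a HEAD request follows. It assumes the JDK 11+ `java.net.http` client rather than the Apache HttpClient seen in the stack trace, and the class and method names are hypothetical. Any connection failure or timeout is treated as "HSTS not supported", mirroring the assumption stated in the WARN message.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpHeaders;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Illustrative only: not the Norconex HstsResolver implementation. */
final class HstsCheckSketch {

    /** True when the headers include Strict-Transport-Security. */
    static boolean hasHstsHeader(HttpHeaders headers) {
        // Header name lookup is case-insensitive.
        return headers.firstValue("Strict-Transport-Security").isPresent();
    }

    /** Sends HEAD https://<domain>/ and reports whether HSTS is advertised. */
    static boolean supportsHsts(String domain, Duration timeout) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(timeout)
                .build();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create("https://" + domain + "/"))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .timeout(timeout)
                .build();
        try {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            return hasHstsHeader(response.headers());
        } catch (IOException e) {
            // Unknown host, refused connection, or timeout: assume no HSTS.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```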

To ensure only https URLs get crawled for your site, I can think of two options:

  1. Update the website so HSTS can be resolved against the top-level domain ipsj.or.jp.
  2. Update your crawler configuration to set disableHSTS to true on the GenericHttpFetcher and enforce https using the GenericURLNormalizer.
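
Option 2 is done through crawler configuration, which is not reproduced here. As a rough illustration of what enforcing https on a URL amounts to, the sketch below assumes a plain scheme rewrite. The class and method names are hypothetical and this is not GenericURLNormalizer's code.

```java
/** Illustrative only: not the Norconex GenericURLNormalizer implementation. */
final class HttpsEnforcerSketch {

    private static final String HTTP_PREFIX = "http://";

    /**
     * Rewrites an http:// URL to https:// and leaves other URLs unchanged.
     * Limitation: an explicit ":80" port is kept as-is and would need
     * separate handling.
     */
    static String toHttps(String url) {
        // Case-insensitive check for the "http://" prefix.
        if (url.regionMatches(true, 0, HTTP_PREFIX, 0, HTTP_PREFIX.length())) {
            return "https://" + url.substring(HTTP_PREFIX.length());
        }
        return url;
    }
}
```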

blue-jam commented 2 years ago

Thank you very much for fixing it.

> That being said, you will still get a warning/exception. The reason is, that your public suffix is or.jp so the effective top-level domain for your site is ipsj.or.jp (as you expected). That domain is not reachable (timeout) when trying to resolve HSTS with a HEAD request.

Actually, the URL I shared was just an example which I randomly picked from sites I was familiar with. However, your suggestions to mitigate another error message are very helpful.

I'm looking forward to a new release with the fix.