Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or a filesystem and storing it in various data repositories, such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0
183 stars 67 forks

handshake_failure #613

Closed: FcrbPeter closed this issue 5 years ago

FcrbPeter commented 5 years ago

I am crawling a website served over HTTPS, and it seems the SSL handshake cannot be completed.

I am using Java version "1.8.0_202" and a Norconex HTTP Collector 2.9.0 snapshot.

Below is the config.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>

<httpcollector id="MCGCS Web crawler">

  <!-- Decide where to store generated files. -->
  <progressDir>./output/progress</progressDir>
  <logsDir>./output/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">
      <startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="false" stayOnProtocol="false">
        <!-- <url>http://app3.rthk.hk/search/google/start.php</url> -->
        <!-- <url>http://programme.rthk.hk/archivelist_gsa.php?channel=dtt31</url> -->
        <!-- <url>https://www.rthk.hk/</url> -->
        <url>https://news.rthk.hk/</url>
        <!-- <url>http://podcast.rthk.hk/</url> -->
        <!-- <url>http://app4.rthk.hk/special/rthkmemory/</url> -->
        <!-- <url>http://app4.rthk.hk/elearning/healthpedia/</url> -->
      </startURLs>

      <documentFilters>
<!-- <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include"> -->
    <!-- ^http\:\/\/app3\.rthk\.hk\/search\/google\/start\.php -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="include">
    http://app3.rthk.hk/search/google/start.php
</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="include">
    rthk.hk/
</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="include">
    rthk.org.hk/
</filter>
      </documentFilters>

      <referenceFilters>
<filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="include">
    html,pdf,doc,docx,xls,xlsx,ppt,pptx,xml,xml,rtf
</filter>
<filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">
    jpeg,jpg,png,gif,ico,mp3,mp4,avi,mkv,flv
</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
    ^http://.*
</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
    ^https://.*
</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
    .*rthk\.hk/.*
</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
    .*rthk\.org\.hk/.*
</filter>
      </referenceFilters>

      <userAgent>gsa-crawler</userAgent>
      <workDir>./output</workDir>

      <orphansStrategy>DELETE</orphansStrategy>

      <linkExtractors>
        <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor" charset="UTF-8" />
      </linkExtractors>
      <httpClientFactory>
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
        <sslProtocols>SSLv3, TLSv1, TLSv1.1, TLSv1.2</sslProtocols>
      </httpClientFactory>
      <!-- <sitemapResolverFactory ignore="true" /> -->
      <!-- <robotsTxt ignore="true" /> -->
      <!-- <robotsMeta ignore="true" /> -->

      <maxDepth>-1</maxDepth>
      <numThreads>4</numThreads>
      <delay default="100" scope="thread" />

      <importer>
        <preParseHandlers>
          <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
            <restrictTo field="document.contentType">text/html</restrictTo>
            <stripBetween>
              <start><![CDATA[<!--googleoff: index-->]]></start>
              <end><![CDATA[<!--googleon: index-->]]></end>
            </stripBetween>
          </transformer>
          <transformer class="com.norconex.importer.handler.transformer.impl.StripAfterTransformer" inclusive="true">
            <restrictTo field="document.contentType">text/html</restrictTo>
            <stripAfterRegex><![CDATA[<!--googleoff: index-->]]></stripAfterRegex>
          </transformer>
          <tagger class="com.norconex.committer.googlecloudsearch.BinaryContentTagger"/>
        </preParseHandlers>

        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>binaryContent,document.reference,document.contentType,collection,score,title,description,channelName,programmeName,episodeDate,episodeName,image</fields>
          </tagger>
        </postParseHandlers>
      </importer>

      <!-- <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./output/crawledFiles</directory>
      </committer> -->

      <!-- <committer class="com.norconex.committer.core.impl.NilCommitter" /> -->

      <committer class="com.norconex.committer.core.impl.JSONFileCommitter">
        <directory>./output/crawledFiles</directory>
        <pretty>true</pretty>
        <docsPerFile>200</docsPerFile>
        <compress>false</compress>
        <splitAddDelete>false</splitAddDelete>
      </committer>

      <!-- <committer class="com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter">
        <configFilePath>./config/sdk-configuration.properties</configFilePath>
        <uploadFormat>raw</uploadFormat>
      </committer> -->

    </crawler>
  </crawlers>

</httpcollector>

And below is the log.

➜  crawler ./collector-http.sh -a start -c config/config.xml
INFO  [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=INCLUDE,extensions=html,pdf,doc,docx,xls,xlsx,ppt,pptx,xml,xml,rtf,caseSensitive=false]
INFO  [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,png,gif,ico,mp3,mp4,avi,mkv,flv,caseSensitive=false]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=^http://.*]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=^https://.*]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=.*rthk\.hk/.*]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=.*rthk\.org\.hk/.*]
INFO  [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=http://app3.rthk.hk/search/google/start.php]
INFO  [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=rthk.hk/]
INFO  [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=rthk.org.hk/]
INFO  [HttpCrawlerConfig] Link extractor loaded: GenericLinkExtractor[contentTypes={text/html,application/xhtml+xml,vnd.wap.xhtml+xml,x-asp},schemes={http,https,ftp},maxURLLength=2048,ignoreNofollow=false,commentsEnabled=false,tagAttribs=ObservableMap [map={a=[href], img=[src], meta=[http-equiv], iframe=[src], frame=[src]}],charset=UTF-8,extractBetweens=[],noExtractBetweens=[],extractSelectors=[],noExtractSelectors=[]]
INFO  [AbstractCollectorConfig] Configuration loaded: id=MCGCS Web crawler; logsDir=./output/logs; progressDir=./output/progress
INFO  [JobSuite] JEF work directory is: ./output/progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] Previous execution detected.
INFO  [JobSuite] Backing up previous execution status and log files.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.9.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.9.2-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.9.1-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.2-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.3-SNAPSHOT (Norconex Inc.)
INFO  [JobSuite] Running Norconex Minimum Test Page: BEGIN (Thu Jun 13 12:42:32 HKT 2019)
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Sitemap support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: User-Agent: gsa-crawler
INFO  [GenericHttpClientFactory] SSL: Trusting all certificates.
INFO  [SitemapStore] Norconex Minimum Test Page: Initializing sitemap store...
INFO  [SitemapStore] Norconex Minimum Test Page: Done initializing sitemap store.
WARN  [StandardRobotsTxtProvider] Not able to obtain robots.txt at: https://news.rthk.hk/robots.txt
javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure
        at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
        at sun.security.ssl.Alerts.getSSLException(Alerts.java:154)
        at sun.security.ssl.SSLSocketImpl.recvAlert(SSLSocketImpl.java:2020)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1127)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider.getRobotsTxt(StandardRobotsTxtProvider.java:93)
        at com.norconex.collector.http.pipeline.queue.HttpQueuePipeline.getRobotsTxt(HttpQueuePipeline.java:78)
        at com.norconex.collector.http.pipeline.queue.RobotsTxtFiltersStage.findRejectingRobotsFilter(RobotsTxtFiltersStage.java:69)
        at com.norconex.collector.http.pipeline.queue.RobotsTxtFiltersStage.executeStage(RobotsTxtFiltersStage.java:46)
        at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:31)
        at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeQueuePipeline(HttpCrawler.java:280)
        at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLsRegular(HttpCrawler.java:156)
        at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLs(HttpCrawler.java:140)
        at com.norconex.collector.http.crawler.HttpCrawler.prepareExecution(HttpCrawler.java:131)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:216)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:184)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:353)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:292)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:165)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:150)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:74)
ERROR [StandardSitemapResolver] Cannot fetch sitemap: https://news.rthk.hk/sitemap.xml (Received fatal alert: handshake_failure)
ERROR [StandardSitemapResolver] Cannot fetch sitemap: https://news.rthk.hk/sitemap_index.xml (Received fatal alert: handshake_failure)
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawling references...
INFO  [GenericDocumentFetcher] Cannot fetch document: https://news.rthk.hk/ (Received fatal alert: handshake_failure)
INFO  [CrawlerEventManager]            REJECTED_ERROR: https://news.rthk.hk/ (com.norconex.collector.core.CollectorException: javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure)
INFO  [AbstractCrawler] Norconex Minimum Test Page: Could not process document: https://news.rthk.hk/ (javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure)
INFO  [AbstractCrawler] Norconex Minimum Test Page: 100% completed (1 processed/1 total)
INFO  [AbstractCrawler] Norconex Minimum Test Page: Deleting orphan references (if any)...
INFO  [AbstractCrawler] Norconex Minimum Test Page: Deleted 0 orphan references...
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO  [AbstractCrawler] Norconex Minimum Test Page: 1 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler completed.
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler executed in 1 minute 21 seconds.
INFO  [SitemapStore] Norconex Minimum Test Page: Closing sitemap store...
INFO  [JobSuite] Running Norconex Minimum Test Page: END (Thu Jun 13 12:42:32 HKT 2019)
FcrbPeter commented 5 years ago

The link below is the SSL Labs analysis of the website's SSL certificate: https://www.ssllabs.com/ssltest/analyze.html?viaform=on&d=news.rthk.hk

jetnet commented 5 years ago

Try this: <trustAllSSLCertificates>false</trustAllSSLCertificates>

This option has a side effect: it disables the SNI feature, and some sites do require that header to be present. @essiembre, maybe it makes sense NOT to disable SNI when trusting all certificates? I spent a lot of time debugging this same issue recently. Thanks!
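For context, in plain Java the SNI server name is requested through `javax.net.ssl.SSLParameters`, an API that only appeared in Java 8 (which is presumably why a crawler-level SNI toggle needs that baseline). A minimal sketch, with a class name of my own choosing:

```java
import javax.net.ssl.SNIHostName;
import javax.net.ssl.SSLParameters;
import java.util.Collections;

public class SniDemo {

    // Build SSLParameters that explicitly request SNI for the given host.
    // Applying these to an SSLSocket or SSLEngine makes the client send the
    // server_name extension during the TLS handshake, which is what a server
    // like news.rthk.hk requires before it will complete the handshake.
    static SSLParameters withSni(String host) {
        SSLParameters params = new SSLParameters();
        params.setServerNames(Collections.singletonList(new SNIHostName(host)));
        return params;
    }

    public static void main(String[] args) {
        SSLParameters p = withSni("news.rthk.hk");
        // prints "news.rthk.hk"
        System.out.println(((SNIHostName) p.getServerNames().get(0)).getAsciiName());
    }
}
```

A custom socket factory that keeps a trust-all `TrustManager` but still applies such parameters would avoid the side effect described above.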

Test with OpenSSL 0.9.8zh, without SNI:

openssl s_client -host news.rthk.hk -port 443

CONNECTED(00000003)
37845:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:/BuildRoot/Library/Caches/com.apple.xbs/Sources/OpenSSL098/OpenSSL098-64.50.7/src/ssl/s23_lib.c:185:

and with SNI:

openssl s_client -host news.rthk.hk -port 443 -servername news.rthk.hk

CONNECTED(00000003)
depth=2 /C=HK/O=Hongkong Post/CN=Hongkong Post Root CA 1
verify error:num=19:self signed certificate in certificate chain
verify return:0
---
Certificate chain
 0 s:/C=HK/ST=Hong Kong/L=Hong Kong/O=Hong Kong SAR Government/OU=0002104861/OU=000000000000000000000000RTHK/OU=Hongkong Post e-Cert (Server)/OU=Radio Television Hong Kong/CN=*.rthk.hk
   i:/C=HK/ST=Hong Kong/L=Hong Kong/O=Hongkong Post/CN=Hongkong Post e-Cert CA 1 - 15
 1 s:/C=HK/ST=Hong Kong/L=Hong Kong/O=Hongkong Post/CN=Hongkong Post e-Cert CA 1 - 15
   i:/C=HK/O=Hongkong Post/CN=Hongkong Post Root CA 1
 2 s:/C=HK/O=Hongkong Post/CN=Hongkong Post Root CA 1
   i:/C=HK/O=Hongkong Post/CN=Hongkong Post Root CA 1
---
Server certificate
-----BEGIN CERTIFICATE-----
FcrbPeter commented 5 years ago

Thanks for replying!

I have tried out <trustAllSSLCertificates>false</trustAllSSLCertificates>, and it shows another error.

➜  crawler ./collector-http.sh -a start -c config/config.xml
INFO  [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=INCLUDE,extensions=html,pdf,doc,docx,xls,xlsx,ppt,pptx,xml,xml,rtf,caseSensitive=false]
INFO  [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,png,gif,ico,mp3,mp4,avi,mkv,flv,caseSensitive=false]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=^http://.*]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=^https://.*]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=.*rthk\.hk/.*]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=.*rthk\.org\.hk/.*]
INFO  [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=http://app3.rthk.hk/search/google/start.php]
INFO  [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=rthk.hk/]
INFO  [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=rthk.org.hk/]
INFO  [HttpCrawlerConfig] Link extractor loaded: GenericLinkExtractor[contentTypes={text/html,application/xhtml+xml,vnd.wap.xhtml+xml,x-asp},schemes={http,https,ftp},maxURLLength=2048,ignoreNofollow=false,commentsEnabled=false,tagAttribs=ObservableMap [map={a=[href], img=[src], meta=[http-equiv], iframe=[src], frame=[src]}],charset=UTF-8,extractBetweens=[],noExtractBetweens=[],extractSelectors=[],noExtractSelectors=[]]
INFO  [AbstractCollectorConfig] Configuration loaded: id=MCGCS Web crawler; logsDir=./output/logs; progressDir=./output/progress
INFO  [JobSuite] JEF work directory is: ./output/progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] Previous execution detected.
INFO  [JobSuite] Backing up previous execution status and log files.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.9.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.9.2-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.9.1-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.2-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.3-SNAPSHOT (Norconex Inc.)
INFO  [JobSuite] Running Norconex Minimum Test Page: BEGIN (Fri Jun 14 11:16:28 HKT 2019)
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Sitemap support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: User-Agent: gsa-crawler
INFO  [SitemapStore] Norconex Minimum Test Page: Initializing sitemap store...
INFO  [SitemapStore] Norconex Minimum Test Page: Done initializing sitemap store.
WARN  [StandardRobotsTxtProvider] Not able to obtain robots.txt at: https://news.rthk.hk/robots.txt
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
        at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
        at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
        at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310)
        at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1639)
        at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:223)
        at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037)
        at sun.security.ssl.Handshaker.process_record(Handshaker.java:965)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1064)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider.getRobotsTxt(StandardRobotsTxtProvider.java:93)
        at com.norconex.collector.http.pipeline.queue.HttpQueuePipeline.getRobotsTxt(HttpQueuePipeline.java:78)
        at com.norconex.collector.http.pipeline.queue.RobotsTxtFiltersStage.findRejectingRobotsFilter(RobotsTxtFiltersStage.java:69)
        at com.norconex.collector.http.pipeline.queue.RobotsTxtFiltersStage.executeStage(RobotsTxtFiltersStage.java:46)
        at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:31)
        at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeQueuePipeline(HttpCrawler.java:280)
        at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLsRegular(HttpCrawler.java:156)
        at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLs(HttpCrawler.java:140)
        at com.norconex.collector.http.crawler.HttpCrawler.prepareExecution(HttpCrawler.java:131)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:216)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:184)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:353)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:292)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:165)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:150)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:74)
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:397)
        at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:302)
        at sun.security.validator.Validator.validate(Validator.java:262)
        at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
        at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
        at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
        at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1621)
        ... 40 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
        at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
        at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
        at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:392)
        ... 46 more
ERROR [StandardSitemapResolver] Cannot fetch sitemap: https://news.rthk.hk/sitemap.xml (sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
ERROR [StandardSitemapResolver] Cannot fetch sitemap: https://news.rthk.hk/sitemap_index.xml (sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawling references...
INFO  [GenericDocumentFetcher] Cannot fetch document: https://news.rthk.hk/ (sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO  [CrawlerEventManager]            REJECTED_ERROR: https://news.rthk.hk/ (com.norconex.collector.core.CollectorException: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO  [AbstractCrawler] Norconex Minimum Test Page: Could not process document: https://news.rthk.hk/ (javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO  [AbstractCrawler] Norconex Minimum Test Page: Deleting orphan references (if any)...
INFO  [AbstractCrawler] Norconex Minimum Test Page: Deleted 0 orphan references...
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO  [AbstractCrawler] Norconex Minimum Test Page: 1 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler completed.
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler executed in 0 second.
INFO  [SitemapStore] Norconex Minimum Test Page: Closing sitemap store...
INFO  [JobSuite] Running Norconex Minimum Test Page: END (Fri Jun 14 11:16:28 HKT 2019)
FcrbPeter commented 5 years ago

I searched around about the PKIX path building failed error.

And found this: https://github.com/escline/InstallCert. After installing the certificate, the collector works fine with <trustAllSSLCertificates>false</trustAllSSLCertificates>.

essiembre commented 5 years ago

Glad you found a solution. Thanks for sharing it.

@jetnet, there is a pull request to control SNI enabling at the crawler level at #577. Since it requires Java 8, it will be part of the next major release.