handshake_failure - Githubissues

Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

Apache License 2.0


I am crawling a website over HTTPS, and the SSL handshake fails before any page can be fetched.

I am using Java version "1.8.0_202" and Norconex HTTP Collector 2.9.0-SNAPSHOT.

Below is the config.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>

<httpcollector id="MCGCS Web crawler">

  <!-- Decide where to store generated files. -->
  <progressDir>./output/progress</progressDir>
  <logsDir>./output/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">
      <startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="false" stayOnProtocol="false">
        <!-- <url>http://app3.rthk.hk/search/google/start.php</url> -->
        <!-- <url>http://programme.rthk.hk/archivelist_gsa.php?channel=dtt31</url> -->
        <!-- <url>https://www.rthk.hk/</url> -->
        <url>https://news.rthk.hk/</url>
        <!-- <url>http://podcast.rthk.hk/</url> -->
        <!-- <url>http://app4.rthk.hk/special/rthkmemory/</url> -->
        <!-- <url>http://app4.rthk.hk/elearning/healthpedia/</url> -->
      </startURLs>

      <documentFilters>
        <!-- <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include"> -->
        <!-- ^http\:\/\/app3\.rthk\.hk\/search\/google\/start\.php -->
        <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="include">
          http://app3.rthk.hk/search/google/start.php
        </filter>
        <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="include">
          rthk.hk/
        </filter>
        <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="include">
          rthk.org.hk/
        </filter>
      </documentFilters>

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="include">
          html,pdf,doc,docx,xls,xlsx,ppt,pptx,xml,xml,rtf
        </filter>
        <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">
          jpeg,jpg,png,gif,ico,mp3,mp4,avi,mkv,flv
        </filter>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
          ^http://.*
        </filter>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
          ^https://.*
        </filter>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
          .*rthk\.hk/.*
        </filter>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
          .*rthk\.org\.hk/.*
        </filter>
      </referenceFilters>

      <userAgent>gsa-crawler</userAgent>
      <workDir>./output</workDir>

      <orphansStrategy>DELETE</orphansStrategy>

      <linkExtractors>
        <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor" charset="UTF-8" />
      </linkExtractors>
      <httpClientFactory>
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
        <sslProtocols>SSLv3, TLSv1, TLSv1.1, TLSv1.2</sslProtocols>
      </httpClientFactory>
      <!-- <sitemapResolverFactory ignore="true" /> -->
      <!-- <robotsTxt ignore="true" /> -->
      <!-- <robotsMeta ignore="true" /> -->

      <maxDepth>-1</maxDepth>
      <numThreads>4</numThreads>
      <delay default="100" scope="thread" />

      <importer>
        <preParseHandlers>
          <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
            <restrictTo field="document.contentType">text/html</restrictTo>
            <stripBetween>
              <start><![CDATA[<!--googleoff: index-->]]></start>
              <end><![CDATA[<!--googleon: index-->]]></end>
            </stripBetween>
          </transformer>
          <transformer class="com.norconex.importer.handler.transformer.impl.StripAfterTransformer" inclusive="true">
            <restrictTo field="document.contentType">text/html</restrictTo>
            <stripAfterRegex><![CDATA[<!--googleoff: index-->]]></stripAfterRegex>
          </transformer>
          <tagger class="com.norconex.committer.googlecloudsearch.BinaryContentTagger"/>
        </preParseHandlers>

        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>binaryContent,document.reference,document.contentType,collection,score,title,description,channelName,programmeName,episodeDate,episodeName,image</fields>
          </tagger>
        </postParseHandlers>
      </importer>

      <!-- <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./output/crawledFiles</directory>
      </committer> -->

      <!-- <committer class="com.norconex.committer.core.impl.NilCommitter" /> -->

      <committer class="com.norconex.committer.core.impl.JSONFileCommitter">
        <directory>./output/crawledFiles</directory>
        <pretty>true</pretty>
        <docsPerFile>200</docsPerFile>
        <compress>false</compress>
        <splitAddDelete>false</splitAddDelete>
      </committer>

      <!-- <committer class="com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter">
        <configFilePath>./config/sdk-configuration.properties</configFilePath>
        <uploadFormat>raw</uploadFormat>
      </committer> -->

    </crawler>
  </crawlers>

</httpcollector>
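
The `<sslProtocols>` entry in the config asks for SSLv3 through TLSv1.2, but what can actually be negotiated also depends on the JVM (recent Java 8 updates, as far as I know, disable SSLv3 by default through the `jdk.tls.disabledAlgorithms` security property). A minimal sketch for checking this on the machine running the crawler, assuming only the standard JSSE API; the class name `TlsProtocolCheck` is made up for illustration and is not part of Norconex:

```java
import javax.net.ssl.SSLContext;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Norconex: compares the protocols listed
// in <sslProtocols> with what this JVM enables and supports.
class TlsProtocolCheck {

    /** Protocols the default SSLContext enables for new connections. */
    static List<String> enabledProtocols() {
        try {
            return Arrays.asList(
                    SSLContext.getDefault().getDefaultSSLParameters().getProtocols());
        } catch (Exception e) {
            throw new IllegalStateException("No default SSLContext", e);
        }
    }

    /** Protocols the default SSLContext could enable if asked to. */
    static List<String> supportedProtocols() {
        try {
            return Arrays.asList(
                    SSLContext.getDefault().getSupportedSSLParameters().getProtocols());
        } catch (Exception e) {
            throw new IllegalStateException("No default SSLContext", e);
        }
    }

    public static void main(String[] args) {
        // Same list as the <sslProtocols> element in the config.
        List<String> requested = Arrays.asList("SSLv3", "TLSv1", "TLSv1.1", "TLSv1.2");
        List<String> enabled = enabledProtocols();
        List<String> supported = supportedProtocols();
        for (String protocol : requested) {
            System.out.println(protocol
                    + " enabled=" + enabled.contains(protocol)
                    + " supported=" + supported.contains(protocol));
        }
    }
}
```

Running it with the same `java` binary that starts `collector-http.sh` shows which of the requested protocols that JVM would really offer.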

And below is the log.

➜  crawler ./collector-http.sh -a start -c config/config.xml
INFO  [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=INCLUDE,extensions=html,pdf,doc,docx,xls,xlsx,ppt,pptx,xml,xml,rtf,caseSensitive=false]
INFO  [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,png,gif,ico,mp3,mp4,avi,mkv,flv,caseSensitive=false]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=^http://.*]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=^https://.*]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=.*rthk\.hk/.*]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=.*rthk\.org\.hk/.*]
INFO  [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=http://app3.rthk.hk/search/google/start.php]
INFO  [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=rthk.hk/]
INFO  [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=rthk.org.hk/]
INFO  [HttpCrawlerConfig] Link extractor loaded: GenericLinkExtractor[contentTypes={text/html,application/xhtml+xml,vnd.wap.xhtml+xml,x-asp},schemes={http,https,ftp},maxURLLength=2048,ignoreNofollow=false,commentsEnabled=false,tagAttribs=ObservableMap [map={a=[href], img=[src], meta=[http-equiv], iframe=[src], frame=[src]}],charset=UTF-8,extractBetweens=[],noExtractBetweens=[],extractSelectors=[],noExtractSelectors=[]]
INFO  [AbstractCollectorConfig] Configuration loaded: id=MCGCS Web crawler; logsDir=./output/logs; progressDir=./output/progress
INFO  [JobSuite] JEF work directory is: ./output/progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] Previous execution detected.
INFO  [JobSuite] Backing up previous execution status and log files.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.9.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.9.2-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.9.1-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.2-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.3-SNAPSHOT (Norconex Inc.)
INFO  [JobSuite] Running Norconex Minimum Test Page: BEGIN (Thu Jun 13 12:42:32 HKT 2019)
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Sitemap support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: User-Agent: gsa-crawler
INFO  [GenericHttpClientFactory] SSL: Trusting all certificates.
INFO  [SitemapStore] Norconex Minimum Test Page: Initializing sitemap store...
INFO  [SitemapStore] Norconex Minimum Test Page: Done initializing sitemap store.
WARN  [StandardRobotsTxtProvider] Not able to obtain robots.txt at: https://news.rthk.hk/robots.txt
javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure
        at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
        at sun.security.ssl.Alerts.getSSLException(Alerts.java:154)
        at sun.security.ssl.SSLSocketImpl.recvAlert(SSLSocketImpl.java:2020)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1127)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider.getRobotsTxt(StandardRobotsTxtProvider.java:93)
        at com.norconex.collector.http.pipeline.queue.HttpQueuePipeline.getRobotsTxt(HttpQueuePipeline.java:78)
        at com.norconex.collector.http.pipeline.queue.RobotsTxtFiltersStage.findRejectingRobotsFilter(RobotsTxtFiltersStage.java:69)
        at com.norconex.collector.http.pipeline.queue.RobotsTxtFiltersStage.executeStage(RobotsTxtFiltersStage.java:46)
        at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:31)
        at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeQueuePipeline(HttpCrawler.java:280)
        at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLsRegular(HttpCrawler.java:156)
        at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLs(HttpCrawler.java:140)
        at com.norconex.collector.http.crawler.HttpCrawler.prepareExecution(HttpCrawler.java:131)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:216)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:184)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:353)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:292)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:165)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:150)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:74)
ERROR [StandardSitemapResolver] Cannot fetch sitemap: https://news.rthk.hk/sitemap.xml (Received fatal alert: handshake_failure)
ERROR [StandardSitemapResolver] Cannot fetch sitemap: https://news.rthk.hk/sitemap_index.xml (Received fatal alert: handshake_failure)
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawling references...
INFO  [GenericDocumentFetcher] Cannot fetch document: https://news.rthk.hk/ (Received fatal alert: handshake_failure)
INFO  [CrawlerEventManager]            REJECTED_ERROR: https://news.rthk.hk/ (com.norconex.collector.core.CollectorException: javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure)
INFO  [AbstractCrawler] Norconex Minimum Test Page: Could not process document: https://news.rthk.hk/ (javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure)
INFO  [AbstractCrawler] Norconex Minimum Test Page: 100% completed (1 processed/1 total)
INFO  [AbstractCrawler] Norconex Minimum Test Page: Deleting orphan references (if any)...
INFO  [AbstractCrawler] Norconex Minimum Test Page: Deleted 0 orphan references...
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO  [AbstractCrawler] Norconex Minimum Test Page: 1 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler completed.
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler executed in 1 minute 21 seconds.
INFO  [SitemapStore] Norconex Minimum Test Page: Closing sitemap store...
INFO  [JobSuite] Running Norconex Minimum Test Page: END (Thu Jun 13 12:42:32 HKT 2019)

openssl s_client -host news.rthk.hk -port 443
CONNECTED(00000003)
37845:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:/BuildRoot/Library/Caches/com.apple.xbs/Sources/OpenSSL098/OpenSSL098-64.50.7/src/ssl/s23_lib.c:185:

openssl s_client -host news.rthk.hk -port 443 -servername news.rthk.hk
CONNECTED(00000003)
depth=2 /C=HK/O=Hongkong Post/CN=Hongkong Post Root CA 1
verify error:num=19:self signed certificate in certificate chain
verify return:0
---
Certificate chain
 0 s:/C=HK/ST=Hong Kong/L=Hong Kong/O=Hong Kong SAR Government/OU=0002104861/OU=000000000000000000000000RTHK/OU=Hongkong Post e-Cert (Server)/OU=Radio Television Hong Kong/CN=*.rthk.hk
   i:/C=HK/ST=Hong Kong/L=Hong Kong/O=Hongkong Post/CN=Hongkong Post e-Cert CA 1 - 15
 1 s:/C=HK/ST=Hong Kong/L=Hong Kong/O=Hongkong Post/CN=Hongkong Post e-Cert CA 1 - 15
   i:/C=HK/O=Hongkong Post/CN=Hongkong Post Root CA 1
 2 s:/C=HK/O=Hongkong Post/CN=Hongkong Post Root CA 1
   i:/C=HK/O=Hongkong Post/CN=Hongkong Post Root CA 1
---
Server certificate
-----BEGIN CERTIFICATE-----
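
The two openssl runs differ only in `-servername`, which suggests the server rejects a handshake that carries no SNI (Server Name Indication) extension. The same comparison can be sketched from Java with the standard JSSE API. The class name `SniProbe` is an assumption for illustration, and this is not how Norconex opens its connections:

```java
import javax.net.ssl.SNIHostName;
import javax.net.ssl.SNIServerName;
import javax.net.ssl.SSLParameters;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;
import java.util.Collections;
import java.util.List;

// Hypothetical probe, not part of Norconex: attempts a TLS handshake
// with and without the SNI extension, like openssl with/without -servername.
class SniProbe {

    /** One host name when SNI is wanted; an empty list turns SNI off. */
    static List<SNIServerName> serverNames(String host, boolean sendSni) {
        if (!sendSni) {
            return Collections.emptyList();
        }
        return Collections.<SNIServerName>singletonList(new SNIHostName(host));
    }

    /** Tries one handshake and reports the outcome as text. */
    static String handshake(String host, int port, boolean sendSni) {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket(host, port)) {
            SSLParameters params = socket.getSSLParameters();
            params.setServerNames(serverNames(host, sendSni));
            socket.setSSLParameters(params);
            socket.startHandshake();
            return "ok (" + socket.getSession().getProtocol() + ")";
        } catch (Exception e) {
            return "failed: " + e;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "news.rthk.hk";
        System.out.println("without SNI: " + handshake(host, 443, false));
        System.out.println("with SNI:    " + handshake(host, 443, true));
    }
}
```

With the JVM's default trust store the SNI attempt may still fail, but with a certificate validation error rather than `handshake_failure`, which would tell the two problems apart.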

➜  crawler ./collector-http.sh -a start -c config/config.xml
INFO  [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=INCLUDE,extensions=html,pdf,doc,docx,xls,xlsx,ppt,pptx,xml,xml,rtf,caseSensitive=false]
INFO  [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,png,gif,ico,mp3,mp4,avi,mkv,flv,caseSensitive=false]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=^http://.*]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=^https://.*]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=.*rthk\.hk/.*]
INFO  [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=.*rthk\.org\.hk/.*]
INFO  [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=http://app3.rthk.hk/search/google/start.php]
INFO  [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=rthk.hk/]
INFO  [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=rthk.org.hk/]
INFO  [HttpCrawlerConfig] Link extractor loaded: GenericLinkExtractor[contentTypes={text/html,application/xhtml+xml,vnd.wap.xhtml+xml,x-asp},schemes={http,https,ftp},maxURLLength=2048,ignoreNofollow=false,commentsEnabled=false,tagAttribs=ObservableMap [map={a=[href], img=[src], meta=[http-equiv], iframe=[src], frame=[src]}],charset=UTF-8,extractBetweens=[],noExtractBetweens=[],extractSelectors=[],noExtractSelectors=[]]
INFO  [AbstractCollectorConfig] Configuration loaded: id=MCGCS Web crawler; logsDir=./output/logs; progressDir=./output/progress
INFO  [JobSuite] JEF work directory is: ./output/progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] Previous execution detected.
INFO  [JobSuite] Backing up previous execution status and log files.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.9.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.9.2-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.9.1-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.2-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.3-SNAPSHOT (Norconex Inc.)
INFO  [JobSuite] Running Norconex Minimum Test Page: BEGIN (Fri Jun 14 11:16:28 HKT 2019)
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Sitemap support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: User-Agent: gsa-crawler
INFO  [SitemapStore] Norconex Minimum Test Page: Initializing sitemap store...
INFO  [SitemapStore] Norconex Minimum Test Page: Done initializing sitemap store.
WARN  [StandardRobotsTxtProvider] Not able to obtain robots.txt at: https://news.rthk.hk/robots.txt
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
        at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
        at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
        at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310)
        at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1639)
        at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:223)
        at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037)
        at sun.security.ssl.Handshaker.process_record(Handshaker.java:965)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1064)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider.getRobotsTxt(StandardRobotsTxtProvider.java:93)
        at com.norconex.collector.http.pipeline.queue.HttpQueuePipeline.getRobotsTxt(HttpQueuePipeline.java:78)
        at com.norconex.collector.http.pipeline.queue.RobotsTxtFiltersStage.findRejectingRobotsFilter(RobotsTxtFiltersStage.java:69)
        at com.norconex.collector.http.pipeline.queue.RobotsTxtFiltersStage.executeStage(RobotsTxtFiltersStage.java:46)
        at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:31)
        at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeQueuePipeline(HttpCrawler.java:280)
        at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLsRegular(HttpCrawler.java:156)
        at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLs(HttpCrawler.java:140)
        at com.norconex.collector.http.crawler.HttpCrawler.prepareExecution(HttpCrawler.java:131)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:216)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:184)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:353)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:292)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:165)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:150)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:74)
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:397)
        at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:302)
        at sun.security.validator.Validator.validate(Validator.java:262)
        at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
        at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
        at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
        at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1621)
        ... 40 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
        at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
        at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
        at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:392)
        ... 46 more
ERROR [StandardSitemapResolver] Cannot fetch sitemap: https://news.rthk.hk/sitemap.xml (sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
ERROR [StandardSitemapResolver] Cannot fetch sitemap: https://news.rthk.hk/sitemap_index.xml (sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawling references...
INFO  [GenericDocumentFetcher] Cannot fetch document: https://news.rthk.hk/ (sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO  [CrawlerEventManager]            REJECTED_ERROR: https://news.rthk.hk/ (com.norconex.collector.core.CollectorException: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO  [AbstractCrawler] Norconex Minimum Test Page: Could not process document: https://news.rthk.hk/ (javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO  [AbstractCrawler] Norconex Minimum Test Page: Deleting orphan references (if any)...
INFO  [AbstractCrawler] Norconex Minimum Test Page: Deleted 0 orphan references...
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO  [AbstractCrawler] Norconex Minimum Test Page: 1 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler completed.
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler executed in 0 second.
INFO  [SitemapStore] Norconex Minimum Test Page: Closing sitemap store...
INFO  [JobSuite] Running Norconex Minimum Test Page: END (Fri Jun 14 11:16:28 HKT 2019)
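
The `PKIX path building failed` error in this run means the JVM could not chain the server certificate to a root it trusts; the openssl output earlier shows the chain ending at "Hongkong Post Root CA 1". One possible way around that without trusting every certificate is to trust that root explicitly. Below is a minimal sketch using the standard `java.security` and JSSE APIs. The class name `ExtraRootTrust` and the PEM file passed on the command line are assumptions for illustration, and how the resulting context would be wired into Norconex is not covered here:

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;

// Hypothetical helper, not part of Norconex: builds an SSLContext that
// trusts only the certificates placed in the given in-memory key store.
class ExtraRootTrust {

    /** Creates an empty in-memory key store to hold trusted roots. */
    static KeyStore emptyStore() {
        try {
            KeyStore store = KeyStore.getInstance(KeyStore.getDefaultType());
            store.load(null, null); // null stream means "start empty"
            return store;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot create key store", e);
        }
    }

    /** Reads one X.509 certificate (PEM or DER) and adds it as trusted. */
    static void addCertificate(KeyStore store, String alias, String path)
            throws Exception {
        try (InputStream in = new FileInputStream(path)) {
            Certificate cert =
                    CertificateFactory.getInstance("X.509").generateCertificate(in);
            store.setCertificateEntry(alias, cert);
        }
    }

    /** Builds a TLS context whose trust decisions come from the store. */
    static SSLContext contextTrusting(KeyStore store) {
        try {
            TrustManagerFactory tmf = TrustManagerFactory.getInstance(
                    TrustManagerFactory.getDefaultAlgorithm());
            tmf.init(store);
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, tmf.getTrustManagers(), null);
            return context;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot build SSLContext", e);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: ExtraRootTrust <root-ca-certificate-file>");
            return;
        }
        KeyStore store = emptyStore();
        addCertificate(store, "extra-root", args[0]);
        SSLContext context = contextTrusting(store);
        System.out.println("Context ready: " + context.getProtocol());
    }
}
```

Note that a context built this way trusts only the added root, not the JVM's default CAs. A store written to disk could instead be handed to the whole JVM through the standard `javax.net.ssl.trustStore` system property.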

handshake_failure #613