FcrbPeter closed 5 years ago
Here is the SSL Labs analysis of the website's SSL certificate: https://www.ssllabs.com/ssltest/analyze.html?viaform=on&d=news.rthk.hk
Try this: <trustAllSSLCertificates>false</trustAllSSLCertificates>
Trusting all certificates has a side effect: it disables the SNI feature, and some sites require the SNI extension to be present in the handshake. @essiembre, maybe it makes sense NOT to disable SNI when trusting all certs? I spent a lot of time debugging the same issue recently. Thanks!
Test with OpenSSL 0.9.8zh, without SNI:
openssl s_client -host news.rthk.hk -port 443
CONNECTED(00000003)
37845:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:/BuildRoot/Library/Caches/com.apple.xbs/Sources/OpenSSL098/OpenSSL098-64.50.7/src/ssl/s23_lib.c:185:
and with SNI:
openssl s_client -host news.rthk.hk -port 443 -servername news.rthk.hk
CONNECTED(00000003)
depth=2 /C=HK/O=Hongkong Post/CN=Hongkong Post Root CA 1
verify error:num=19:self signed certificate in certificate chain
verify return:0
---
Certificate chain
0 s:/C=HK/ST=Hong Kong/L=Hong Kong/O=Hong Kong SAR Government/OU=0002104861/OU=000000000000000000000000RTHK/OU=Hongkong Post e-Cert (Server)/OU=Radio Television Hong Kong/CN=*.rthk.hk
i:/C=HK/ST=Hong Kong/L=Hong Kong/O=Hongkong Post/CN=Hongkong Post e-Cert CA 1 - 15
1 s:/C=HK/ST=Hong Kong/L=Hong Kong/O=Hongkong Post/CN=Hongkong Post e-Cert CA 1 - 15
i:/C=HK/O=Hongkong Post/CN=Hongkong Post Root CA 1
2 s:/C=HK/O=Hongkong Post/CN=Hongkong Post Root CA 1
i:/C=HK/O=Hongkong Post/CN=Hongkong Post Root CA 1
---
Server certificate
-----BEGIN CERTIFICATE-----
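To repeat the "with SNI" test from the JVM instead of OpenSSL, something like the following may help. It is only a minimal sketch, assuming Java 8 or later; the names `SniCheck`, `withSni` and `handshake` are made up for illustration and are not part of Norconex. It sets the SNI server name explicitly on the socket before handshaking. Note that a handshake which gets past the SNI stage can still fail certificate validation if the JVM does not trust the chain.

```java
import java.io.IOException;
import java.util.Collections;
import javax.net.ssl.SNIHostName;
import javax.net.ssl.SNIServerName;
import javax.net.ssl.SSLParameters;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical helper, not part of Norconex: performs a TLS handshake
// with the SNI extension set explicitly (roughly what
// "openssl s_client -servername <host>" does).
class SniCheck {

    // Returns the given parameters with the SNI server name set to host.
    static SSLParameters withSni(SSLParameters params, String host) {
        SNIServerName name = new SNIHostName(host);
        params.setServerNames(Collections.singletonList(name));
        return params;
    }

    // Connects, handshakes with SNI, and returns the negotiated protocol.
    static String handshake(String host, int port) throws IOException {
        SSLSocketFactory factory =
                (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket(host, port)) {
            socket.setSSLParameters(withSni(socket.getSSLParameters(), host));
            socket.startHandshake(); // throws SSLHandshakeException on failure
            return socket.getSession().getProtocol();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java SniCheck <host>");
            return;
        }
        System.out.println("Handshake OK: " + handshake(args[0], 443));
    }
}
```

Run it as `java SniCheck news.rthk.hk`. A handshake failure is reported as an exception, so the stack trace shows whether the problem is the handshake itself or certificate validation.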
Thanks for replying!
I tried <trustAllSSLCertificates>false</trustAllSSLCertificates>, and it shows a different error.
➜ crawler ./collector-http.sh -a start -c config/config.xml
INFO [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=INCLUDE,extensions=html,pdf,doc,docx,xls,xlsx,ppt,pptx,xml,xml,rtf,caseSensitive=false]
INFO [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,png,gif,ico,mp3,mp4,avi,mkv,flv,caseSensitive=false]
INFO [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=^http://.*]
INFO [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=^https://.*]
INFO [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=.*rthk\.hk/.*]
INFO [AbstractCrawlerConfig] Reference filter loaded: RegexReferenceFilter[onMatch=INCLUDE,caseSensitive=false,regex=.*rthk\.org\.hk/.*]
INFO [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=http://app3.rthk.hk/search/google/start.php]
INFO [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=rthk.hk/]
INFO [AbstractCrawlerConfig] Document filter loaded: ContainsReferenceFilter[onMatch=INCLUDE,phase=rthk.org.hk/]
INFO [HttpCrawlerConfig] Link extractor loaded: GenericLinkExtractor[contentTypes={text/html,application/xhtml+xml,vnd.wap.xhtml+xml,x-asp},schemes={http,https,ftp},maxURLLength=2048,ignoreNofollow=false,commentsEnabled=false,tagAttribs=ObservableMap [map={a=[href], img=[src], meta=[http-equiv], iframe=[src], frame=[src]}],charset=UTF-8,extractBetweens=[],noExtractBetweens=[],extractSelectors=[],noExtractSelectors=[]]
INFO [AbstractCollectorConfig] Configuration loaded: id=MCGCS Web crawler; logsDir=./output/logs; progressDir=./output/progress
INFO [JobSuite] JEF work directory is: ./output/progress
INFO [JobSuite] JEF log manager is : FileLogManager
INFO [JobSuite] JEF job status store is : FileJobStatusStore
INFO [AbstractCollector] Suite of 1 crawler jobs created.
INFO [JobSuite] Initialization...
INFO [JobSuite] Previous execution detected.
INFO [JobSuite] Backing up previous execution status and log files.
INFO [JobSuite] Starting execution.
INFO [AbstractCollector] Version: Norconex HTTP Collector 2.9.0-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Collector Core 1.9.2-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Importer 2.9.1-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex JEF 4.1.2-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Committer Core 2.1.3-SNAPSHOT (Norconex Inc.)
INFO [JobSuite] Running Norconex Minimum Test Page: BEGIN (Fri Jun 14 11:16:28 HKT 2019)
INFO [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO [HttpCrawler] Norconex Minimum Test Page: Sitemap support: true
INFO [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO [HttpCrawler] Norconex Minimum Test Page: User-Agent: gsa-crawler
INFO [SitemapStore] Norconex Minimum Test Page: Initializing sitemap store...
INFO [SitemapStore] Norconex Minimum Test Page: Done initializing sitemap store.
WARN [StandardRobotsTxtProvider] Not able to obtain robots.txt at: https://news.rthk.hk/robots.txt
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1639)
at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:223)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037)
at sun.security.ssl.Handshaker.process_record(Handshaker.java:965)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1064)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider.getRobotsTxt(StandardRobotsTxtProvider.java:93)
at com.norconex.collector.http.pipeline.queue.HttpQueuePipeline.getRobotsTxt(HttpQueuePipeline.java:78)
at com.norconex.collector.http.pipeline.queue.RobotsTxtFiltersStage.findRejectingRobotsFilter(RobotsTxtFiltersStage.java:69)
at com.norconex.collector.http.pipeline.queue.RobotsTxtFiltersStage.executeStage(RobotsTxtFiltersStage.java:46)
at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:31)
at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:24)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
at com.norconex.collector.http.crawler.HttpCrawler.executeQueuePipeline(HttpCrawler.java:280)
at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLsRegular(HttpCrawler.java:156)
at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLs(HttpCrawler.java:140)
at com.norconex.collector.http.crawler.HttpCrawler.prepareExecution(HttpCrawler.java:131)
at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:216)
at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:184)
at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:353)
at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:292)
at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:165)
at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:150)
at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:74)
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:397)
at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:302)
at sun.security.validator.Validator.validate(Validator.java:262)
at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1621)
... 40 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:392)
... 46 more
ERROR [StandardSitemapResolver] Cannot fetch sitemap: https://news.rthk.hk/sitemap.xml (sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
ERROR [StandardSitemapResolver] Cannot fetch sitemap: https://news.rthk.hk/sitemap_index.xml (sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO [HttpCrawler] 1 start URLs identified.
INFO [CrawlerEventManager] CRAWLER_STARTED
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawling references...
INFO [GenericDocumentFetcher] Cannot fetch document: https://news.rthk.hk/ (sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO [CrawlerEventManager] REJECTED_ERROR: https://news.rthk.hk/ (com.norconex.collector.core.CollectorException: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO [AbstractCrawler] Norconex Minimum Test Page: Could not process document: https://news.rthk.hk/ (javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO [AbstractCrawler] Norconex Minimum Test Page: Deleting orphan references (if any)...
INFO [AbstractCrawler] Norconex Minimum Test Page: Deleted 0 orphan references...
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO [AbstractCrawler] Norconex Minimum Test Page: 1 reference(s) processed.
INFO [CrawlerEventManager] CRAWLER_FINISHED
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawler completed.
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawler executed in 0 second.
INFO [SitemapStore] Norconex Minimum Test Page: Closing sitemap store...
INFO [JobSuite] Running Norconex Minimum Test Page: END (Fri Jun 14 11:16:28 HKT 2019)
I searched for the "PKIX path building failed" error and found this: https://github.com/escline/InstallCert
After installing the cert, the collector works fine with <trustAllSSLCertificates>false</trustAllSSLCertificates>
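For anyone hitting the same "PKIX path building failed" error: it means the JVM could not build a chain from the server certificate to a CA it trusts. One way to check what the JVM currently trusts is sketched below. This is an illustrative sketch only, assuming Java 8 or later; `TrustedIssuerLookup` and `findIssuers` are hypothetical names, and the default search text "Hongkong Post" is simply taken from the certificate chain posted earlier in this thread.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper, not part of Norconex or InstallCert: lists the CA
// certificates in the JVM's default trust store whose subject contains
// the given text (case-insensitive).
class TrustedIssuerLookup {

    static List<String> findIssuers(String fragment) {
        List<String> matches = new ArrayList<>();
        String needle = fragment.toLowerCase(Locale.ROOT);
        try {
            TrustManagerFactory tmf = TrustManagerFactory.getInstance(
                    TrustManagerFactory.getDefaultAlgorithm());
            // A null KeyStore makes the factory use the default trust store.
            tmf.init((KeyStore) null);
            for (TrustManager tm : tmf.getTrustManagers()) {
                if (!(tm instanceof X509TrustManager)) {
                    continue;
                }
                X509TrustManager x509 = (X509TrustManager) tm;
                for (X509Certificate ca : x509.getAcceptedIssuers()) {
                    String dn = ca.getSubjectX500Principal().getName();
                    if (dn.toLowerCase(Locale.ROOT).contains(needle)) {
                        matches.add(dn);
                    }
                }
            }
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Cannot read trust store", e);
        }
        return matches;
    }

    public static void main(String[] args) {
        String fragment = args.length > 0 ? args[0] : "Hongkong Post";
        List<String> found = findIssuers(fragment);
        if (found.isEmpty()) {
            System.out.println("No trusted CA matches: " + fragment);
        } else {
            found.forEach(System.out::println);
        }
    }
}
```

If nothing matches before the certificate is installed and an entry shows up afterwards, the import went into the trust store the JVM is actually using. Make sure to run the check with the same JVM that runs the collector.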
Glad you found a solution. Thanks for sharing it.
@jetnet, there is a pull request (#577) to control whether SNI is enabled at the crawler level. Since it requires Java 8, it will be part of the next major release.
I am crawling a website over HTTPS, and it seems its SSL setup is not supported....
I am using Java version "1.8.0_202" and Norconex HTTP Collector 2.9.0-SNAPSHOT.
Below is the config.
And below is the log.