Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Working Examples in Getting Started Fail #625

Closed krcm0209 closed 3 years ago

krcm0209 commented 5 years ago

After I run the minimal test example, no crawledFiles directory is created. Judging by the output, the cause may be this exception: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target. The example is the one described at https://www.norconex.com/collectors/collector-http/getting-started#working-examples

user@PC:/mnt/c/norconex-collector-http-2.8.1$ ./collector-http.sh -a start -c examples/minimum/minimum-config.xml
Jul 17, 2019 2:06:50 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.tika.parser.ParseContext (file:/mnt/c/norconex-collector-http-2.8.1/lib/tika-core-1.16.jar) to method com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)
WARNING: Please consider reporting this to the maintainers of org.apache.tika.parser.ParseContext
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
INFO  [AbstractCollectorConfig] Configuration loaded: id=Minimum Config HTTP Collector; logsDir=./examples-output/minimum/logs; progressDir=./examples-output/minimum/progress
INFO  [JobSuite] JEF work directory is: ./examples-output/minimum/progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] Previous execution detected.
INFO  [JobSuite] Backing up previous execution status and log files.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.8.1 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.9.1 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.9.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.2 (Norconex Inc.)
INFO  [JobSuite] Running Norconex Minimum Test Page: BEGIN (Wed Jul 17 14:06:51 EDT 2019)
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Sitemap support: false
INFO  [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: User-Agent: <None specified>
INFO  [SitemapStore] Norconex Minimum Test Page: Initializing sitemap store...
INFO  [SitemapStore] Norconex Minimum Test Page: Done initializing sitemap store.
WARN  [StandardRobotsTxtProvider] Not able to obtain robots.txt at: https://www.norconex.com/robots.txt
javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)
        at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:321)
        at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:264)
        at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:259)
        at java.base/sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:642)
        at java.base/sun.security.ssl.CertificateMessage$T12CertificateConsumer.onCertificate(CertificateMessage.java:461)
        at java.base/sun.security.ssl.CertificateMessage$T12CertificateConsumer.consume(CertificateMessage.java:361)
        at java.base/sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:392)
        at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:448)
        at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:425)
        at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:178)
        at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:164)
        at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1152)
        at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1063)
        at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:402)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396)
        at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider.getRobotsTxt(StandardRobotsTxtProvider.java:93)
        at com.norconex.collector.http.pipeline.queue.HttpQueuePipeline.getRobotsTxt(HttpQueuePipeline.java:78)
        at com.norconex.collector.http.pipeline.queue.RobotsTxtFiltersStage.findRejectingRobotsFilter(RobotsTxtFiltersStage.java:69)
        at com.norconex.collector.http.pipeline.queue.RobotsTxtFiltersStage.executeStage(RobotsTxtFiltersStage.java:46)
        at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:31)
        at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeQueuePipeline(HttpCrawler.java:280)
        at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLsRegular(HttpCrawler.java:156)
        at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLs(HttpCrawler.java:140)
        at com.norconex.collector.http.crawler.HttpCrawler.prepareExecution(HttpCrawler.java:131)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:216)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:184)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:131)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:74)
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:385)
        at java.base/sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:290)
        at java.base/sun.security.validator.Validator.validate(Validator.java:264)
        at java.base/sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:321)
        at java.base/sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:221)
        at java.base/sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:129)
        at java.base/sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:626)
        ... 43 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at java.base/sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
        at java.base/sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
        at java.base/java.security.cert.CertPathBuilder.build(CertPathBuilder.java:297)
        at java.base/sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:380)
        ... 49 more
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawling references...
INFO  [GenericDocumentFetcher] Cannot fetch document: https://www.norconex.com/product/collector-http-test/minimum.php (PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO  [CrawlerEventManager]            REJECTED_ERROR: https://www.norconex.com/product/collector-http-test/minimum.php (com.norconex.collector.core.CollectorException: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO  [AbstractCrawler] Norconex Minimum Test Page: Could not process document: https://www.norconex.com/product/collector-http-test/minimum.php (javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)
INFO  [AbstractCrawler] Norconex Minimum Test Page: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO  [AbstractCrawler] Norconex Minimum Test Page: 1 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler completed.
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler executed in 0 second.
INFO  [SitemapStore] Norconex Minimum Test Page: Closing sitemap store...
INFO  [JobSuite] Running Norconex Minimum Test Page: END (Wed Jul 17 14:06:51 EDT 2019)
user@PC:/mnt/c/norconex-collector-http-2.8.1$ 

OS info:

NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

essiembre commented 5 years ago

Likely a duplicate. See solutions at https://github.com/Norconex/collector-http/issues/561#issuecomment-495981622 and https://github.com/Norconex/collector-http/issues/613#issuecomment-501959307

Please confirm.
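
For anyone hitting the same PKIX error, it may help to first check whether the failure comes from the JVM's trust store rather than from the collector itself. The following is only a minimal sketch and is not taken from the linked issues: the class name `TlsTrustCheck` and its helper methods are hypothetical, and it assumes the standard JSSE default socket factory is what the JVM uses for the handshake. It tries a plain TLS handshake against a host and reports whether the certificate chain was rejected.

```java
import java.io.IOException;
import javax.net.ssl.SSLHandshakeException;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical helper, not part of Norconex: checks whether the JVM's
// default trust store accepts a host's certificate chain.
class TlsTrustCheck {

    // Walks the cause chain looking for the "PKIX path building failed"
    // message seen in the log above.
    static boolean isPkixFailure(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String msg = c.getMessage();
            if (msg != null && msg.contains("PKIX path building failed")) {
                return true;
            }
        }
        return false;
    }

    // Attempts a TLS handshake using the JVM default SSLSocketFactory,
    // which relies on the default trust store (normally the JDK's cacerts).
    static String check(String host, int port) {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket(host, port)) {
            socket.startHandshake();
            return "TRUSTED";
        } catch (SSLHandshakeException e) {
            return isPkixFailure(e) ? "UNTRUSTED_CHAIN" : "HANDSHAKE_FAILED";
        } catch (IOException e) {
            return "IO_ERROR";
        }
    }

    public static void main(String[] args) {
        // Host defaults to the one from the failing example; pass another as args[0].
        String host = args.length > 0 ? args[0] : "www.norconex.com";
        System.out.println(host + ": " + check(host, 443));
    }
}
```

On Java 11 or later this can be run directly with `java TlsTrustCheck.java www.norconex.com`, using the same Java installation that launches the collector. If it reports `UNTRUSTED_CHAIN`, the problem most likely lies in that JVM's trusted certificates and not in the example configuration.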

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.