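The crawler fails when a reference does not start with strictly lowercase http and the encodeNonURICharacters normalization rule is enabled. It does not matter which other rules are enabled; the failure occurs even when this is the only one. For example, a link of that kind in a parsed HTML page, with the rule enabled in the configuration, causes the following error: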
CrawlerNo1: 2016-08-28 18:08:57 WARN - Not able to obtain robots.txt at: null://null:80/robots.txt
org.apache.http.conn.UnsupportedSchemeException: null protocol is not supported
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:108)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider.getRobotsTxt(StandardRobotsTxtProvider.java:87)
at com.norconex.collector.http.pipeline.importer.HttpImporterPipeline$DelayResolverStage.executeStage(HttpImporterPipeline.java:81)
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:335)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:503)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:390)
at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:771)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
CrawlerNo1: 2016-08-28 18:08:57 ERROR - Cannot fetch metadata: null://null:80 (null protocol is not supported)
CrawlerNo1: 2016-08-28 18:08:57 INFO - REJECTED_ERROR: null://null:80
CrawlerNo1: 2016-08-28 18:08:57 ERROR - CrawlerNo1: Could not process document: null://null:80 (org.apache.http.conn.UnsupportedSchemeException: null protocol is not supported)
com.norconex.collector.core.CollectorException: org.apache.http.conn.UnsupportedSchemeException: null protocol is not supported
at com.norconex.collector.http.fetch.impl.GenericMetadataFetcher.fetchHTTPHeaders(GenericMetadataFetcher.java:172)
at com.norconex.collector.http.pipeline.importer.MetadataFetcherStage.executeStage(MetadataFetcherStage.java:51)
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:335)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:503)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:390)
at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:771)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.http.conn.UnsupportedSchemeException: null protocol is not supported
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:108)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at com.norconex.collector.http.fetch.impl.GenericMetadataFetcher.fetchHTTPHeaders(GenericMetadataFetcher.java:135)
... 11 more
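One possible workaround, sketched under assumptions, is to lowercase the scheme of each reference before it reaches the URL normalizer. The class and method names below are hypothetical and not part of Norconex. The null://null:80 in the log suggests, but does not confirm, that the scheme and host stay unset when the reference does not begin with lowercase http. Where such a step would be wired into the crawler is left open.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Norconex class): lowercases only the scheme
// of a reference and leaves host, path, and query untouched.
final class SchemeLowerCaser {

    // RFC 3986 scheme syntax: a letter followed by letters, digits,
    // "+", "-", or ".", terminated by ":".
    private static final Pattern SCHEME =
            Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*):");

    private SchemeLowerCaser() {
    }

    static String lowerCaseScheme(String reference) {
        if (reference == null) {
            return null;
        }
        Matcher m = SCHEME.matcher(reference);
        if (!m.find()) {
            // No scheme (for example a relative link): return unchanged.
            return reference;
        }
        String scheme = m.group(1).toLowerCase(Locale.ROOT);
        // m.end(1) points at the ":" that follows the scheme.
        return scheme + reference.substring(m.end(1));
    }
}
```

Calling SchemeLowerCaser.lowerCaseScheme on a reference such as HTTP://Example.com/a b changes only the scheme, so other rules such as encodeNonURICharacters still see the rest of the reference as it was.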