Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Program crashes when the encodeNonURICharacters rule is enabled #294

Closed by xnerus 7 years ago

xnerus commented 8 years ago

The program crashes when a reference does not start with a strictly lowercase http and the encodeNonURICharacters rule is enabled. It does not matter which other rules are activated; the crash occurs even if this is the only one. For example, this link from a parsed HTML page:

<a href="HTTP://example.com/">Link 1</a>

with the mentioned rule enabled in the configuration:

    <urlNormalizer class="$urlNormalizer">
        <normalizations>
            encodeNonURICharacters
        </normalizations>
    </urlNormalizer>

causes the following error:

CrawlerNo1: 2016-08-28 18:08:57 WARN - Not able to obtain robots.txt at: null://null:80/robots.txt
org.apache.http.conn.UnsupportedSchemeException: null protocol is not supported
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:108)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider.getRobotsTxt(StandardRobotsTxtProvider.java:87)
    at com.norconex.collector.http.pipeline.importer.HttpImporterPipeline$DelayResolverStage.executeStage(HttpImporterPipeline.java:81)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:335)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:503)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:390)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:771)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
CrawlerNo1: 2016-08-28 18:08:57 ERROR - Cannot fetch metadata: null://null:80 (null protocol is not supported)
CrawlerNo1: 2016-08-28 18:08:57 INFO -            REJECTED_ERROR: null://null:80
CrawlerNo1: 2016-08-28 18:08:57 ERROR - CrawlerNo1: Could not process document: null://null:80 (org.apache.http.conn.UnsupportedSchemeException: null protocol is not supported)
com.norconex.collector.core.CollectorException: org.apache.http.conn.UnsupportedSchemeException: null protocol is not supported
    at com.norconex.collector.http.fetch.impl.GenericMetadataFetcher.fetchHTTPHeaders(GenericMetadataFetcher.java:172)
    at com.norconex.collector.http.pipeline.importer.MetadataFetcherStage.executeStage(MetadataFetcherStage.java:51)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:335)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:503)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:390)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:771)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.http.conn.UnsupportedSchemeException: null protocol is not supported
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:108)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at com.norconex.collector.http.fetch.impl.GenericMetadataFetcher.fetchHTTPHeaders(GenericMetadataFetcher.java:135)
    ... 11 more

essiembre commented 8 years ago

The latest snapshot fixes this. Please confirm.

essiembre commented 7 years ago

Fix is now in 2.6.1.
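
For setups that cannot upgrade right away, one possible workaround is to lowercase the URL scheme before the reference reaches the normalizer, since RFC 3986 treats schemes as case-insensitive. The sketch below is only an illustration under that assumption: the class `SchemeCase` and the method `lowercaseScheme` are hypothetical names, not part of the Norconex API, and this is not necessarily how the 2.6.1 fix works (the thread does not describe the change).

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex: lowercases only the scheme of a URL.
class SchemeCase {

    // RFC 3986 scheme: a letter followed by letters, digits, "+", "-" or ".",
    // terminated by ":". Everything after the scheme is captured untouched.
    private static final Pattern SCHEME =
            Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*)(:.*)$", Pattern.DOTALL);

    static String lowercaseScheme(String url) {
        Matcher m = SCHEME.matcher(url);
        if (!m.matches()) {
            // No scheme found (e.g. a relative link): return the input unchanged.
            return url;
        }
        return m.group(1).toLowerCase(Locale.ROOT) + m.group(2);
    }

    public static void main(String[] args) {
        // The link from the report above.
        System.out.println(lowercaseScheme("HTTP://example.com/"));
    }
}
```

Only the scheme is touched; the host and path keep their original case, so any other normalization rules still see the rest of the reference as it appeared in the page.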