Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Crawling from some URLs is not possible #470

Closed: evaso closed this issue 6 years ago

evaso commented 6 years ago

With the configuration below, crawling most URLs works just fine. But for a few common URLs the crawler unexpectedly fails with the following error message (the real URL is intentionally changed):

ERROR - Could not extract links from:  some-url.com
      java.io.UnsupportedEncodingException: IBM420_ltr
    at java.base/sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:71)
    at java.base/java.io.InputStreamReader.<init>(InputStreamReader.java:100)
    at com.norconex.collector.http.url.impl.GenericLinkExtractor.extractLinks(GenericLinkExtractor.java:337)
    at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:73)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
    at java.base/java.lang.Thread.run(Thread.java:844)
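
As far as I can tell from the trace, the exception originates in the JDK itself, not in the collector: constructing an InputStreamReader with a charset name the runtime does not know fails in exactly this way. A minimal sketch of the failure (assuming IBM420_ltr is not registered in this JRE):

import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

public class UnsupportedEncodingRepro {
    public static void main(String[] args) throws Exception {
        byte[] page = "<html><body>hello</body></html>".getBytes("UTF-8");
        // GenericLinkExtractor wraps the downloaded content in a reader using the
        // detected charset; when the JRE has no charset registered under that
        // name, this constructor throws java.io.UnsupportedEncodingException.
        new InputStreamReader(new ByteArrayInputStream(page), "IBM420_ltr");
    }
}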

In this case, the crawler finishes very quickly:

 INFO -            URLS_EXTRACTED: http://some-url.com/
 INFO -       REJECTED_UNMODIFIED: http://some-url.com/
 INFO - some-url.com: Deleting orphan references (if any)...
 INFO - some-url.com: Deleted 0 orphan references...
 INFO - some-url.com: Crawler finishing: committing documents.
 INFO - Elasticsearch RestClient closed.
 INFO - some-url.com: 1 reference(s) processed.
 INFO -          CRAWLER_FINISHED
 INFO - some-url.com: Crawler completed.
 INFO - some-url.com: Crawler executed in 0 second.
 INFO - some-url.com: Closing sitemap store...
 INFO - Running some-url.com: END (Wed Feb 28 09:27:18 CET 2018)

This is the crawler configuration I use:

<httpcollector id="Company-Crawler">
  #set($workdir = "/opt/norconex/http-collector/test-output")
  #set($core      = "com.norconex.collector.core")
  #set($http      = "com.norconex.collector.http")
  #set($committer = "com.norconex.committer")

  #set($httpClientFactory = "${http}.client.impl.GenericHttpClientFactory")
  #set($filterExtension   = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef    = "${core}.filter.impl.RegexReferenceFilter")
  #set($filterRegexMeta   = "${core}.filter.impl.RegexMetadataFilter")
  #set($robotsTxt         = "${http}.robot.impl.StandardRobotsTxtProvider")
  #set($robotsMeta        = "${http}.robot.impl.StandardRobotsMetaProvider")
  #set($urlNormalizer     = "${http}.url.impl.GenericURLNormalizer")
  #set($esCommitter       = "${committer}.elasticsearch.ElasticsearchCommitter")
  #set($nodesUrl          = "http://192.168.215.213:9200")

  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>

  <crawlerDefaults>
    <userAgent>Norconex Collector-Http</userAgent>
    <urlNormalizer class="$urlNormalizer" />
    <numThreads>3</numThreads>
    <maxDepth>5</maxDepth>
    <workDir>$workdir</workDir>
    <keepDownloads>false</keepDownloads>
    <orphansStrategy>DELETE</orphansStrategy>
    <robotsTxt ignore="true" />
    <robotsMeta ignore="true" />
    <sitemapResolverFactory ignore="true" />

    <referenceFilters>
        <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,jpeg,bmp,pdf,ico,css,js,sit,eps,wmf,ppt,mpg,mp4,xls,rpm,mov,exe</filter> 
    </referenceFilters>
  </crawlerDefaults>

 <crawlers>
    <crawler id="some-url.com">
        <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
            <url>http://some-url.com</url>
        </startURLs>
        <committer class="$esCommitter">
            <nodes>$nodesUrl</nodes>
            <discoverNodes>true</discoverNodes>
            <indexName>suchindex</indexName>
            <typeName>webdoc</typeName>
            <targetContentField>body</targetContentField>
            <queueSize>100</queueSize>
            <commitBatchSize>500</commitBatchSize>
            <ignoreResponseErrors>true</ignoreResponseErrors>
        </committer>
    </crawler> 
  </crawlers>
</httpcollector>

I use the latest version of the crawler, and my Java version is: Java(TM) SE Runtime Environment (build 9.0.4+11).

Please advise me on how to configure the crawler to fix this error.

Thank you.

essiembre commented 6 years ago

Without a URL I can't reproduce this, but from the error it seems the detected encoding (IBM420_ltr) is not supported by your JRE.
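
If you want to confirm it, a quick check (just a sketch using the standard java.nio.charset API) is to ask your JRE whether it recognizes the charset name from the stack trace:

import java.nio.charset.Charset;

public class CharsetSupportCheck {
    public static void main(String[] args) {
        // Name reported in the stack trace; the syntax is legal, so isSupported()
        // simply returns false when no installed provider registers it.
        System.out.println("IBM420_ltr supported: " + Charset.isSupported("IBM420_ltr"));
        // UTF-8 is guaranteed to be available on every Java platform.
        System.out.println("UTF-8 supported: " + Charset.isSupported("UTF-8"));
    }
}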

Short of changing the encoding of the page causing the issue, you can force the link extractor to read links using a different encoding. For example, you can add the following under your crawler configuration to use UTF-8:

<linkExtractors>
 <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor" charset="UTF-8"/>
</linkExtractors>

evaso commented 6 years ago

That was exactly the problem. Your advice fixed it. Thanks a lot, Pascal :+1: