Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Crawling from some URLs is not possible #470

Closed: evaso closed this issue 6 years ago

evaso commented 6 years ago

With the configuration below, crawling most URLs works just fine. But for a few common URLs the crawler unexpectedly fails with the following error message (the real URL is intentionally changed):

ERROR - Could not extract links from:  some-url.com
      java.io.UnsupportedEncodingException: IBM420_ltr
    at java.base/sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:71)
    at java.base/java.io.InputStreamReader.<init>(InputStreamReader.java:100)
    at com.norconex.collector.http.url.impl.GenericLinkExtractor.extractLinks(GenericLinkExtractor.java:337)
    at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:73)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
    at java.base/java.lang.Thread.run(Thread.java:844)
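
As far as I can tell from the trace, the exception originates in the JDK itself, not in the collector: constructing an InputStreamReader with a charset name the runtime does not know fails in exactly this way. A minimal sketch of the failure (assuming IBM420_ltr is not registered in this JRE):

import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

public class UnsupportedEncodingRepro {
    public static void main(String[] args) throws Exception {
        byte[] page = "<html><body>hello</body></html>".getBytes("UTF-8");
        // GenericLinkExtractor wraps the downloaded content in a reader using the
        // detected charset; when the JRE has no charset registered under that
        // name, this constructor throws java.io.UnsupportedEncodingException.
        new InputStreamReader(new ByteArrayInputStream(page), "IBM420_ltr");
    }
}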

In this case, the crawler finishes very quickly:

 INFO -            URLS_EXTRACTED: http://some-url.com/
 INFO -       REJECTED_UNMODIFIED: http://some-url.com/
 INFO - some-url.com: Deleting orphan references (if any)...
 INFO - some-url.com: Deleted 0 orphan references...
 INFO - some-url.com: Crawler finishing: committing documents.
 INFO - Elasticsearch RestClient closed.
 INFO - some-url.com: 1 reference(s) processed.
 INFO -          CRAWLER_FINISHED
 INFO - some-url.com: Crawler completed.
 INFO - some-url.com: Crawler executed in 0 second.
 INFO - some-url.com: Closing sitemap store...
 INFO - Running some-url.com: END (Wed Feb 28 09:27:18 CET 2018)

This is the crawler configuration I use:

<httpcollector id="Company-Crawler">
  #set($workdir = "/opt/norconex/http-collector/test-output")
  #set($core      = "com.norconex.collector.core")
  #set($http      = "com.norconex.collector.http")
  #set($committer = "com.norconex.committer")

  #set($httpClientFactory = "${http}.client.impl.GenericHttpClientFactory")
  #set($filterExtension   = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef    = "${core}.filter.impl.RegexReferenceFilter")
  #set($filterRegexMeta   = "${core}.filter.impl.RegexMetadataFilter")
  #set($robotsTxt         = "${http}.robot.impl.StandardRobotsTxtProvider")
  #set($robotsMeta        = "${http}.robot.impl.StandardRobotsMetaProvider")
  #set($urlNormalizer     = "${http}.url.impl.GenericURLNormalizer")
  #set($esCommitter       = "${committer}.elasticsearch.ElasticsearchCommitter")
  #set($nodesUrl          = "http://192.168.215.213:9200")

  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>

  <crawlerDefaults>
    <userAgent>Norconex Collector-Http</userAgent>
    <urlNormalizer class="$urlNormalizer" />
    <numThreads>3</numThreads>
    <maxDepth>5</maxDepth>
    <workDir>$workdir</workDir>
    <keepDownloads>false</keepDownloads>
    <orphansStrategy>DELETE</orphansStrategy>
    <robotsTxt ignore="true" />
    <robotsMeta ignore="true" />
    <sitemapResolverFactory ignore="true" />

    <referenceFilters>
        <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,jpeg,bmp,pdf,ico,css,js,sit,eps,wmf,ppt,mpg,mp4,xls,rpm,mov,exe</filter> 
    </referenceFilters>
  </crawlerDefaults>

 <crawlers>
    <crawler id="some-url.com">
        <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
            <url>http://some-url.com</url>
        </startURLs>
        <committer class="$esCommitter">
            <nodes>$nodesUrl</nodes>
            <discoverNodes>true</discoverNodes>
            <indexName>suchindex</indexName>
            <typeName>webdoc</typeName>
            <targetContentField>body</targetContentField>
            <queueSize>100</queueSize>
            <commitBatchSize>500</commitBatchSize>
            <ignoreResponseErrors>true</ignoreResponseErrors>
        </committer>
    </crawler> 
  </crawlers>
</httpcollector>

I use the latest version of the crawler, and my Java version is: Java(TM) SE Runtime Environment (build 9.0.4+11).

Please advise me on how to configure the crawler to fix this error.

Thank you.

essiembre commented 6 years ago

Without a URL I can't reproduce this, but from the error it seems the detected encoding (IBM420_ltr) is not supported by your JRE.
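
If you want to confirm it, a quick check (just a sketch using the standard java.nio.charset API) is to ask your JRE whether it recognizes the charset name from the stack trace:

import java.nio.charset.Charset;

public class CharsetSupportCheck {
    public static void main(String[] args) {
        // Name reported in the stack trace; the syntax is legal, so isSupported()
        // simply returns false when no installed provider registers it.
        System.out.println("IBM420_ltr supported: " + Charset.isSupported("IBM420_ltr"));
        // UTF-8 is guaranteed to be available on every Java platform.
        System.out.println("UTF-8 supported: " + Charset.isSupported("UTF-8"));
    }
}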

Short of changing the encoding of the page causing the issue, you can force the link extractor to read links using a different encoding. For example, you can add the following under your crawler configuration to use UTF-8:

<linkExtractors>
 <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor" charset="UTF-8"/>
</linkExtractors>

evaso commented 6 years ago

That was exactly the problem. Your advice fixed it. Thanks a lot, Pascal :+1: