evaso closed this issue 6 years ago
Without a URL I can't reproduce this, but from the error it seems that the detected encoding (IBM420_ltr) is not supported by your JRE.
Short of changing the encoding of the page causing the issue, you can try to force the link extractor to read links as a different encoding. For example, you can try adding the following under your crawler configuration to use UTF-8:
<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor" charset="UTF-8"/>
</linkExtractors>
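To confirm the diagnosis before changing the configuration, a small standalone check along these lines may help. It is only a sketch and not part of the crawler: the class name `CharsetCheck` and the helper `isCharsetSupported` are made up for illustration, and it relies solely on the standard `java.nio.charset.Charset.isSupported` call. Run it with the same JRE that runs the crawler.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper class, not part of the crawler.
public class CharsetCheck {

    // Returns true if the running JRE can provide a charset with this name.
    // A syntactically illegal name is treated as "not supported".
    static boolean isCharsetSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the name from the error plus the suggested fallback.
        String[] names = args.length > 0
                ? args
                : new String[] {"IBM420_ltr", "UTF-8"};
        for (String name : names) {
            System.out.println(name + " supported: " + isCharsetSupported(name));
        }
    }
}
```

If the detected name is reported as unsupported while UTF-8 is supported, forcing the charset on the link extractor as shown above is a reasonable workaround.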
That was exactly the problem. Your advice fixed it. Thanks a lot, Pascal :+1:
Crawling some URLs with the configuration below works just fine. But for a few common URLs, the crawler unexpectedly fails with the following error message (the real URL has been intentionally changed):
In this case the crawler ends very quickly.
This is the crawler configuration I use:
I use the latest version of the crawler, and my Java version is: Java(TM) SE Runtime Environment (build 9.0.4+11)
Please give me some advice on how to configure the crawler to fix this error.
Thank you.