Yes, there are a few places where you can rely on detection by the crawler instead of relying on the HTTP response. The earliest place I would recommend is right when you fetch the document. Have a look at GenericDocumentFetcher and set its detectCharset attribute to true (this applies to your HTTP Collector crawler config).
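In the crawler config, that would look roughly like this (a minimal sketch, assuming the usual 2.x class path for GenericDocumentFetcher):

```xml
<!-- Sketch: let the fetcher detect the charset from the content
     instead of trusting only the HTTP response header. -->
<documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
    detectCharset="true" />
```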
If you want more control and want to do an explicit conversion from one charset to another for some documents only, I invite you to look at the Importer CharsetTagger and CharsetTransformer.
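For instance, a pre-parse conversion could look roughly like this sketch (the restrictTo filter and attribute names are assumptions; verify them against the Importer documentation for your version):

```xml
<importer>
  <preParseHandlers>
    <!-- Sketch: convert document content from ISO-8859-1 to UTF-8. -->
    <transformer class="com.norconex.importer.handler.transformer.impl.CharsetTransformer"
        sourceCharset="ISO-8859-1" targetCharset="UTF-8">
      <!-- Optionally restrict the conversion to matching documents only. -->
      <restrictTo field="document.reference">https://example\.com/.*</restrictTo>
    </transformer>
  </preParseHandlers>
</importer>
```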
Thanks for the hint with detectCharset, I was not aware of that feature.
I tried it, but got the same error as above.
Can you share your config? There is something strange here, as ISO-8859-1 is part of the Java standard charsets, so it should be supported: https://docs.oracle.com/javase/8/docs/api/java/nio/charset/StandardCharsets.html
uploaded: transfer.sh/q1lZb/zt.xml
I was able to reproduce with your config and found the cause. I misread your detected charset. The page appears to be UTF-8, but the server returns Content-Type: text/html; charset=ISO-8559-1.
I first thought that was the ISO Latin 1 standard charset, which is supported by Java, but that one is actually ISO-8859-1 (close, but two 8s instead of two 5s).
If you know the site owner, you may want to suggest fixing the charset in the HTTP response header.
In any case, I just created a new snapshot version of the HTTP Collector which will log a warning instead of rejecting the document. If you do not like the warnings, you can change the log level to ERROR in log4j.properties for com.norconex.collector.http.pipeline.importer.MetadataFetcherStage and com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.
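For example, something along these lines in log4j.properties should do it (log4j 1.x syntax):

```properties
# Only log ERROR and above for the fetcher stages,
# which silences the unsupported-charset warnings.
log4j.logger.com.norconex.collector.http.pipeline.importer.MetadataFetcherStage=ERROR
log4j.logger.com.norconex.collector.http.pipeline.importer.DocumentFetcherStage=ERROR
```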
Please confirm this new version works for you.
Thank you! I'm testing - the crawler does not stop anymore, but it does not extract links from the root page. I'm looking into this. Will let you know how it goes. Thanks again!
The new snapshot has no problems with other websites, but it does not extract the links from the root page of this particular website. There is no URLS_EXTRACTED crawl event at all.
The root page of the website contains some malformed <a ..> tags, but that should not be a problem. I tested the crawler with similar broken tags without issues.
So the only difference I see here is the warning:
WARN [DocumentFetcherStage] Unsupported character encoding "ISO-8559-1" defined in "Content-Type" HTTP response header.
Enabling debug logging for all events does not give any hints. The crawler just does not extract links from the root page.
Could you please take a look when you have time?
I uploaded a simplified config: transfer.sh/cfraT/zt-simple.xml
I get a Not Found error when trying to download your file. Can you please attach it here instead?
uploaded again: 0x0.st/ipfR.xml
Try the latest snapshot. The issue here was that charset AND content-type detection were both failing, because they come from the same HTTP header that has the wrong charset. By default, the URL extractor expects text/html and a few equivalents. I updated it so that auto-detection is performed on the content when parsing from the HTTP headers fails. I tested with your page and it worked. Please confirm.
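For reference, if you ever need to widen the content types the link extractor accepts, a fragment along these lines should work (a sketch; check the element names against the GenericLinkExtractor documentation for your version):

```xml
<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
    <!-- Sketch: content types on which link extraction is attempted. -->
    <contentTypes>text/html, application/xhtml+xml</contentTypes>
  </extractor>
</linkExtractors>
```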
Great support again! Thank you very much!
Hello Pascal,
we have to crawl many web servers, and some of them provide incorrect encoding data, e.g.:
The Importer module cannot parse it:
Or it could be vice versa: the HTTP header sets the correct encoding, but the page itself declares a wrong one.
I was able to work around one case by removing the wrong header during the pre-parse stage, but it was rather a hack. I'm wondering if you could suggest a proper way to handle this.
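The hack was essentially dropping the offending header field before parsing, roughly like this sketch (using a DeleteTagger here is just one way it could be done, and the exact field name may differ):

```xml
<importer>
  <preParseHandlers>
    <!-- Sketch: drop the wrong Content-Type field carried over
         from the HTTP response before the document is parsed. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
      <fields>Content-Type</fields>
    </tagger>
  </preParseHandlers>
</importer>
```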
Thank you!