Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

https://opensource.norconex.com/crawlers

Apache License 2.0

Page encoding #192

Closed bruce-genhot closed 8 years ago

bruce-genhot commented 8 years ago

The HTTP Collector works well for pages encoded in UTF-8, but for pages with other charsets, such as 'gb2312', the results come out garbled. Could you please let me know how I can resolve this problem? I'd like to convert all content to UTF-8. Thank you.
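
For illustration, the symptom can be reproduced outside the collector. The sketch below is not Norconex code; the sample text and the `to_utf8` helper are hypothetical. It only shows that GB2312 bytes decoded as UTF-8 come out garbled, while decoding with the page's real charset and re-encoding yields clean UTF-8.

```python
# Illustration only: this is not Norconex code.
# The sample text and the to_utf8 helper are hypothetical.

def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Decode bytes with the page's real charset, then re-encode as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


text = "中文网页"               # sample page text
raw = text.encode("gb2312")    # bytes as a GB2312 page would deliver them

# Decoding with the wrong charset produces replacement characters.
garbled = raw.decode("utf-8", errors="replace")

# Decoding with the correct charset first gives valid UTF-8.
converted = to_utf8(raw, "gb2312")

print(garbled)
print(converted.decode("utf-8"))
```

The conversion only works when the source charset is known or detected correctly, which is the part that appears to be going wrong here.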

niels commented 8 years ago

@bruce-genhot, I'm having a similar issue. Could you let us know how you fixed your problem? Thanks!

essiembre commented 8 years ago

Should this ticket be re-opened, or is it the same as #199? If re-opening, please provide details on how to reproduce.

niels commented 8 years ago

@essiembre, I opened #202 after some more investigation. I suspect that #199 will have the same root cause, so please feel free to close that in favour of #202.

essiembre commented 8 years ago

Got it. I'll keep both #199 and #202 until I know for sure whether they have the same cause.

bruce-genhot commented 8 years ago

@niels, see #194 for more details. One potential solution is to use TextBetweenTagger; see my configuration below.

<tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger"
        inclusive="false" caseSensitive="false">
  <textBetween name="html">
    <start><![CDATA[<html]]></start>
    <end><![CDATA[</html>]]></end>
  </textBetween>
</tagger>