Closed: bruce-genhot closed this issue 8 years ago
@bruce-genhot, I'm having a similar issue. Could you let us know how you fixed your problem? Thanks!
Should this ticket be re-opened, or is it the same as #199? If re-opening, please provide details on how to reproduce.
@essiembre, I opened #202 after some more investigation. I suspect that #199 will have the same root cause, so please feel free to close that in favour of #202.
Got it. I'll keep both #199 and #202 until I know for sure whether they have the same cause.
@niels, you may want to look at #194 for more details. One potential solution is to use TextBetweenTagger; see my configuration below.
<tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger" inclusive="false" caseSensitive="false">
  <textBetween name="html">
    <start><![CDATA[<html]]></start>
    <end><![CDATA[</html>]]></end>
  </textBetween>
</tagger>
Hi
The HTTP Collector works well for pages encoded in UTF-8, but for pages that use another charset, such as 'gb2312', the results come out as garbled characters. Could you please let me know how I can resolve this problem? I'd like to convert all content to UTF-8. Thank you.
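For readers unfamiliar with the conversion being asked about, here is a minimal sketch in plain Java of transcoding bytes from a known source charset to UTF-8. It is only an illustration of the idea, not the collector's own mechanism; the class name `CharsetSketch` and the method `toUtf8` are hypothetical, and the sketch assumes the source charset is already known and available in the running JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetSketch {

    // Hypothetical helper: decode raw bytes with the declared source
    // charset, then re-encode the resulting text as UTF-8.
    static byte[] toUtf8(byte[] raw, String sourceCharset) {
        String text = new String(raw, Charset.forName(sourceCharset));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file
        // itself stays ASCII-only.
        String sample = "\u4e2d\u6587";

        // Simulate a page body that arrived encoded as GB2312.
        byte[] gb2312Bytes = sample.getBytes(Charset.forName("GB2312"));

        byte[] utf8Bytes = toUtf8(gb2312Bytes, "GB2312");

        // Decoding the converted bytes as UTF-8 recovers the original text.
        String roundTrip = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(sample));
    }
}
```

Garbled output usually appears when the bytes are decoded with a charset other than the one they were written in, so the key input to a conversion like this is a correct source charset.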