Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

HTML parser - commented out encoding meta-tag #223

Closed: jetnet closed this issue 8 years ago

jetnet commented 8 years ago

Hi Pascal, you won't believe it, but I just found another encoding issue :smile:

Source code:

   <!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> -->
   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The parser does not decode the content correctly: it uses UTF-8 for HTML entities but ISO-8859-1 (the charset from the commented-out tag) for other non-Latin characters. If you remove the comment, parsing works. Example: bankamiz.de/tr/tr_index.html

P.S. As a workaround, I use the StripBetweenTransformer in the pre-parser chain to remove these comments from the source code.
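
To illustrate the problem and the workaround, here is a minimal Python sketch. It is not Norconex or Tika code: the helper names `sniff_charset` and `strip_comments` and the sample page are hypothetical, and the regex-based detection is only an assumption about how a naive meta-tag scan might behave. It shows that a scan which does not skip HTML comments picks up the commented-out charset, while stripping comments first yields the intended one.

```python
import re

# Hypothetical helpers for illustration only; not Norconex or Tika code.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)
HTML_COMMENT = re.compile(rb"<!--.*?-->", re.DOTALL)


def sniff_charset(raw: bytes):
    """Return the first charset declared in a <meta> tag, or None."""
    match = META_CHARSET.search(raw)
    return match.group(1).decode("ascii").lower() if match else None


def strip_comments(raw: bytes) -> bytes:
    """Remove HTML comments, including any meta tags inside them."""
    return HTML_COMMENT.sub(b"", raw)


# Sample page: the two meta tags from the report plus UTF-8 encoded text.
page = (
    b'<!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> -->\n'
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n'
    + "<p>Bankamız</p>".encode("utf-8")
)

naive = sniff_charset(page)                  # sees the commented-out tag first
fixed = sniff_charset(strip_comments(page))  # sees only the live tag

print(naive)  # iso-8859-1
print(fixed)  # utf-8

# Decoding UTF-8 bytes with the wrong charset garbles non-Latin characters.
print("BankamÄ±z" in page.decode(naive))  # True
print("Bankamız" in page.decode(fixed))   # True
```

Stripping comments before detection mirrors what the workaround does in the pre-parser chain: the misleading declaration never reaches the charset detector.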

jetnet commented 8 years ago

I just tested the page with the standalone Tika parser and got the same issue. So it's not a Norconex issue, but a Tika one. Please feel free to delete or close the issue here. Thanks!

essiembre commented 8 years ago

Let's leave it open in case I can apply a fix regardless.

essiembre commented 8 years ago

I have implemented a fix in the Importer module and I made a new HTTP Collector snapshot with it.

Please confirm it solves your problem (it did in my tests).

FYI, I also opened a ticket with Apache Tika here: https://issues.apache.org/jira/browse/TIKA-1837. I will replace my solution with theirs once they release it.

jetnet commented 8 years ago

It works! Great support as usual :clap: And thanks for opening a Tika issue for that; it definitely makes sense to fix the external library.

tballison commented 8 years ago

Fixed in Tika. Thank you for letting us know about this. Sorry for the delay!

essiembre commented 8 years ago

@tballison: Thank you!