jetnet closed this issue 8 years ago
I just tested the page with the standalone Tika parser and got the same issue. So it's not a Norconex issue but a Tika one. Please feel free to delete / close the issue from here. Thanks!
Let's leave it open in case I can apply a fix regardless.
I have implemented a fix in the Importer module and I made a new HTTP Collector snapshot with it.
Please confirm it solves your problem (it did in my tests).
FYI, I also opened a ticket with Apache Tika here: https://issues.apache.org/jira/browse/TIKA-1837. I will replace my solution with theirs once they release it.
It works! Great support as usual :clap: And thanks for opening a Tika issue for that, it definitely makes sense to fix the external library.
Fixed in Tika. Thank you for letting us know about this. Sorry for the delay!
@tballison: Thank you!
Hi Pascal, you won't believe it, but I just found another encoding issue :smile:
Source code:
The parser cannot detect the content encoding correctly (UTF-8 is used for HTML entities and ISO-8859-1 for the other non-Latin characters). If you remove the comment, the parsing works. Example:
bankamiz.de/tr/tr_index.html
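For readers unfamiliar with what a wrongly detected charset does to the text, here is a minimal sketch using only the Java standard library. It is not Tika's detection logic; the class name `CharsetMismatchDemo` and the sample character are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what happens when bytes written in one
// charset are decoded with another. Not related to Tika's internals.
public class CharsetMismatchDemo {

    // Decode raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "\u00fc"; // the letter u with diaeresis

        // UTF-8 bytes (0xC3 0xBC) read as ISO-8859-1 become two characters.
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8Bytes, StandardCharsets.ISO_8859_1));

        // An ISO-8859-1 byte (0xFC) read as UTF-8 is malformed input and is
        // replaced by the Unicode replacement character U+FFFD.
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(latin1Bytes, StandardCharsets.UTF_8));
    }
}
```

Either direction of mismatch corrupts the non-ASCII characters, which is why a detector that is misled by part of the page produces garbled output for the rest.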
P.S. As a workaround I use the StripBetweenTransformer in the pre-parser chain to remove these comments from the source code.
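In case it helps anyone hitting the same problem outside the collector, the idea behind the workaround above can be sketched in plain Java. This is not the Norconex StripBetweenTransformer implementation, only a rough stand-in that assumes the offending content sits in regular HTML comments; the class name `CommentStripper` is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes HTML comments from markup before it is
// handed to a parser. A regex is a simplification and does not handle
// edge cases such as comment-like text inside <script> blocks.
public class CommentStripper {

    // DOTALL lets a comment span several lines; the reluctant quantifier
    // stops at the first closing "-->" so each comment is removed on its own.
    private static final Pattern HTML_COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    public static String stripComments(String html) {
        return HTML_COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>before</p><!-- a\ncomment --><p>after</p>";
        System.out.println(stripComments(html)); // prints <p>before</p><p>after</p>
    }
}
```

The stripping has to happen before charset detection and parsing, which is why the transformer goes in the pre-parser chain.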