danizen closed 7 years ago
Except for some fixes and a very few content types not yet supported by Tika, the Importer relies on Tika by default to parse all content types. You can find the default parsing-related configuration options by looking at GenericDocumentParserFactory. For most use cases, though, if you are interested in the text content, you should be fine with the defaults. Tika is used only for parsing and extracting raw text (and it works well with HTML). Content manipulation is done separately with the Importer handlers (taggers, splitters and transformers).
For instance, if you want to skip certain parts of a page (like stripping side navigation, headers, footers, etc.), you can have a look at available handlers that can help you with that, such as StripBetweenTransformer, ReplaceTransformer or other transformers (listed here).
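To illustrate the general idea behind stripping a region between two markers, here is a minimal, self-contained sketch in plain Java. It is not the Importer's StripBetweenTransformer and does not use any Norconex API; the class name `StripBetweenSketch`, the method `stripBetween`, and the `<!-- nav -->` markers are hypothetical and chosen only for the example.

```java
import java.util.regex.Pattern;

public class StripBetweenSketch {

    // Removes every span that begins with `start` and ends with `end`
    // (both markers included). Matching is case-insensitive and may
    // cross line breaks; the shortest possible span is removed each time.
    static String stripBetween(String text, String start, String end) {
        Pattern p = Pattern.compile(
                Pattern.quote(start) + ".*?" + Pattern.quote(end),
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        return p.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical page where the navigation is wrapped in comment markers.
        String html = "<body><!-- nav --><a href=\"#main\">Skip navigation</a><!-- /nav -->"
                + "<p>Actual article text.</p></body>";
        System.out.println(stripBetween(html, "<!-- nav -->", "<!-- /nav -->"));
    }
}
```

In a real setup you would configure the corresponding Importer handler rather than write this yourself; the sketch only shows why text such as "Skip navigation" disappears once the surrounding region is removed.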
Thanks.
http://www.norconex.com/collectors/importer/configuration does a good job of laying out the possibilities, but it doesn't quite explain what the implicit default configuration is for the default Tika-based importer.
I'm also tracking boilerpipe 1.2.0, which is not in Maven and which is blocking some Tika integration and Solr integration issues, where "Skip navigation" and similar text ends up as part of the textual content.
Not bad in a bag-of-words model, not so great for contextual snippet generation.