Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

What is XML equivalent to default importer configuration? #42

Closed: danizen closed this issue 7 years ago

danizen commented 7 years ago

http://www.norconex.com/collectors/importer/configuration does a good job of laying out the possibilities, but doesn't quite explain what the implicit, default configuration is for the default Tika-based importer.

Also tracking boilerpipe 1.2.0, which is not in Maven and is blocking some issues with the Tika and Solr integrations, where they end up with "Skip navigation" and such as part of the textual content.

Not bad in a bag-of-words model, not so great for contextual snippet generation.

essiembre commented 7 years ago

Except for a few fixes and a very few content types not yet supported by Tika, the Importer relies on Tika for the parsing of all content types by default. You can find the default parsing-related configuration options by looking at GenericDocumentParserFactory. For most use cases though, if you are interested in the text content, you should be fine with the defaults. Tika is used for parsing and extracting raw text only (and works well with HTML). Content manipulation is done separately with the Importer handlers (taggers, splitters and transformers).

For instance, if you want to skip certain parts of a page (like stripping side navigation, headers, footers, etc.), you can have a look at available handlers that can help you with that, such as StripBetweenTransformer, ReplaceTransformer or other transformers (listed here).
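To make the "strip between" idea concrete, here is a minimal sketch in plain Java using only java.util.regex. It is not the Importer's StripBetweenTransformer and does not use any Norconex API. The class name StripBetweenSketch, the method stripBetween, and the sample HTML are all made up for illustration. It only shows the general technique of removing everything between a start marker and the nearest end marker, markers included.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only: NOT part of Norconex Importer.
class StripBetweenSketch {

    // Removes every span that begins at a match of startRegex and ends at the
    // nearest following match of endRegex. Both markers are removed as well.
    // Matching is case-insensitive and spans may cross line breaks.
    static String stripBetween(String text, String startRegex, String endRegex) {
        Pattern span = Pattern.compile(
                "(?:" + startRegex + ").*?(?:" + endRegex + ")",
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        return span.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up page fragment containing the kind of navigation text
        // mentioned in the question above.
        String html = "<body><nav><a href=\"#main\">Skip navigation</a></nav>"
                + "<p>Actual article text.</p></body>";

        // Prints: <body><p>Actual article text.</p></body>
        System.out.println(stripBetween(html, "<nav>", "</nav>"));
    }
}
```

In this sketch the markers are ordinary regular expressions, so literal markers that contain regex metacharacters would need Pattern.quote. Stripping this way before the text is parsed keeps boilerplate such as "Skip navigation" out of the extracted content, which is what matters for snippet generation.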

danizen commented 7 years ago

Thanks.