Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

file size limit #51

Closed angelo337 closed 7 years ago

angelo337 commented 7 years ago

Hi there. I have a question about the Importer: is it possible to limit the content size of a file? I am having issues with some large MS Excel files, and I would like to index just the first couple of MB instead of the full 45 MB file.

Could you please point me to some resources or give me some clues on how to deal with such large files? Thanks a lot. Best regards, Angelo

essiembre commented 7 years ago

The simplest is probably to use the SubstringTransformer as a post-parse handler.

<transformer class="com.norconex.importer.handler.transformer.impl.SubstringTransformer"
          end="10000"/>

The above example keeps the first 10,000 characters of the extracted text and truncates everything after them (the end index is exclusive).
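To make the "exclusive" end index concrete, here is a minimal Java sketch of the same kind of truncation on a plain string. It is only an illustration of the semantics, not the SubstringTransformer implementation; the class name TruncateSketch and the method truncate are hypothetical, and the sketch assumes the end value is zero or greater.

```java
public class TruncateSketch {

    /**
     * Keeps the characters at indexes 0 to end - 1 and drops the rest.
     * Hypothetical helper for illustration only; assumes end >= 0.
     */
    static String truncate(String text, int end) {
        // Text shorter than the limit is returned unchanged.
        int limit = Math.min(end, text.length());
        return text.substring(0, limit);
    }

    public static void main(String[] args) {
        // With end = 3, the characters at indexes 0, 1 and 2 are kept.
        System.out.println(truncate("abcdef", 3));
        // The limit is larger than the text, so nothing is cut.
        System.out.println(truncate("abc", 10));
    }
}
```

Note that the limit counts characters of extracted text, not bytes of the original file, so a value of 10,000 is far smaller than "a couple of MB". A value in the millions may be closer to that, depending on the content and its encoding.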

Does that work for you?

angelo337 commented 7 years ago

Thanks a lot for your fast answer. I will try it and let you know.