Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Question on ReplaceConsecutivesTransformer #43

Closed danizen closed 7 years ago

danizen commented 7 years ago

Can I get this to apply to the content, and smush it all into a single line?

Thanks

essiembre commented 7 years ago

No, the ReduceConsecutivesTransformer will keep one instance of what you specify.

If you want more control, use the ReplaceTransformer. For instance, this should do what you want (put everything on one line):

    <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
      <replace>
        <fromValue>\s+</fromValue>
        <toValue xml:space="preserve"> </toValue>
      </replace>
    </transformer>

Tags containing only whitespace are stripped by default. To preserve that whitespace, you need to add xml:space="preserve" to the tag, as in the toValue above.
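To make the difference between the two approaches concrete, here is a minimal sketch in plain Java. It does not use the Importer API. The class and method names (WhitespaceDemo, collapseWhitespace, reduceConsecutives) are hypothetical, and reduceConsecutives is only an approximation of the "keep one instance" behavior described above, not the actual ReduceConsecutivesTransformer code. It assumes fromValue is treated as a Java regular expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceDemo {

    // Same regex as <fromValue> in the configuration above.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    /** Replaces every run of whitespace (spaces, tabs, line breaks) with one space. */
    public static String collapseWhitespace(String text) {
        return WHITESPACE_RUN.matcher(text).replaceAll(" ");
    }

    /**
     * Approximation (an assumption, not library code) of "keep one instance":
     * each run of two or more consecutive copies of the literal value
     * is reduced to a single copy.
     */
    public static String reduceConsecutives(String text, String value) {
        Pattern run = Pattern.compile("(?:" + Pattern.quote(value) + "){2,}");
        return run.matcher(text).replaceAll(Matcher.quoteReplacement(value));
    }

    public static void main(String[] args) {
        String content = "First line\n\n\nSecond   line\t\tend";

        // Reducing consecutive line breaks still leaves one line break,
        // so the content stays on two lines.
        System.out.println(reduceConsecutives(content, "\n"));

        // Replacing all whitespace runs puts everything on one line.
        System.out.println(collapseWhitespace(content));
    }
}
```

With this sample input, the first call leaves the text on two lines, while the second yields "First line Second line end", which is why the replace approach is the one that puts everything on one line.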

danizen commented 7 years ago

OK - am I right in understanding that transformers are run on the content and taggers are run on the metadata, or am I misunderstanding something?

essiembre commented 7 years ago

That is exactly the idea, yes. Taggers can also read the content, but only transformers can modify it. Technically, transformers can be implemented to deal with both if you need to, but the ones available focus on content only.