Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Replacing space, new-lines, tabs from content #79

Closed by wolverline 6 years ago

wolverline commented 6 years ago

Hi Pascal,

I'm using 2.8.0 so if this issue has been addressed in 2.9.0, please let me know.

I'm trying to remove whitespace, new-lines, and tabs from content. I found the same or similar postings and tried several different combinations, but none of them seems to work. The following is what I added to the preParseHandlers:

<transformer class="${importer}.handler.transformer.impl.ReplaceTransformer" caseSensitive="false">
  <replace>
    <fromValue>^\s*[\r\n\t]</fromValue>
    <toValue></toValue>
  </replace>
</transformer>

In this case, I get the following error:

com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@850ea86
    at com.norconex.importer.p ...
Caused by: java.lang.IllegalArgumentException: root cannot be null

If I add this to the postParseHandlers instead, the error doesn't appear, but the transformer has no effect. I also tried (\r\n|\n|\r|\t|\s), to no avail.
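
One possible reason the first expression (^\s*[\r\n\t]) appears to do nothing: in plain java.util.regex, ^ matches only at the very start of the input unless the MULTILINE flag is set. The thread does not say which flags ReplaceTransformer uses when compiling fromValue, so the sketch below calls java.util.regex directly. The class name and sample string are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Norconex Importer.
class AnchorFlagDemo {
    // The expression from the question, written as a Java string literal.
    static final String REGEX = "^\\s*[\\r\\n\\t]";

    // Default flags: ^ matches only at the start of the whole input.
    static String stripDefault(String text) {
        return Pattern.compile(REGEX).matcher(text).replaceAll("");
    }

    // MULTILINE: ^ also matches right after each line terminator.
    static String stripMultiline(String text) {
        return Pattern.compile(REGEX, Pattern.MULTILINE).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "\n\n  foo\n\tbar";
        // Brackets make leading and trailing whitespace visible.
        System.out.println("[" + stripDefault(sample) + "]");
        System.out.println("[" + stripMultiline(sample) + "]");
    }
}
```

With default flags only the two leading new-lines are removed. With MULTILINE the tab before "bar" is removed as well. In both cases the spaces before "foo" survive, because every match has to end in \r, \n, or \t.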

So far, the only XML config that works is:

<transformer class="${importer}.handler.transformer.impl.ReduceConsecutivesTransformer">
  <reduce>\s</reduce>
  <reduce>\n</reduce>
  <reduce>\r</reduce>
  <reduce>\t</reduce>
  <reduce>\n\t</reduce>
  <reduce>\n\r\t</reduce>
  <reduce>\r\n</reduce>
  <reduce>\s\n</reduce>
  <reduce>\s\r</reduce>
  <reduce>\s\r\n</reduce>
  <reduce>\n\r\s</reduce>
</transformer>

But this doesn't remove all the whitespace (especially leading spaces), so it isn't satisfactory. Is there any way I can remove it all?

wolverline commented 6 years ago

I ended up cleaning up leading and trailing whitespace by adding the following:

<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
  <replace>
    <fromValue>^\s+|\s+$|\s+(?=\s)</fromValue>
    <toValue></toValue>
  </replace>
</transformer>
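
To see what this expression does on its own, here is a small sketch that applies it with java.util.regex and default flags. Whether ReplaceTransformer adds other flags is not stated in the thread. The class name, sample strings, and the normalize helper are hypothetical and serve only as illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Norconex Importer.
class WhitespaceCleanupDemo {
    // Same expression as the fromValue above, as a Java string literal.
    static final Pattern CLEANUP = Pattern.compile("^\\s+|\\s+$|\\s+(?=\\s)");

    // Trims both ends and shrinks every inner whitespace run to its last character.
    static String clean(String text) {
        return CLEANUP.matcher(text).replaceAll("");
    }

    // Alternative (an assumption, not from the thread): turn every run into one space.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Brackets make leading and trailing whitespace visible.
        System.out.println("[" + clean("  foo   bar \n baz  ") + "]");
        System.out.println("[" + clean("a \nb") + "]");
        System.out.println("[" + normalize("a \nb") + "]");
    }
}
```

Because the lookahead keeps the last character of each run, a run that ends in a new-line still leaves that new-line in place: clean("a \nb") keeps the line break, while normalize("a \nb") replaces it with a single space.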