Closed wolverline closed 6 years ago
I ended up cleaning up leading and trailing whitespace (and collapsing repeated whitespace) by adding the following:
<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
<replace>
<fromValue>^\s+|\s+$|\s+(?=\s)</fromValue>
<toValue></toValue>
</replace>
</transformer>
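For anyone wanting to check what that expression does before putting it in a config, here is a minimal sketch that applies the same pattern with plain `java.util.regex`. It assumes the whole text is matched as a single string with no extra regex flags; how the Importer actually feeds content to the transformer is not shown in this thread. The class name and the `clean` helper are made up for illustration and are not part of the Norconex API.

```java
import java.util.regex.Pattern;

// Illustrative only: not a Norconex class.
class WhitespaceCleanupSketch {

    // Same expression as the <fromValue> in the transformer above:
    //   ^\s+       whitespace at the very start of the input
    //   \s+$       whitespace at the very end of the input
    //   \s+(?=\s)  whitespace followed by more whitespace, so only
    //              the last character of each run is kept
    static final Pattern WS = Pattern.compile("^\\s+|\\s+$|\\s+(?=\\s)");

    // Hypothetical helper: replaces every match with an empty string,
    // mirroring the empty <toValue>.
    static String clean(String text) {
        return WS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // prints "[hello world]"
        System.out.println("[" + clean("  hello \t\n  world \n") + "]");
    }
}
```

Two things to keep in mind with this sketch: each whitespace run is reduced to its last character, so a run ending in a tab or newline leaves that tab or newline rather than a space; and without the multiline flag, `^` and `$` only anchor at the start and end of the whole input, not of each line.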
Hi Pascal,
I'm using 2.8.0 so if this issue has been addressed in 2.9.0, please let me know.
I'm trying to remove whitespace, newlines, and tabs from content. I found similar postings and tried several different combinations, but none of them seem to work. The following is what I added in the preParseHandlers:
In this case, it results in the following error:
com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@850ea86 at com.norconex.importer.p ...
Caused by: java.lang.IllegalArgumentException: root cannot be null
If I add this in the postParseHandlers, the error doesn't appear, but it has no effect. I tried (\r\n|\n|\r|\t|\s) as well, but to no avail. So far, the only XML config that works is:
But this doesn't remove all the whitespace (especially leading spaces), which is not satisfactory. Is there any way I can remove it all?