Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

ReduceConsecutivesTransformer behavior #33

Closed by OkkeKlein 8 years ago

OkkeKlein commented 8 years ago

My parsed content has a lot of CRLFs that I am trying to clean up. Should

<reduce>\r</reduce>
<reduce>\r\n</reduce>
<reduce>\n</reduce>

be working, or is \r\n not supported?

essiembre commented 8 years ago

What is the parent tag for these reduce tags? Assuming it is <transformer class="com.norconex.importer.handler.transformer.impl.ReduceConsecutivesTransformer" ...>, then yes, it should work.

Can you attach a sample document and the entire transformer config snippet?
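
For reference, here is a rough plain-Java sketch of what the three reduce entries from the question would be expected to do when applied in order. It does not use the Importer API and is not the transformer's actual implementation; the class name ReduceSketch and the reduce helper are made up for illustration, and the sample text is invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not part of Norconex Importer.
class ReduceSketch {

    // Collapse back-to-back repetitions of the literal token into one occurrence.
    static String reduce(String text, String token) {
        String run = "(?:" + Pattern.quote(token) + ")+";
        return text.replaceAll(run, Matcher.quoteReplacement(token));
    }

    public static void main(String[] args) {
        // Invented sample: three CRLFs in a row, then three bare LFs in a row.
        String text = "line one\r\n\r\n\r\nline two\n\n\nline three";

        // Same order as the <reduce> entries in the question.
        for (String token : new String[] {"\r", "\r\n", "\n"}) {
            text = reduce(text, token);
        }

        // Each run is collapsed to a single line break of its own kind.
        System.out.println(text.equals("line one\r\nline two\nline three"));
    }
}
```

Note that the \r pass changes nothing in this sample, because the carriage returns are separated by \n and so are never consecutive on their own; it is the \r\n pass that collapses the CRLF run.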

OkkeKlein commented 8 years ago

My guess is this didn't work because the regex would have to be multiline or something. I fixed it by transforming \n and \r to whitespace and then doing a reduce with \s.

Works for me.
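
The two-step workaround can be sketched in plain Java roughly as follows. This is only an approximation of the idea, not the Importer configuration that was actually used: the class name NewlineCleanupSketch and the clean helper are hypothetical, and it assumes \s in the reduce step stands for a single space.

```java
// Hypothetical illustration only: not part of Norconex Importer.
class NewlineCleanupSketch {

    static String clean(String text) {
        // Step 1: turn every carriage return and line feed into a space.
        String spaced = text.replaceAll("[\\r\\n]", " ");
        // Step 2: collapse each run of consecutive spaces into a single space.
        return spaced.replaceAll(" +", " ");
    }

    public static void main(String[] args) {
        // Invented sample with mixed CRLF and LF runs.
        String text = "first\r\n\r\n\r\nsecond\n\nthird";
        System.out.println(clean(text).equals("first second third"));
    }
}
```

The trade-off is that all line breaks are lost, so this only fits cases where the line structure of the extracted text does not matter.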

svanschalkwyk commented 4 years ago

Works as designed if one uses it as a postParseHandler.