Sicos1977 / IFilterTextReader

A reader that gets text from different file formats through the IFilter interface

Provide more user control over character cleanup #44

Closed richardtallent closed 4 years ago

richardtallent commented 4 years ago

Hi! I'm finding that with the legal documents I'm reading, the "cleanup" routine is removing characters I actually need (section symbols, degrees, etc.), and it may also be applying "-" after spurious "word breaks" coming from the legacy Word reader.

This PR suggests two new fields for the FilterReader class to provide the caller with more control over these substitutions. This would be a non-breaking change, since it defaults to the previous behavior.

Thanks!

richardtallent commented 4 years ago

BTW, this is a long-overdue follow-up to issue #37.

Sicos1977 commented 4 years ago

I moved your new options to the FilterReaderOptions class, which I designed to control how the filter reader works. I also just released a new NuGet package with these options.

richardtallent commented 4 years ago

Thanks, this is perfect! I glanced at FilterReaderOptions but for some reason thought it was only used to pass native options back to IFilter.