Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

[Q] Multiple fieldMatcher in a handler (v.3.x) #120

Open jetnet opened 2 years ago

jetnet commented 2 years ago

Hello Pascal,

I'd like to use several methods (e.g. csv and regex) in the KeepOnlyTagger, but it seems only one fieldMatcher is allowed:

<handler class="$KeepOnlyTagger">
  <fieldMatcher method="csv">crawl_date,type,content,collector.depth,document.language</fieldMatcher>
  <fieldMatcher method="regex">(thumbnailImage|imagePHash).*</fieldMatcher>
</handler>

Error:

1 XML configuration errors detected:

[XML] StartCommand: cvc-complex-type.2.4.a: Invalid content was found starting with element 'fieldMatcher'. One of '{restrictTo}' is expected.

How can I do that in 3.x? Thanks!

essiembre commented 2 years ago

It currently allows only one by design. The solution would be to merge your two matchers into a single one.
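
A minimal sketch of such a merge, untested against 3.x: the csv names become alternatives in a single regex. It assumes the regex method has to match the whole field name (which the trailing .* in the original expression suggests). The dots in collector.depth and document.language are escaped so they match literally.

```xml
<handler class="$KeepOnlyTagger">
  <fieldMatcher method="regex">(crawl_date|type|content|collector\.depth|document\.language|(thumbnailImage|imagePHash).*)</fieldMatcher>
</handler>
```

If the matcher turns out to match partially instead, anchoring the expression with ^ and $ should prevent longer names that merely contain one of these (content_type, for example) from being kept as well.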

It would be nice to be able to use many. I am marking this as a feature request.