Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Change content field based on contentType #88

Closed: mnemonictrick closed this issue 5 years ago

mnemonictrick commented 5 years ago

Hi there,

I'm trying to change the content that gets committed depending on the contentType. Specifically, for PDF files I want to submit the "description" field instead of the original content.

So far I've tried CopyTagger, in both preParseHandlers and postParseHandlers. Neither works.

    <tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
        <restrictTo caseSensitive="false" field="document.contentType">
            application/pdf
        </restrictTo>
        <copy fromField="description" toField="content" overwrite="true" />
    </tagger>

What would be the correct way to do this?

Thank you very much!

essiembre commented 5 years ago

The description field (if present) is extracted when the Importer module parses the document. That means your tagger should definitely be a post-parse handler.

That being said, I see nothing obviously wrong with what you have. A few ideas:

If none of this helps, can you share your full config for further review, along with a "faulty" PDF if possible so the problem can be reproduced?

mnemonictrick commented 5 years ago

Hi essiembre,

thank you for your quick reply.

There is one (strange?) thing I discovered, though. While going through the output of the DebugTagger, I noticed the following line:

MyProject: 2019-01-07 07:38:49 INFO - content=Content of description-field

The committed output (local testing with JSONFileCommitter), however, is unchanged:

... "doc-add": { "reference": ..., "metadata": "ObservableMap [map=ObservableMap [map=ObservableMap [map=....", "content": "Made by\n2,21\n1,96..." }

Is there any other hint you could give me? The PDF doesn't seem to be the problem...

Thank you so much!

mnemonictrick commented 5 years ago

Hi,

We couldn't manage to change the "content" field, so we kept the description field and added it to the submitted fields instead. The frontend modules then decide which field value to display.
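
For anyone landing here with the same problem, the frontend-side decision described above could look roughly like the sketch below. It is only an illustration under assumptions: the class `DisplayTextChooser` and the method `chooseDisplayText` are hypothetical names that are not part of Norconex Importer, and the committed document is assumed to reach the frontend as a flat map of field names to values. The field names `document.contentType`, `description` and `content` are the ones used earlier in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical frontend helper; not part of Norconex Importer.
public class DisplayTextChooser {

    static final String PDF_TYPE = "application/pdf";

    // Returns the description for PDF documents that have a non-blank
    // description, and falls back to the regular content otherwise.
    static String chooseDisplayText(Map<String, String> fields) {
        String contentType = fields.get("document.contentType");
        String description = fields.get("description");
        String content = fields.get("content");

        // Case-insensitive match, mirroring caseSensitive="false" in the config.
        boolean isPdf = contentType != null
                && PDF_TYPE.equalsIgnoreCase(contentType.trim());
        boolean hasDescription = description != null
                && !description.trim().isEmpty();

        if (isPdf && hasDescription) {
            return description;
        }
        return content != null ? content : "";
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        Map<String, String> doc = new HashMap<>();
        doc.put("document.contentType", "application/pdf");
        doc.put("description", "Content of description-field");
        doc.put("content", "Made by\n2,21\n1,96...");
        System.out.println(chooseDisplayText(doc));
    }
}
```

Keeping the choice in the frontend leaves the committed "content" untouched, so documents without a description (or non-PDF documents) still display their original text.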