Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

ExternalTransformer: INPUT gets corrupted #100

Closed: jetnet closed this issue 4 years ago

jetnet commented 5 years ago

Hello Pascal,

I'd like to generate a thumbnail image for every incoming document whose document.contentFamily is image, using an ExternalTransformer script built on ImageMagick tools. However, the binary content provided via STDIN or via ${INPUT} seems to get corrupted: the file size stays the same, but the binary content differs. For example, the original:

00000090: 0000 1390 4944 4154 789c ed9d 7994 5cc5  ....IDATx...y.\.

The same bytes as received by the transformer:

00000090: 0000 133f 4944 4154 789c ed3f 7994 5cc5  ...?IDATx..?y.\.

The transformer config looks like:

  <transformer class="com.norconex.importer.handler.transformer.impl.ExternalTransformer">
      <restrictTo caseSensitive="false"
              field="document.contentFamily">image</restrictTo>
      <command>
          /path/thumbnails.sh -m ${INPUT_META} -i ${INPUT}
      </command>
      <metadata inputFormat="properties">
          <pattern field="thumbnailImage" valueGroup="1" caseSensitive="false"><![CDATA[thumbnailImage = (.+)]]></pattern>
      </metadata>
      <tempDir>/tmp</tempDir>
  </transformer>

The output of thumbnails.sh looks like this (when testing):

thumbnailImage = /9j/4AAQSkZJRgABAQAAAQABAAD...

So the question is: does the ExternalTransformer support binary content? Thanks!

BTW, is there a better solution for thumbnail generation?
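In the two hex dumps above, the bytes 0x90 and 0x9d are both replaced by 0x3f (the '?' character) while the length is unchanged. That pattern is what a decode-then-encode round trip through a single-byte text charset produces when some byte values have no character assigned. The sketch below reproduces it in plain Java with windows-1252. It is only an illustration of one possible cause: nothing in this thread confirms that the importer does this, and the class and method names (CharsetRoundTrip, roundTrip, hex) are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetRoundTrip {

    /** Decodes the bytes as text in the given charset, then encodes them back. */
    static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    /** Formats bytes as lowercase hex without separators. */
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The 16 bytes shown at offset 0x90 of the original file.
        byte[] original = {
            0x00, 0x00, 0x13, (byte) 0x90, 0x49, 0x44, 0x41, 0x54,
            0x78, (byte) 0x9c, (byte) 0xed, (byte) 0x9d,
            0x79, (byte) 0x94, 0x5c, (byte) 0xc5
        };

        // Assumption for illustration: the content is treated as windows-1252 text.
        // 0x90 and 0x9d are unassigned there, so they decode to U+FFFD
        // and are encoded back as '?' (0x3f).
        byte[] asCp1252 = roundTrip(original, Charset.forName("windows-1252"));

        // ISO-8859-1 assigns a character to every byte value, so it is lossless.
        byte[] asLatin1 = roundTrip(original, StandardCharsets.ISO_8859_1);

        System.out.println("original:     " + hex(original));
        System.out.println("windows-1252: " + hex(asCp1252));
        System.out.println("ISO-8859-1:   " + hex(asLatin1));
    }
}
```

If the windows-1252 line matches the corrupted dump, it suggests that some step handled the image as text rather than as raw bytes. It does not show which handler did so.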

jetnet commented 5 years ago

I made some progress by moving that transformer to the first place in the pre-processing chain. The following tagger ran before it and seemed to cause this issue:

<!-- Simple UTF-8 detector: checking, if the page contains UTF-8 encoded umlauts -->
<tagger class="$CountMatchesTagger">
      <countMatches toField="uml_utf8_count" regex="true">[\u00E4\u00C4\u00F6\u00D6\u00FC\u00DC\u00DF]</countMatches>
</tagger>

Now I'm getting the expected results, but not every time. Sometimes the following exception occurs:

Exception in thread "StreamConsumer-STDOUT" java.lang.NullPointerException
        at com.norconex.commons.lang.io.CachedStreamFactory$MemoryTracker.hasEnoughAvailableMemory(CachedStreamFactory.java:151)
        at com.norconex.commons.lang.io.CachedOutputStream.write(CachedOutputStream.java:145)
        at java.io.OutputStream.write(OutputStream.java:75)
        at com.norconex.importer.handler.transformer.impl.ExternalTransformer.writeLine(ExternalTransformer.java:706)
        at com.norconex.importer.handler.transformer.impl.ExternalTransformer.access$000(ExternalTransformer.java:299)
        at com.norconex.importer.handler.transformer.impl.ExternalTransformer$1.lineStreamed(ExternalTransformer.java:671)
        at com.norconex.commons.lang.io.InputStreamLineListener.flushBuffer(InputStreamLineListener.java:117)
        at com.norconex.commons.lang.io.InputStreamLineListener.streamed(InputStreamLineListener.java:93)
        at com.norconex.commons.lang.io.InputStreamConsumer.fireStreamed(InputStreamConsumer.java:150)
        at com.norconex.commons.lang.io.InputStreamConsumer.run(InputStreamConsumer.java:98)

jetnet commented 5 years ago

Could it be a race condition? Once some processor or transformer has read the buffer, would it no longer be available to the next ones?

jetnet commented 5 years ago

One more update: I think I finally got it working. The solution was to save the $INPUT at the beginning of the script and provide it back as $OUTPUT. So, looks like, the ExternalTransformer consumes the input buffer and no content is available for further processors.
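The idea of a consumed input can be illustrated with plain java.io, independent of Norconex. In the sketch below, a stream that has been read to its end yields nothing on a second read, while a saved copy of the bytes can be handed to any number of later consumers. This is a generic analogue of the save-and-return workaround in the script, not the ExternalTransformer implementation, and whether the transformer really behaves this way is not established in this thread. The names ConsumedStreamDemo and drain are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ConsumedStreamDemo {

    /** Reads the stream to its end and returns everything that was read. */
    static byte[] drain(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for image content (the first four bytes of a PNG signature).
        byte[] content = {(byte) 0x89, 0x50, 0x4e, 0x47};
        InputStream shared = new ByteArrayInputStream(content);

        byte[] firstRead = drain(shared);   // first consumer reads all 4 bytes
        byte[] secondRead = drain(shared);  // stream is at its end: 0 bytes
        System.out.println(firstRead.length + " then " + secondRead.length);

        // Keeping a copy lets each later consumer open its own fresh stream.
        byte[] thirdRead = drain(new ByteArrayInputStream(firstRead));
        System.out.println("from saved copy: " + thirdRead.length);
    }
}
```

Running it prints "4 then 0" followed by "from saved copy: 4".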

essiembre commented 5 years ago

I am glad you have it working, but I am not sure what you mean by:

So, looks like, the ExternalTransformer consumes the input buffer and no content is available for further processors.

Can you share your config snippet illustrating what you did and/or explain further?

jetnet commented 5 years ago

The current configuration is quite large. I'll try to reproduce the issue with the transformer config from above only.