4teamwork / ftw.tika

This product integrates Apache Tika for full text indexing with Plone.

Pass MIME type and file extension to Tika #7

Open lukasgraf opened 10 years ago

lukasgraf commented 10 years ago

Currently a temporary file without file extension is used to store the original document passed to Tika.

We probably should

lukasgraf commented 10 years ago

It seems the Apache Tika command line interface doesn't support passing in the MIME type of the document (or any other metadata, for that matter). Tika's Detector interface would take such metadata into account, but the metadata argument seems to be exposed only in the Tika API, not in the command line interface.

lukasgraf commented 10 years ago

So this leaves us with one option: set the file extension of the temporary file, and let Tika's MIME type detection do its work.
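A minimal sketch of what that could look like on the Python side, using only the standard library. The function names `pick_suffix` and `write_temp_document` are hypothetical and not part of ftw.tika; the sketch assumes the original filename and/or MIME type are available at the point where the temporary file is created, and it prefers the filename's extension over one guessed from the MIME type.

```python
import mimetypes
import os
import tempfile


def pick_suffix(filename=None, mimetype=None):
    """Return an extension (with leading dot) for the temp file, or ''.

    Hypothetical helper: prefers the original filename's extension and
    falls back to one guessed from the MIME type.
    """
    if filename:
        ext = os.path.splitext(filename)[1]
        if ext:
            return ext.lower()
    if mimetype:
        # guess_extension() returns None for types it does not know.
        return mimetypes.guess_extension(mimetype) or ""
    return ""


def write_temp_document(data, filename=None, mimetype=None):
    """Write `data` (bytes) to a temporary file that carries an extension.

    Returns the path; the caller is responsible for removing the file
    once Tika has processed it.
    """
    suffix = pick_suffix(filename, mimetype)
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        return tmp.name
```

For example, `write_temp_document(blob, filename="letter.docx")` yields a path ending in `.docx`, which Tika's detection can then take into account. If neither a filename extension nor a known MIME type is available, the file is created without an extension, which matches the current behaviour.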

The Tika Content Detection docs say that Tika

The command line interface help describes a switch:

-d  or --detect        Detect document type

Detection seems to be enabled by default (otherwise, converting a temporary file with no extension wouldn't have worked). Still, we should probably enable this switch explicitly to be sure content type detection is always performed.
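Before relying on the switch, it may be worth running it on its own against a sample file to see what it actually outputs for the Tika version in use. A rough sketch, assuming a `java` executable on the PATH and a tika-app jar at a path supplied by the caller; the function names are hypothetical, and only the `--detect` switch quoted from the help text above is used.

```python
import subprocess


def build_detect_command(jar_path, document_path):
    """Build the argument list for running the Tika CLI with --detect.

    `jar_path` is an assumption: wherever the tika-app jar lives locally.
    """
    return ["java", "-jar", jar_path, "--detect", document_path]


def run_detect(jar_path, document_path):
    """Run the command and return whatever Tika prints, stripped."""
    output = subprocess.check_output(
        build_detect_command(jar_path, document_path))
    return output.decode("utf-8").strip()
```

Comparing the output for the same document stored with and without a file extension would show whether the extension influences detection.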