dewarim / cinnamon4

Cinnamon CMS - version 4
Apache License 2.0

Add ApacheTika service as standalone module / server for Cinnamon #99

Closed dewarim closed 1 year ago

dewarim commented 6 years ago

A simple HTTP servlet that takes a Cinnamon OSD as input and returns the parsed XML representation produced by Tika. The first implementation should process requests synchronously; later versions may process them asynchronously, so that content upload is faster from the user's perspective.
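
A minimal sketch of what such a synchronous endpoint could look like, using the JDK's built-in `com.sun.net.httpserver` package instead of a servlet container to keep the example self-contained. The class name `TikaServiceSketch`, the `/parse` path, the `ContentParser` interface and the `PLACEHOLDER` parser are all hypothetical; a real module would presumably delegate to Tika inside `toXml` and might accept an OSD reference rather than raw content.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class TikaServiceSketch {

    /** Hypothetical abstraction: a real implementation would call Tika here. */
    @FunctionalInterface
    public interface ContentParser {
        String toXml(InputStream content) throws IOException;
    }

    /** Stand-in parser (assumption): only reports the content length as XML. */
    public static final ContentParser PLACEHOLDER = content -> {
        byte[] bytes = content.readAllBytes();
        return "<content length=\"" + bytes.length + "\"/>";
    };

    /** Starts a server on the loopback interface; port 0 picks a free port. */
    public static HttpServer start(int port, ContentParser parser) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/parse", exchange -> handle(exchange, parser));
        server.start();
        return server;
    }

    /** Synchronous handling: the response is sent only after parsing has finished. */
    static void handle(HttpExchange exchange, ContentParser parser) throws IOException {
        try {
            if (!"POST".equals(exchange.getRequestMethod())) {
                exchange.sendResponseHeaders(405, -1); // -1: no response body
                return;
            }
            String xml;
            try (InputStream body = exchange.getRequestBody()) {
                xml = parser.toXml(body);
            }
            byte[] response = xml.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/xml; charset=utf-8");
            exchange.sendResponseHeaders(200, response.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(response);
            }
        } finally {
            exchange.close();
        }
    }
}
```

Keeping the parser behind an interface means the HTTP layer can be tested without Tika on the classpath, and the parsing step can later be moved behind a queue for asynchronous processing without changing the handler's contract.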

Reasoning:

  1. In Cinnamon 3, Tika and the Cinnamon server are packaged as one big fat jar, and all Tika parsing happens in the main server, with no way to offload the processing.
  2. Tika parses many file formats using many different libraries, some of which may have security issues. Ideally, the Cinnamon Tika service would run in a container that can be spun up for just one document. This would help confine a system compromise to the Tika container (unless the hypervisor is also hacked...).
dewarim commented 6 years ago

Note: synchronous processing is also useful because afterwards the Lucene service can index both the OSD and the Tika notes in one go.
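
To illustrate the indexing trade-off, here is a rough sketch that compares the two orderings. Everything in it is hypothetical (`IndexOrderingSketch`, `IndexCall`, the `parser` function standing in for the Tika service), and it assumes that an asynchronous design would index the OSD first and re-index it once the Tika output arrives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

public class IndexOrderingSketch {

    /** Hypothetical record of one call to the index service. */
    public static final class IndexCall {
        public final long osdId;
        public final String tikaXml; // null while the Tika output is not yet available

        public IndexCall(long osdId, String tikaXml) {
            this.osdId = osdId;
            this.tikaXml = tikaXml;
        }
    }

    /** Synchronous: parse first, then index the OSD and the Tika output in one pass. */
    public static List<IndexCall> uploadSync(long osdId, byte[] content,
                                             Function<byte[], String> parser) {
        List<IndexCall> calls = new ArrayList<>();
        String xml = parser.apply(content); // blocks until parsing is done
        calls.add(new IndexCall(osdId, xml));
        return calls;
    }

    /** Asynchronous (assumed design): index the OSD now, index again after parsing. */
    public static CompletableFuture<List<IndexCall>> uploadAsync(long osdId, byte[] content,
                                                                 Function<byte[], String> parser) {
        List<IndexCall> calls = new ArrayList<>();
        calls.add(new IndexCall(osdId, null)); // first pass without Tika output
        return CompletableFuture
                .supplyAsync(() -> parser.apply(content))
                .thenApply(xml -> {
                    calls.add(new IndexCall(osdId, xml)); // second pass with Tika output
                    return calls;
                });
    }
}
```

In this sketch the synchronous path produces a single index call per upload, while the asynchronous path returns to the caller sooner at the cost of a second index call.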