4teamwork / ftw.tika

This product integrates Apache Tika for full text indexing with Plone.

Tika daemon support for performance boost #12

Closed jone closed 10 years ago

jone commented 10 years ago

Running Tika as a server is much faster because the JVM no longer has to boot for every file conversion.

Enhancements

This pull request adds support for the Tika daemon. The meta-directive is extended to also accept the host and port options. When these are configured, ftw.tika automatically switches to daemon mode and contacts the server on the configured port (and host, if set).

If the server is not running, ftw.tika automatically falls back to executing Tika directly, using the configured path to the jar file.
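The server-first, jar-as-fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the function names are hypothetical, and it assumes Tika is running in its socket-server mode (the client writes the raw document and reads plain text until EOF) and that the output is UTF-8.

```python
import socket
import subprocess


def convert_via_server(data, host, port, timeout=10.0):
    """Send a document to a running Tika socket server and return its text.

    Assumption: the server reads the raw document until the client closes
    its write side, then writes the extracted text and closes the socket.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(data)
        sock.shutdown(socket.SHUT_WR)  # signal "end of document"
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", errors="replace")


def convert_via_jar(data, jar_path):
    """Boot a JVM for this one conversion (the slow, non-daemon path)."""
    result = subprocess.run(
        ["java", "-jar", jar_path, "--text"],
        input=data,  # the Tika CLI parses stdin when no file is given
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_text(data, host, port, fallback):
    """Try the server first; if it cannot be reached, call the fallback."""
    try:
        return convert_via_server(data, host, port)
    except OSError:  # connection refused, timeout, unreachable host, ...
        return fallback(data)
```

A caller would pass the jar conversion as the fallback, for example `extract_text(doc, "localhost", 9998, lambda d: convert_via_jar(d, "/path/to/tika-app.jar"))`, where the port and the path are placeholders. Passing the fallback as a callable keeps the switching logic testable without a JVM.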

Production installation

ftw-buildouts provides a tika-server.cfg that can be used when the deployment buildout is based on ftw.buildout's deployment.cfg. The tika-server.cfg downloads Tika, creates a server script registered in supervisor, and configures ftw.tika via ZCML.

More details about how to install it using buildout are described in the updated readme.

Performance Test

I created 100 docx files in Plone with random content and length, then updated the SearchableText twice: once with the "old" method, which fires up Tika for every file, and once using a Tika server.

The results:

| Method     | Duration for 100 files | Duration per file |
|------------|------------------------|-------------------|
| Non-Daemon | 110 seconds            | 1.1 seconds       |
| Daemon     | 6.72 seconds           | 0.0672 seconds    |
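
As a quick sanity check of the results above, the per-file durations and the resulting speedup can be recomputed from the two totals. The variable names are illustrative only; the numbers are the ones reported in the results.

```python
# Totals taken from the results above; names are illustrative.
FILES = 100
non_daemon_total = 110.0  # seconds for 100 files, JVM booted per file
daemon_total = 6.72       # seconds for 100 files, Tika server

non_daemon_per_file = non_daemon_total / FILES
daemon_per_file = daemon_total / FILES
speedup = non_daemon_total / daemon_total

print(f"non-daemon: {non_daemon_per_file:.4f} s/file")
print(f"daemon:     {daemon_per_file:.4f} s/file")
print(f"speedup:    {speedup:.1f}x")
```

That works out to a speedup of roughly 16x for this test set.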

@lukasgraf can you take a look at my changes? /cc @maethu

lukasgraf commented 10 years ago

Just tested it locally, works beautifully! :+1: And the performance boost is pretty significant :grinning:

jone commented 10 years ago

@lukasgraf I've squashed the commits.

lukasgraf commented 10 years ago

Thanks! :+1: