Closed chrismattmann closed 7 years ago
Yes, you're right. My first version used Apache Tika over REST: https://github.com/SpamScope/spamscope/commit/e0f580859b8ebee9019b2adc84312cfe8116adce
I replaced it, but maybe that was the wrong idea. So I'm thinking of rolling back to the REST version and using this Docker image: https://hub.docker.com/r/fmantuano/apache-tika-server/
If you want to help me, I would be very happy.
The Tika Python library uses the REST server, which is faster than command-line calls to the Java Tika app because the REST server doesn't need to reload the Tika config and the JVM each time. In addition, you don't need to worry about the location of the Tika jar file or install it separately; the library manages all of that for you.
Looks like you would just update requirements.txt to include tika (the package you get with pip install tika), and then make whatever updates are necessary. If you want, I can send a PR.
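For reference, here is a rough sketch of what the switch might look like on the Python side. It is not SpamScope's actual code: the helper names `text_from_parsed` and `extract_text` and the `server_url` parameter are made up for illustration, and it assumes `tika` has been installed with pip. The default port 9998 is Tika server's usual default; whether the Docker image above exposes it the same way is an assumption.

```python
def text_from_parsed(parsed):
    """Return the stripped text from a tika-python result dict, or "" if absent.

    tika-python returns a dict with "metadata" and "content" keys;
    "content" can be None when no text was extracted.
    """
    content = (parsed or {}).get("content")
    return content.strip() if content else ""


def extract_text(path, server_url=None):
    """Extract plain text from a file through the Tika REST server.

    server_url is a hypothetical parameter for this sketch, e.g.
    "http://localhost:9998" for a server running in Docker. When it is
    None, the library falls back to its own default endpoint.
    """
    # Imported lazily so the module still loads when tika is not installed.
    from tika import parser

    if server_url:
        parsed = parser.from_file(path, serverEndpoint=server_url)
    else:
        parsed = parser.from_file(path)
    return text_from_parsed(parsed)
```

Usage would be something like `extract_text("attachment.pdf", "http://localhost:9998")`, with the file name and URL being placeholders. Keeping the result handling in its own small function makes it easy to test without a running server.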