chrismattmann / tika-similarity

Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features.
Apache License 2.0
106 stars 59 forks

Tika similarity scripts do not account for Tika Server failing #106

Open augustopartida opened 8 months ago

augustopartida commented 8 months ago

The Jaccard, Edit Distance, and Cosine Distance similarity scripts do not restart the Tika server when it fails. The problem shows up when the Tika server processes too many files (100K) or unexpectedly stops processing requests.
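
One possible shape for a fix is a small retry wrapper around each parse call. The sketch below is illustrative only and is not code from the repository: `parse_with_restart`, its parameters, and the `restart` callback are hypothetical names, and how the Tika server is actually restarted is left to the caller.

```python
import time
from typing import Any, Callable, Optional


def parse_with_restart(
    path: str,
    parse: Callable[[str], Any],
    restart: Callable[[], None],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Any:
    """Call parse(path); if it raises, call restart() and try again.

    All names here are hypothetical. `parse` is whatever function sends the
    file to the Tika server, and `restart` is a caller-supplied function
    that brings the server back up.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return parse(path)
        except Exception as exc:
            # Broad catch on purpose: the issue does not say which
            # exception a failed server produces.
            last_error = exc
            if attempt < max_attempts:
                restart()
                # Give the server a moment to come back before retrying.
                time.sleep(delay_seconds)

    raise RuntimeError(
        f"giving up on {path!r} after {max_attempts} attempts"
    ) from last_error
```

In the similarity scripts, `parse` would presumably be the existing Tika-Python call (for example `tika.parser.from_file`), while `restart` would need to stop and relaunch the server in whatever way the deployment allows. Passing both in as callables keeps the retry logic independent of those details and easy to test without a running server.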