abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
http://github.com/abrom/henkei
MIT License

Timeout option #16

Closed: matthewford closed this issue 4 years ago

matthewford commented 4 years ago

Hi,

First off, thanks for forking yomu. We're running this within Sidekiq jobs for text extraction. Is there any chance we could get a timeout option?

Some of the PDFs, even small ones, for some reason take hours to run and appear to be blocked. A timeout option would be great.

abrom commented 4 years ago

Sounds like a bug in the Apache Tika library. I'd suggest creating an issue upstream to try to fix that.

As for a timeout option, I'm not sure it should necessarily live inside Henkei. It'd likely need to be something nasty like a call to Timeout::timeout.

If you're finding it necessary for your purpose then I'd suggest you look down that path. But I would warn you that there may be nasty consequences (like the Tika process not being cleaned up when in client mode, and potentially locking up the process when in server mode).
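
For anyone who does go down that path, here is a minimal sketch of the Timeout::timeout approach, wrapped around the extraction call from the outside rather than inside Henkei. The helper name `extract_with_timeout` is hypothetical and not part of Henkei, and the sketch does nothing about the caveats above: a Tika process started inside the block may keep running after the timeout fires.

```ruby
require 'timeout'

# Hypothetical helper (not part of Henkei): runs the given block and
# returns its value, or nil if the block takes longer than `seconds`.
def extract_with_timeout(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  # Caveat from the thread: a Tika child process started inside the
  # block may still be running here; this sketch does not clean it up.
  nil
end

# Possible use inside a job, assuming the henkei gem is loaded and
# `path` points at the document:
#   text = extract_with_timeout(60) { Henkei.new(path).text }
```

Returning nil on timeout is only one option. A job could just as well re-raise so that the queue's own retry or dead-letter handling takes over.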

This idea has already been raised in #6, where I outlined in more detail my reasons for not being so keen to add timeouts. Too many side effects.

matthewford commented 4 years ago

Fair enough. In the end, we actually modified the Tika jar. We were trying to do both OCR and extraction, which was resource-intensive, and I think it's a known issue that it can get into an infinite loop.

I found a secret 'auto' mode, which does extraction first and then falls back to OCR. That reduced the number of blocked workers to manageable levels.