abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
http://github.com/abrom/henkei
MIT License

Support timeouts when shelling to tika #6

Closed: phallguy closed 4 years ago

phallguy commented 5 years ago

We've had issues sending some data to Tika via pipes: our process is left hung, waiting for the data to finish piping to Tika. This adds support for wrapping the IO calls in an optional timeout, to abort the process if it is taking too long.

abrom commented 5 years ago

I appreciate the effort you've put in, but the Ruby Timeout sledgehammer is likely to create more problems than it solves. I've seen these firsthand with socket-related comms and they can be a royal pain to debug. I haven't had any experience mixing pipes and Ruby Timeout. The only thing I could find was this Stack Overflow question:

https://stackoverflow.com/questions/17237743/timeout-within-a-popen-works-but-popen-inside-a-timeout-doesnt

In short, it does what you've proposed, although that user ends up killing the IO process in the cleanup rather than closing it (I haven't looked in io.c to see how close is handled internally).
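For reference, a minimal sketch of that pattern: the timeout sits inside the `popen` block and the child is killed on expiry instead of relying on close. `pipe_with_timeout` is a hypothetical helper, not part of henkei, and the command passed in is a stand-in for the real Tika invocation.

```ruby
require 'timeout'

# Hypothetical helper, not part of henkei: pipes `data` to `command` and
# returns the command's output, killing the child if it exceeds `seconds`.
def pipe_with_timeout(command, data, seconds)
  IO.popen(command, 'r+') do |io|
    begin
      Timeout.timeout(seconds) do
        io.write(data)
        io.close_write
        io.read
      end
    rescue Timeout::Error
      # Kill the child first, so the implicit close at the end of the popen
      # block does not sit waiting on a hung process.
      Process.kill('KILL', io.pid)
      raise
    end
  end
end
```

Something like `pipe_with_timeout(['cat'], 'hello', 5)` should hand back the input unchanged. The caveats about Timeout raising at arbitrary points still apply; the kill only makes sure the child doesn't outlive the call.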

Have you looked to see at what stage it's timing out? I.e. write, close_write, or read?
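One rough way to check is to log before and after each stage, so that a hang shows up as a "starting" line with no matching "finished" line. `trace_pipe_stages` is a hypothetical diagnostic helper, not part of henkei, and the command is again a stand-in for the real Tika invocation.

```ruby
# Hypothetical diagnostic helper, not part of henkei: runs the write,
# close_write and read stages in order, logging around each one and
# recording how long each took (in seconds).
def trace_pipe_stages(command, data, log: $stderr)
  timings = {}
  IO.popen(command, 'r+') do |io|
    output = nil
    stages = {
      write: -> { io.write(data) },
      close_write: -> { io.close_write },
      read: -> { output = io.read }
    }
    stages.each do |stage, step|
      log.puts "starting #{stage}"
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      step.call
      timings[stage] = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      log.puts "finished #{stage}"
    end
    # The block's value becomes the return value of IO.popen.
    [output, timings]
  end
end
```

It returns the output together with a hash of per-stage durations, e.g. `output, timings = trace_pipe_stages(['cat'], data)`.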

Is it an issue based on file size? Or some specific content?

Have you tried different versions of Tika? The major/minor versions of the gem match the Tika versions, so it would be useful to know whether this is an upstream regression.

abrom commented 4 years ago

@phallguy I've been working on another issue which changes the way the data is passed to Tika. This would only apply to the 'client' functions, but if that's what you're using, it'd be interesting to know whether it also solves your timeout problem.

Check out the branch for #7