abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
http://github.com/abrom/henkei
MIT License
74 stars, 14 forks

Fix issue where certain files would cause Errno::EPIPE exception to be raised #9

Closed. abrom closed this 4 years ago

abrom commented 4 years ago

Fixes #7

Use Open3 instead of the IO library to call Tika.
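For context, here is a minimal sketch of the Open3 approach. It is not the actual Henkei code: `pipe_through` is a hypothetical helper, and `cat` stands in for the Tika command line. `Open3.capture2` writes the input to the child's stdin and collects its stdout in a single call, so the caller never handles the pipe directly.

```ruby
require 'open3'

# Hypothetical helper for illustration only (not part of Henkei).
# Feeds `data` to the stdin of `command` (an array of program + args)
# and returns whatever the command writes to stdout.
def pipe_through(command, data)
  # binmode: true keeps binary document bytes intact in both directions.
  stdout, status = Open3.capture2(*command, stdin_data: data, binmode: true)
  raise "command failed with status #{status.exitstatus}" unless status.success?

  stdout
end

# `cat` simply echoes stdin back; a real caller would pass the Tika command here.
puts pipe_through(['cat'], 'hello from a document')
```

With a plain `IO.popen` pipe, the caller is responsible for the write to stdin, and a child that stops reading early can surface as `Errno::EPIPE` in the parent. Whether that is exactly what the problem files trigger is an assumption here.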

abrom commented 4 years ago

FYI @rywall. The default still uses the IO pipe method, but you can opt in to the Ruby Shell alternative with an initialiser param (or via the class-level .read method).

rywall commented 4 years ago

@abrom perfect, thanks!

abrom commented 4 years ago

@rywall I was having some major flakiness issues with the solution I'd originally come up with, so I've gone back to the drawing board and come up with another solution that appears to be able to read your problem file without any errors.

I've tested it on a bunch of my files and, as far as I can tell, it's all looking good.

However, I'd really like to get your input on whether this works for you on a larger set of data before merging. Can you please check this branch out and try it on some of the other files you've previously had problems with (and of course check that there aren't any regressions)?