abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
http://github.com/abrom/henkei
MIT License

Errno::EPIPE on certain png images #7

Closed: rywall closed this issue 4 years ago

rywall commented 5 years ago

First off, great work on this gem. It works amazingly well 99.9% of the time.

I've encountered certain png images (one uploaded as test2.png) that produce an Errno::EPIPE error when calling .text. I would expect either an empty string (like test1.png) or a more intelligible error message from Henkei.

test1.png

    > Henkei.new("test1.png").text
    => ""

test2.png

    > Henkei.new("test2.png").text
    Errno::EPIPE (Broken pipe)

abrom commented 5 years ago

Thanks, although I've just been updating Tika as new versions are released. The gem is mostly the work of https://github.com/yomurb/yomu (which has been inactive for some time).

Hmm, so it seems like Tika is closing the pipe before Henkei has finished writing the image file. Piping the file on the console works fine, so the problem is in either the reading or the writing of the file within Henkei, although I couldn't say which just yet.

Can Tika even extract text out of a PNG? Or is this just a repeatable use-case you've found?

One option would be to capture the pipe exception and handle it more gracefully. Return nil? Raise some other exception? Hmm... not a fan.

This is certainly not the first time this issue has come up; see https://github.com/yomurb/yomu/issues/7 (unresolved).

I'll have a look to see if there is a better way to pipe the data into Tika, but I'm open to suggestions.

rywall commented 5 years ago

I'm actually not sure if Tika can extract text from a png or not. My app just tries to extract text from any uploaded file.

I agree that ideally the root of the problem would be fixed, but even just being able to rescue a Henkei::TikaError or something like that instead of Errno::EPIPE would be an improvement IMO.
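To make that concrete, here is a minimal sketch of the wrapping idea. Everything in it is an assumption for illustration: `HenkeiSketch`, `TikaError` and `wrap_pipe_errors` are hypothetical names, not part of Henkei's current API.

```ruby
# Hypothetical namespace so this sketch does not collide with the real gem.
module HenkeiSketch
  # Hypothetical error class; Henkei may or may not define anything like it.
  class TikaError < StandardError; end

  # Runs the given block and converts a broken pipe into a TikaError.
  # The original Errno::EPIPE stays available through TikaError#cause.
  def self.wrap_pipe_errors
    yield
  rescue Errno::EPIPE => e
    raise TikaError, "Tika closed the pipe before all input was written (#{e.message})"
  end
end

# Intended usage in an app (commented out because it needs the gem and a file):
#
#   text = begin
#     HenkeiSketch.wrap_pipe_errors { Henkei.new("test2.png").text }
#   rescue HenkeiSketch::TikaError
#     ""
#   end
```

This does not fix the root cause; it only gives callers one library-level error to rescue instead of a low-level `Errno::EPIPE`.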

Thoughts?

abrom commented 5 years ago

Interestingly, using the following in the client_read method results in just an empty string being returned (as expected):

    sh = Shell.new
    (sh.echo(data) | sh.system(tika_command(type))).to_s

I still need to do some more research into the differences between writing to a Ruby IO and using Ruby Shell echo, though.

On the limited number of files I've tested it with, I get the expected results.
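For comparison, here is a sketch of a different way to feed data to a child process, using `Open3.capture2` from the standard library instead of `Shell`. This is a swapped-in alternative, not the change discussed above. It relies on my understanding that `capture2` ignores `Errno::EPIPE` while writing `stdin_data`. `PipeSketch` and `pipe_through` are hypothetical names, and the demo commands are plain Ruby one-liners standing in for Tika.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper module, not part of Henkei.
module PipeSketch
  # Sends data to the command's stdin and returns whatever it wrote to stdout.
  # command is an array such as ["java", "-jar", ...]; data is a String.
  def self.pipe_through(command, data)
    output, _status = Open3.capture2(*command, stdin_data: data, binmode: true)
    output
  end
end

# Stand-in for a child that exits without reading its input,
# which is the situation that appears to trigger the broken pipe.
ignores_input = [RbConfig.ruby, '-e', 'print "done"']
puts PipeSketch.pipe_through(ignores_input, 'x' * 1_000_000)

# Stand-in for a child that consumes all of its input.
reads_input = [RbConfig.ruby, '-e', '$stdout.write($stdin.read.upcase)']
puts PipeSketch.pipe_through(reads_input, 'abc')
```

In Henkei the command would presumably be whatever `tika_command(type)` builds. I have not verified how that value would need to be passed to `capture2`.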

abrom commented 4 years ago

@rywall did you get a chance to try my suggestion?

rywall commented 4 years ago

@abrom I've been using your suggestion in production for the past couple of weeks and it seems to be working great. 😄