abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
MIT License

Fix data streaming error for web sourced PDFs #12

Closed: abrom closed this 4 years ago

gsar commented 4 years ago

@abrom i just ran into a gotcha with this change. in versions before 1.23.1, web sourced documents returned text in UTF-8 encoding, but after the change in this PR the text is in ASCII-8BIT. that results in errors (Encoding::CompatibilityError) if you try to use it in contexts like regexp matching where the regexp is UTF-8 (that's the ruby default) and the text contains non-ASCII bytes. so this is the difference:

henkei v1.23:

[1] pry(main)> Henkei.new(Kernel.open('http://africau.edu/images/default/sample.pdf')).text.encoding
=> #<Encoding:UTF-8>

henkei v1.23.1:

[1] pry(main)> Henkei.new(Kernel.open('http://africau.edu/images/default/sample.pdf')).text.encoding
=> #<Encoding:ASCII-8BIT>

i think we may need to make a note of this somewhere, as it is certainly an incompatible change. for methods that are supposed to return text, it might be worth calling force_encoding('UTF-8') by default and adding an option that lets the user disable that automatic coercion or use a different encoding. the right encoding is obviously highly dependent on file type, so the user knows best what is appropriate for each case, but having a sane default makes sense.
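
To make the incompatibility concrete, here is a minimal sketch that does not use henkei at all. The byte string is a made-up stand-in for text read from a binary stream, and `regexp_compatible?` is a hypothetical helper introduced only for this illustration:

```ruby
# Hypothetical helper (not part of henkei): returns false when matching
# the pattern against the text raises an encoding error.
def regexp_compatible?(text, pattern)
  pattern.match(text)
  true
rescue Encoding::CompatibilityError
  false
end

# Assumed sample data: the UTF-8 bytes for "café", tagged as ASCII-8BIT,
# which is how text from a binary stream would be labelled.
raw = "caf\xC3\xA9".b

# The \u escape forces this regexp to be fixed to UTF-8.
pattern = /caf\u00e9/

# Same bytes, relabelled as UTF-8 on a copy so `raw` is left untouched.
utf8 = raw.dup.force_encoding('UTF-8')

puts regexp_compatible?(raw, pattern)   # binary string vs UTF-8 regexp
puts regexp_compatible?(utf8, pattern)  # UTF-8 string vs UTF-8 regexp
```

Note that `force_encoding` only changes the encoding label and leaves the bytes as they are, so it is cheap, but it does not by itself guarantee that the result is valid UTF-8.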

abrom commented 4 years ago

I suspect that's most likely because:

 => "ASCII-8BIT" 

What I've done in the past is something like:

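# Relabel the bytes as UTF-8 (mutates text in place; the bytes themselves are unchanged)
text.force_encoding 'UTF-8'
return text if text.valid_encoding?
# Otherwise swap invalid byte sequences for '?' (this step is lossy)
text.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')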

But that is obviously destructive so not sure it'd be something you'd really want pre-processing the result.

Yes, could be some sort of option to turn the coercion on/off but it starts getting very murky very quickly!
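
For illustration only, one way the on/off option discussed above could look. This is a rough sketch under assumptions, not henkei's API: `coerce_text` and its `encoding:` and `replace_invalid:` keywords are hypothetical names. It uses `String#scrub` (Ruby 2.1+) for the lossy step, and by default returns the original string untouched when the bytes are not valid in the requested encoding:

```ruby
# Hypothetical helper (not part of henkei).
#   encoding:        target encoding label, or nil to disable coercion
#   replace_invalid: when true, invalid bytes are replaced with '?' (lossy)
def coerce_text(text, encoding: 'UTF-8', replace_invalid: false)
  return text if encoding.nil? # coercion switched off by the caller

  # Work on a copy so the caller's string is never mutated.
  coerced = text.dup.force_encoding(encoding)
  return coerced if coerced.valid_encoding?

  # The bytes are not valid in the requested encoding. Only alter them
  # if the caller opted in; otherwise hand back the original string.
  replace_invalid ? coerced.scrub('?') : text
end

valid   = "caf\xC3\xA9".b # assumed sample: valid UTF-8 bytes, binary label
invalid = "bad\xFF".b     # assumed sample: \xFF is never valid in UTF-8

puts coerce_text(valid).encoding
puts coerce_text(valid, encoding: nil).encoding
puts coerce_text(invalid).encoding
puts coerce_text(invalid, replace_invalid: true)
```

Keeping the lossy replacement opt-in means the default path never changes any bytes, which sidesteps the "destructive" concern, though it does leave callers to handle the case where the result is still ASCII-8BIT.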