abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
http://github.com/abrom/henkei
MIT License

Fix data streaming error for web sourced PDFs #12

Closed abrom closed 4 years ago

gsar commented 4 years ago

@abrom I just ran into a gotcha with this change. In versions before 1.23.1, web-sourced documents returned text in UTF-8 encoding, but after the change in this PR the text is ASCII-8BIT. That results in errors if you try to use it in contexts like regexp matching where the regexp is UTF-8 (the Ruby default). This is the difference:

henkei v1.23:

[1] pry(main)> Henkei.new(Kernel.open('http://africau.edu/images/default/sample.pdf')).text.encoding
=> #<Encoding:UTF-8>

henkei v1.23.1:

[1] pry(main)> Henkei.new(Kernel.open('http://africau.edu/images/default/sample.pdf')).text.encoding
=> #<Encoding:ASCII-8BIT>

I think we may need to make a note of this somewhere, as it is certainly an incompatible change. For methods that are supposed to return text, it might be worth calling force_encoding('UTF-8') by default and adding an option that lets the user disable the automatic coercion or use a different encoding. This is obviously highly dependent on file type, so the user knows best what is appropriate for each case, but having a sane default makes sense.
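
For anyone wanting to see the failure without Henkei or a network call, here is a minimal sketch of the mismatch. The sample string is made up and only stands in for text returned by 1.23.1. As far as I can tell, the error only shows up when both the text and the regexp contain non-ASCII characters, so plain ASCII documents may not trigger it.

```ruby
# Stand-in for extracted document text (hypothetical sample, not Henkei output).
# String#b returns a copy of the string tagged as ASCII-8BIT (binary).
binary_text = "caf\u00E9 au lait".b

# A regexp containing a non-ASCII character is fixed to UTF-8.
pattern = /caf\u00E9/

mismatch_error = nil
begin
  binary_text =~ pattern
rescue Encoding::CompatibilityError => e
  # Raised because a UTF-8 regexp cannot be matched against
  # a binary string that holds non-ASCII bytes.
  mismatch_error = e
end

# force_encoding only changes the encoding label, the bytes stay the same,
# so re-tagging a copy as UTF-8 makes the match work again.
utf8_text = binary_text.dup.force_encoding('UTF-8')
match_index = utf8_text =~ pattern
```

Note that `force_encoding` is called on a `dup` here so the original binary string is left alone.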

abrom commented 4 years ago

I suspect that's most likely because:

Encoding.aliases['BINARY']
 => "ASCII-8BIT" 

What I've done in the past is something like:

text.force_encoding 'UTF-8'
return text if text.valid_encoding?
text.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')

But that is obviously destructive, so I'm not sure it'd be something you'd really want pre-processing the result.
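
A small illustration of what "destructive" means here, assuming the source bytes were really ISO-8859-1 rather than UTF-8 (the sample bytes are made up). Once the invalid byte has been replaced, the original character cannot be recovered from the cleaned string, whereas decoding with the correct source encoding keeps it.

```ruby
# "café" encoded as ISO-8859-1: the trailing 0xE9 byte is not valid UTF-8.
latin1_bytes = "caf\xE9".b

# Same steps as the snippet above, applied to a copy.
text = latin1_bytes.dup
text.force_encoding 'UTF-8'
cleaned =
  if text.valid_encoding?
    text
  else
    text.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
  end

# Decoding with the real source encoding preserves the character instead.
recovered = latin1_bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts cleaned
puts recovered
```

The catch is that the caller has to know the real source encoding up front, which is exactly the information the library does not have.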

Yes, could be some sort of option to turn the coercion on/off but it starts getting very murky very quickly!
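
One possible shape for such an option, purely as a sketch. `coerce_text` and its `encoding:` / `replace_invalid:` keywords are hypothetical names and not part of Henkei's API. It also uses `String#scrub` (Ruby 2.1+) for the replacement step rather than `encode`.

```ruby
# Hypothetical helper, not part of Henkei.
#   encoding: nil          -> return the raw string untouched (coercion off)
#   replace_invalid: true  -> replace undecodable bytes with '?' (lossy)
def coerce_text(raw, encoding: 'UTF-8', replace_invalid: false)
  return raw if encoding.nil?

  # Work on a copy so the caller's string keeps its original encoding label.
  text = raw.dup.force_encoding(encoding)
  return text if text.valid_encoding? || !replace_invalid

  text.scrub('?')
end

raw = "caf\xC3\xA9".b                           # UTF-8 bytes tagged as binary
default_text = coerce_text(raw)                 # re-tagged as UTF-8
untouched    = coerce_text(raw, encoding: nil)  # still ASCII-8BIT
scrubbed     = coerce_text("caf\xE9".b, replace_invalid: true)
```

With the defaults this only re-labels the bytes, so nothing is lost unless the caller explicitly asks for replacement. That still leaves open what to do when the forced encoding is simply wrong for the document, which is where it gets murky.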