abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
http://github.com/abrom/henkei
MIT License

Henkei.read outputs Unicode as bytes #28

Closed andrewngo closed 2 years ago

andrewngo commented 2 years ago

I have a PDF with some Unicode text (e.g. [Text üŇÍ文字]), but the output differs in the new version. It looks like it's printing raw bytes instead of characters. Any ideas?

require 'henkei'
data = File.binread 'path_to_pdf.pdf'
txt = Henkei.read :html, data
puts txt

v1.21.0

<p>&uuml;Ň&Iacute;文字名 &uuml;Ň&Iacute;文字姓 [Text &uuml;Ň&Iacute;文字]</p>

v2.3.0.1

<p>&uuml;\xC5\x87&Iacute;\xE6\x96\x87\xE5\xAD\x97\xE5\x90\x8D &uuml;\xC5\x87&Iacute;\xE6\x96\x87\xE5\xAD\x97\xE5\xA7\x93 [Text &uuml;\xC5\x87&Iacute;\xE6\x96\x87\xE5\xAD\x97]</p>

System info (Mac)

ruby -v
"ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-darwin20]"

abrom commented 2 years ago

Henkei is very much a thin wrapper around Apache Tika. It would seem that Tika v1.21 might be returning its results with a different encoding than v2.3.0 😉

Have you checked what the encodings of the two strings are? i.e. '....'.encoding

It would likely also come down to what your system default encoding is. For me it's UTF-8, so:

> "<p>&uuml;\xC5\x87&Iacute;\xE6\x96\x87\xE5\xAD\x97\xE5\x90\x8D &uuml;\xC5\x87&Iacute;"
=> "<p>&uuml;Ň&Iacute;文字名 &uuml;Ň&Iacute;"

You can change/force the encoding with encode('UTF-8') or possibly force_encoding('UTF-8'). The difference is that the first will try to transcode the characters from one encoding to the other, whereas the second leaves the bytes untouched and only changes the encoding Ruby uses to interpret them. But again, that's really going to come down to how your system is configured.
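The difference between the two calls can be sketched as follows. This assumes the returned string holds UTF-8 bytes tagged as binary, which is what the escaped output in the issue suggests but is not confirmed; `raw` is a hypothetical stand-in, not real Henkei output.

```ruby
# Assumption: UTF-8 bytes for "Ň文字", tagged as binary (ASCII-8BIT).
raw = "\xC5\x87\xE6\x96\x87\xE5\xAD\x97".b

# force_encoding only relabels the existing bytes. It mutates the receiver,
# so dup first to keep `raw` unchanged for the comparison.
relabelled = raw.dup.force_encoding('UTF-8')

# encode attempts a real conversion. Binary has no character mapping for
# bytes above 0x7F, so this raises instead of returning a string.
transcode_error =
  begin
    raw.encode('UTF-8')
    nil
  rescue Encoding::UndefinedConversionError => e
    e.class
  end

puts relabelled.valid_encoding?
puts relabelled.bytes == raw.bytes
puts transcode_error.inspect
```

Under that assumption, `force_encoding` is the call that fits, since the bytes are already correct and only the label is wrong. If the string carried a real but different encoding (for example ISO-8859-1), `encode('UTF-8')` would be the appropriate choice.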

andrewngo commented 2 years ago

Thanks @abrom. Using .force_encoding('UTF-8') works as a workaround for me.

Worth noting that on Windows, line breaks in the output now include a carriage return (\r\n instead of \n) with this version, which is also unexpected but easily solved. Not sure if there is any configuration to control this.
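One way to handle the carriage returns is to normalise them after reading. This is a post-processing sketch, not a Henkei or Tika setting; `normalize_newlines` and `sample` are hypothetical names.

```ruby
# Hypothetical helper (not part of Henkei): collapse Windows line endings
# (\r\n) into Unix ones (\n), leaving lone \n untouched.
def normalize_newlines(text)
  text.gsub("\r\n", "\n")
end

# Assumed sample standing in for text extracted on Windows.
sample = "first line\r\nsecond line\r\n"
normalized = normalize_newlines(sample)
puts normalized.inspect
```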