abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
http://github.com/abrom/henkei
MIT License

Add optional `encoding` argument to set output character encoding #47

Closed: SimonBrazell closed this 2 months ago

SimonBrazell commented 3 months ago

What does this PR do?

Adds an optional `encoding` argument to `#text`, `#html` and `.read` that sets the output character encoding. The value is passed to Tika as the `--encoding` option. It is validated against Ruby's `Encoding.name_list`, and an `ArgumentError` is raised if the name is not in that list.
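For illustration, the validation step described above could look roughly like the sketch below. The helper names `validate_encoding!` and `tika_encoding_args` are hypothetical and are not taken from the PR diff; the `--encoding=NAME` form is assumed from the Tika command-line usage text.

```ruby
# Hypothetical helpers sketching the described behaviour; not the actual Henkei code.

# Returns the encoding name if it is one Ruby knows about, or nil when no
# encoding was requested. Raises ArgumentError for unknown names.
def validate_encoding!(encoding)
  return nil if encoding.nil?

  name = encoding.to_s
  # Encoding.name_list is case-sensitive, so 'UTF-8' passes but 'utf-8' does not.
  unless Encoding.name_list.include?(name)
    raise ArgumentError, "unsupported encoding: #{name.inspect}"
  end

  name
end

# Builds the extra command-line arguments for Tika (assumed flag form).
def tika_encoding_args(encoding)
  name = validate_encoding!(encoding)
  name ? ["--encoding=#{name}"] : []
end
```

Because `Encoding.name_list` is matched exactly, a caller passing a lowercase name such as `'utf-8'` would get an `ArgumentError` under this sketch; whether the PR normalises case is not stated here.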

Why is it needed?

So we can set the output encoding via Henkei / Tika instead of having to do it ourselves afterwards.

Where should the reviewer start?

How should this be manually tested?

henkei = Henkei.new 'sample.pages'
utf_8_text = henkei.text(encoding: 'UTF-8')
utf_8_text.encoding
# => #<Encoding:UTF-8>

SimonBrazell commented 2 months ago

@abrom I modified the Open3.popen2 call to more closely match the capture2 source.

https://github.com/ruby/open3/blob/b8909222051b4103a19eba19506727faece252e7/lib/open3.rb#L775
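As a rough sketch of what "matching the capture2 source" means, the pattern in `Open3.capture2` is: start a reader thread on the child's stdout, write the input to its stdin (ignoring `Errno::EPIPE`), close stdin, then collect the output and exit status. The helper name `run_with_stdin` is hypothetical and this is not the code from the PR; it also omits the `binmode` handling that `capture2` performs when asked.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper mirroring the structure of Open3.capture2.
def run_with_stdin(cmd, stdin_data)
  Open3.popen2(*cmd) do |stdin, stdout, wait_thread|
    # Read stdout concurrently so the child cannot block on a full pipe
    # while we are still writing to its stdin.
    out_reader = Thread.new { stdout.read }

    begin
      stdin.write(stdin_data)
    rescue Errno::EPIPE
      # The child closed its stdin early; capture2 ignores this as well.
    end
    stdin.close

    [out_reader.value, wait_thread.value]
  end
end

# Usage: pipe a string through a child Ruby process that upcases its input.
output, status = run_with_stdin([RbConfig.ruby, '-e', 'print STDIN.read.upcase'], 'hello')
```

Reading in a separate thread before writing is the important part: writing a large document to stdin first and reading afterwards can deadlock once the child's output pipe fills.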