Add optional `encoding` argument to set output character encoding

abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)

MIT License


What's this PR do?

Adds an optional `encoding` argument to `#text`, `#html` and `.read` that sets the output character encoding; the value is passed to Tika as the `--encoding` option. The value is validated against Ruby's `Encoding.name_list`, and an `ArgumentError` is raised if it isn't in that list.
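For reviewers who want the validation step spelled out, here is a minimal sketch of the idea. It is not the code in `lib/henkei.rb`: the helper names `validate_encoding!` and `tika_encoding_args` are hypothetical, the exact-match (case-sensitive) lookup is an assumption, and the `--encoding=NAME` argument form is assumed from the Tika CLI.

```ruby
# Hypothetical helper (not Henkei's actual implementation): returns the
# encoding name if Ruby knows it, nil if none was given, and raises otherwise.
def validate_encoding!(encoding)
  return nil if encoding.nil?

  name = encoding.to_s
  # Encoding.name_list holds every encoding name and alias known to Ruby,
  # e.g. "UTF-8", "ASCII-8BIT", "BINARY". The lookup here is an exact match.
  unless Encoding.name_list.include?(name)
    raise ArgumentError, "unsupported encoding: #{name.inspect}"
  end

  name
end

# Hypothetical helper: builds the extra command-line arguments for Tika.
# Assumes the option is written as --encoding=NAME.
def tika_encoding_args(encoding)
  name = validate_encoding!(encoding)
  name ? ["--encoding=#{name}"] : []
end
```

With this sketch, `tika_encoding_args('UTF-8')` yields `["--encoding=UTF-8"]`, omitting the argument yields an empty list so Tika's default applies, and an unknown name raises `ArgumentError` before any Tika process is started.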

Why is it needed?

So we can set the output encoding via Henkei / Tika instead of having to do it ourselves afterwards.

Where should the reviewer start?

lib/henkei.rb:222

How should this be manually tested?

henkei = Henkei.new 'sample.pages'
utf_8_text = henkei.text(encoding: 'UTF-8')
utf_8_text.encoding
# => #<Encoding:UTF-8>
