documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.io/docsplit/

Accept non-ascii characters in pdf headers #35

Closed stuartf closed 11 years ago

stuartf commented 12 years ago

An alternative way to handle non-ASCII characters in PDF headers. Probably not backwards compatible with Ruby 1.8.

doxavore commented 12 years ago

This didn't make any difference for me on Ruby 1.9.2 - it still doesn't like ®. Stripping them with Iconv works, but obviously loses some data.

stuartf commented 12 years ago

Hmm, I was testing on 1.9.3. I didn't think there would be that much difference from 1.9.2...

taufan commented 12 years ago

How about adding

result.encode('UTF-8', :invalid => :replace, :replace => '').encode('UTF-8')

just below

result = `#{cmd}`.chomp

on line 22 in info_extractor.rb
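
A minimal sketch of the same idea, under some assumptions. String#encode returns a new string rather than modifying the receiver, so the result has to be assigned back. Also, as far as I know, on Ruby versions before 2.1 an encode call from UTF-8 to UTF-8 leaves invalid bytes in place, so the sketch round-trips through UTF-16BE instead. The helper name clean_utf8 is hypothetical and not part of docsplit, and the sample string stands in for the command output.

```ruby
# Hypothetical helper (not part of docsplit): drop bytes that are not valid
# UTF-8 from raw command output.
def clean_utf8(raw)
  # Assumption: the output is mostly UTF-8, so tag a copy accordingly.
  text = raw.dup.force_encoding('UTF-8')
  return text if text.valid_encoding?

  # Round-trip through UTF-16BE so that encode really transcodes and
  # drops the invalid bytes, then convert back to UTF-8.
  text.encode('UTF-16BE', :invalid => :replace, :replace => '').encode('UTF-8')
end

# Stand-in for the command output; \xAE is a lone Latin-1 byte for the
# registered sign and is not valid UTF-8 on its own.
result = "Title: Acme\xAE Widgets\n".chomp
# encode does not mutate its receiver, so assign the cleaned string back.
result = clean_utf8(result)
puts result
```

Like the Iconv approach, this loses the offending characters. On Ruby 2.1 and later, String#scrub should do the same job in one call.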

kendagriff commented 12 years ago

Yeah, this didn't work for me either. Any ideas?

amalagaura commented 11 years ago

This pull request and the suggestions above were not working for me. I have a pull request with code that is working for me: https://github.com/documentcloud/docsplit/pull/65

knowtheory commented 11 years ago

Closing this as we've merged #65