SFMOCI / openlaw

Downloading, cloning, forking or using these files in any way indicates that you've read and accept the terms of use in the enclosed notice.
https://github.com/SFMOCI/openlaw/blob/master/Notice.md
MIT License

Encoded txt files as UTF-8 #4

Closed seanknox closed 10 years ago

jasonlally commented 10 years ago

Thanks for doing that. Did you do this manually or programmatically? The reason I ask is that our vendor will be pushing updates to FTP, which we then automatically add and commit to the repo. I'm checking whether they can save with UTF-8 encoding on their end, but just in case, I may need to script this so it doesn't have to be done manually.

seanknox commented 10 years ago

Programmatically. Here's my quick hack:

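require 'charlock_holmes'

detector = CharlockHolmes::EncodingDetector.new

ARGV.each do |f|
  # Read raw bytes so Ruby does not apply an encoding or newline conversion before detection.
  content = File.binread(f)
  detection = detector.detect(content)
  # Skip files with no detectable text encoding (e.g. binary files).
  next warn("#{f}: encoding not detected, skipping") unless detection && detection[:encoding]
  puts "#{f} encoding: #{detection[:encoding]}"
  utf8_encoded_content = CharlockHolmes::Converter.convert(content, detection[:encoding], 'UTF-8')
  File.write(f, utf8_encoded_content)
end

seanknox commented 10 years ago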

That's possible. I'm not sure there's a way to make the transcoder a bit smarter about characters like that, but I'll look. I think the best way forward is to have the vendor encode as UTF-8 directly.
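
If the vendor can't switch to UTF-8 and the conversion has to run automatically on each FTP update, a standard-library-only fallback might look roughly like the sketch below. It is not what was run on this repo. It assumes the incoming files are Windows-1252, which the thread never states, and `to_utf8` and `ASSUMED_SOURCE_ENCODING` are hypothetical names made up for illustration. Files that are already valid UTF-8 pass through unchanged, so the script can run repeatedly without double-encoding, and bytes with no mapping become U+FFFD rather than raising.

```ruby
# Hypothetical fallback; the thread does not name the vendor's actual encoding.
ASSUMED_SOURCE_ENCODING = 'Windows-1252'

# Returns the given bytes as a UTF-8 string. Input that is already valid
# UTF-8 is returned as-is, so repeated runs do not double-encode.
def to_utf8(bytes, source_encoding = ASSUMED_SOURCE_ENCODING)
  as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return as_utf8 if as_utf8.valid_encoding?

  # Bytes with no mapping in the source encoding become U+FFFD instead of raising.
  bytes.dup.force_encoding(source_encoding)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "\uFFFD")
end

if __FILE__ == $PROGRAM_NAME
  # Rewrite each file named on the command line in place.
  ARGV.each do |path|
    File.binwrite(path, to_utf8(File.binread(path)))
  end
end
```

Saved under any name, say `to_utf8.rb`, it would be run as `ruby to_utf8.rb path/to/*.txt`. The trade-off against detection is that a wrong guess about the source encoding silently produces wrong characters, so this only makes sense once the vendor's encoding is actually known.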

jasonlally commented 10 years ago

2 addressed