brianmario / charlock_holmes

Character encoding detection, brought to you by ICU
MIT License

Question: Using CharlockHolmes in CSV.foreach #137

Closed dyanagi closed 5 years ago

dyanagi commented 5 years ago

Hi, I would like to convert file content with CharlockHolmes::Converter.convert(content, detection[:encoding], 'UTF-8'), as in the code below, but without loading the entire file into memory. Do you have any ideas?

# What I do now: load the entire file into memory
content = File.read('test2.txt')
detection = CharlockHolmes::EncodingDetector.detect(content)
utf8_encoded_content = CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'
# What I would like to do instead:
filename = 'very_large_csv_file.csv'
CSV.foreach(filename, encoding: <<<USE CharlockHolmes::Converter.convert instead>>>, headers: true) do |row|
  # ...
end
dyanagi commented 5 years ago

Using the encoding: option on its own does not work well for a few encodings that I know of, so I currently use CharlockHolmes::Converter.convert.

dyanagi commented 5 years ago

I've found a solution.

require 'charlock_holmes'
require 'csv'

path = 'path/to/file.csv'
detection = CharlockHolmes::EncodingDetector.detect(File.read(path))

# Work around a detection accuracy issue with CP932: treat a Shift_JIS result as CP932
encoding = detection[:encoding] == 'Shift_JIS' ? 'CP932' : detection[:encoding]

CSV.foreach(path,
            encoding: "#{encoding}:UTF-8",
            headers: true) do |row|
  p row
end
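
The snippet above streams the rows, but the detection step still reads the whole file with File.read. One possible refinement is to run detection on only the first part of the file. The sketch below assumes that a leading sample is representative of the file, which may not hold: detection on a partial sample can differ from full-file detection, and the sample can end in the middle of a multibyte character. The helper names detection_sample, csv_encoding_for and read_rows_as_utf8, and the 64 KiB default, are hypothetical and not part of charlock_holmes. The encoding string has the "external:internal" form that CSV.foreach accepts, so rows are transcoded to UTF-8 as they are read.

```ruby
require 'csv'

# Hypothetical helper: read at most `limit` bytes in binary mode, so a very
# large file is never loaded in full just for encoding detection.
# File.binread returns nil for an empty file, hence the fallback to ''.
def detection_sample(path, limit = 64 * 1024)
  File.binread(path, limit) || ''
end

# Hypothetical helper: apply the Shift_JIS -> CP932 workaround from the
# snippet above and build the "external:internal" string for CSV.foreach.
def csv_encoding_for(detected_name)
  source = detected_name == 'Shift_JIS' ? 'CP932' : detected_name
  "#{source}:UTF-8"
end

# Hypothetical helper: stream the CSV row by row, transcoding to UTF-8,
# and collect each row as a Hash. Real code would process rows in the
# block instead of collecting them all.
def read_rows_as_utf8(path, detected_name)
  rows = []
  CSV.foreach(path, encoding: csv_encoding_for(detected_name), headers: true) do |row|
    rows << row.to_h
  end
  rows
end

# With the charlock_holmes gem installed, the pieces would combine like this:
#
#   require 'charlock_holmes'
#   path      = 'path/to/file.csv'
#   detection = CharlockHolmes::EncodingDetector.detect(detection_sample(path))
#   CSV.foreach(path, encoding: csv_encoding_for(detection[:encoding]), headers: true) do |row|
#     p row
#   end
```

If the sample turns out to be too small for reliable detection on a given data set, raising the limit trades memory for accuracy.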