activewarehouse / activewarehouse-etl

Extract-Transform-Load library from ActiveWarehouse
MIT License

sub!': invalid byte sequence in UTF-8 #116

Closed epinault closed 12 years ago

epinault commented 12 years ago

I am using Ruby 1.9.3, and with some of my files I get the following error:

sub!': invalid byte sequence in UTF-8

One way to fix it is to force the options on line 38 of csv_parser.rb to use encoding: "ISO8859-1"

from /home/emmanuel/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/csv.rb:1855:in `block in shift'
from /home/emmanuel/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/csv.rb:1849:in `loop'
from /home/emmanuel/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/csv.rb:1849:in `shift'
from /home/emmanuel/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/csv.rb:1791:in `each'
from /home/emmanuel/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/csv.rb:1208:in `block in foreach'
from /home/emmanuel/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/csv.rb:1354:in `open'
from /home/emmanuel/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/csv.rb:1207:in `foreach'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/parser/csv_parser.rb:38:in `block in each'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/parser/csv_parser.rb:30:in `each'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/parser/csv_parser.rb:30:in `each'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/control/source/file_source.rb:45:in `each'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/engine.rb:333:in `each_with_index'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/engine.rb:333:in `block in process_control'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/engine.rb:327:in `each'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/engine.rb:327:in `process_control'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/engine.rb:275:in `process'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/engine.rb:272:in `process'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/engine.rb:55:in `process'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/commands/etl.rb:82:in `block in execute'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/commands/etl.rb:80:in `each'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/commands/etl.rb:80:in `execute'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/lib/etl/commands/etl.rb:90:in `<top (required)>'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activesupport-3.2.8/lib/active_support/dependencies.rb:251:in `require'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activesupport-3.2.8/lib/active_support/dependencies.rb:251:in `block in require'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activesupport-3.2.8/lib/active_support/dependencies.rb:236:in `load_dependency'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activesupport-3.2.8/lib/active_support/dependencies.rb:251:in `require'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/gems/activewarehouse-etl-1.0.0/bin/etl:28:in `<top (required)>'
from /home/emmanuel/talemetry/talemetry_warehouse/vendor/ruby/1.9.1/bin/etl:19:in `load'

epinault commented 12 years ago

Never mind for now... I forced MySQL to export some of the CSV files as UTF-8, and that fixed the issue.

thbar commented 12 years ago

My understanding is that your file is encoded in ISO-8859-1 and that you work by default with UTF-8, is that right?

Based on the Ruby 1.9 CSV docs, you will have to either provide an :encoding option to tell the parser that the source is in ISO-8859-1, or modify Encoding::default_external (but that is a global setting affecting all your reads).
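To illustrate the :encoding option outside of the ETL DSL, here is a minimal sketch using Ruby's standard CSV library directly. The sample data and the helper names (write_latin1_sample, read_latin1_rows) are made up for this example; the "ISO-8859-1:UTF-8" value declares the file's encoding and asks CSV to transcode fields to UTF-8.

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: writes a small CSV file encoded in ISO-8859-1.
def write_latin1_sample
  file = Tempfile.new(["latin1_sample", ".csv"])
  file.binmode # write raw bytes, no transcoding
  file.write("name,city\nJos\u00E9,Montr\u00E9al\n".encode("ISO-8859-1"))
  file.close
  file
end

# Hypothetical helper: reads every row, declaring the source encoding
# (ISO-8859-1) and the encoding we want for the parsed fields (UTF-8).
def read_latin1_rows(path)
  rows = []
  CSV.foreach(path, encoding: "ISO-8859-1:UTF-8") { |row| rows << row }
  rows
end

LATIN1_SAMPLE = write_latin1_sample
LATIN1_ROWS = read_latin1_rows(LATIN1_SAMPLE.path)
```

The alternative mentioned above, setting Encoding.default_external = "ISO-8859-1", would also change how every other file in the process is read, which is why the per-call option is usually the safer choice.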

You should be able to pass the :encoding option without having to hack the source code (the options are propagated from the DSL to the line you pointed to, if I'm right).

Alternatively you may want to preprocess the file if you prefer (I tend to do that in a first pass).
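As a rough sketch of such a preprocessing pass, assuming the source file really is ISO-8859-1: the helper name transcode_to_utf8 and the file names below are hypothetical, and this is plain Ruby rather than anything provided by activewarehouse-etl.

```ruby
require "tmpdir"

# Hypothetical helper: copies src_path to dest_path, converting each line
# from source_encoding to UTF-8 so later steps can assume UTF-8 input.
def transcode_to_utf8(src_path, dest_path, source_encoding = "ISO-8859-1")
  File.open(src_path, "r:#{source_encoding}") do |input|
    File.open(dest_path, "w:UTF-8") do |output|
      input.each_line { |line| output.write(line.encode("UTF-8")) }
    end
  end
  dest_path
end

# Demo with made-up sample data written as ISO-8859-1 bytes.
TRANSCODE_DIR = Dir.mktmpdir
TRANSCODE_SRC = File.join(TRANSCODE_DIR, "latin1.csv")
TRANSCODE_DEST = File.join(TRANSCODE_DIR, "utf8.csv")
File.binwrite(TRANSCODE_SRC, "name,city\nJos\u00E9,Montr\u00E9al\n".encode("ISO-8859-1"))
transcode_to_utf8(TRANSCODE_SRC, TRANSCODE_DEST)
TRANSCODE_RESULT = File.read(TRANSCODE_DEST, encoding: "UTF-8")
```

The converted file can then be fed to the ETL process with the default UTF-8 settings.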

Can you check if passing the :encoding option works for you? If it does, we'll close this issue and open a documentation issue instead; this will certainly become a FAQ.

thbar commented 12 years ago

I missed your comment while writing mine! Ok - I'll close this one (but it probably needs some documentation here).

epinault commented 12 years ago

Yes! Adding this to the docs would help for sure :) That would be nice to know for the future :)