bigcartel / dugway

Easily build and test Big Cartel themes.
https://developers.bigcartel.com/api/themes
MIT License

Encoding issue (incompatible character encodings) with real store data (product names, categories...) #68

Closed danrydell closed 11 years ago

danrydell commented 11 years ago

Hi, I get the error "incompatible character encodings: CP850 and UTF-8" when I test with a real store in config.ru that has non-ASCII characters in, for instance, product names. I tested it with my own store in Spanish and with other Spanish stores. Is this a known issue? If not, where could the mismatch be? I see no encoding setting in Big Cartel; is there one in Dugway?

Thanks.

mattwigham commented 11 years ago

That's strange. Can you give us an example subdomain or two for us to investigate? Thanks.

danrydell commented 11 years ago

These are some problematic ones I tried:

delaovejaalamadeja
nataliamartinez 

When I change to an English one:

cukui
victorclothing

and all is fine.

Tested with a fresh project.

Thanks.

mattwigham commented 11 years ago

Hmm, I tested both stores on a fresh Dugway theme and both worked fine. Can you tell me which Ruby version you're using? (i.e. ruby -v) Also, can you take a screenshot of the error or paste in the backtrace? Thanks.

danrydell commented 11 years ago

I'm on Windows 7 64-bit, but I installed the 32-bit RubyInstaller because of a problem some people have reported with the 64-bit build on 64-bit Windows.

ruby 2.0.0p0 (2013-02-24) [i386-mingw32]

Will this do?

[Screenshot: Encoding::CompatibilityError]

mattwigham commented 11 years ago

Thanks, that's helpful. So we're thinking your editor is saving the theme files as CP850. Do you know if that's the case? You could test that in IRB on one of your theme source files:

File.read('source/home.html').encoding

danrydell commented 11 years ago

I'm looking into it, because Aptana saves as UTF-8 by default, and when I look at each file's properties they show UTF-8. Notepad shows the same, and even when I explicitly save as UTF-8, IRB still reports CP850. Even newly created project files are reported as CP850 by Ruby. It could be that my installation is creating files as CP850, but why files that I save as UTF-8 still report as CP850 is beyond me.

 File.read('source/layout.html').encoding
 => #<Encoding:CP850>

Thanks for your troubles.

mattwigham commented 11 years ago

I just pushed a new version (0.6.5) that may fix your issue. Try running gem install dugway to update and give it another try.

danrydell commented 11 years ago

It seems to work :o) No error now, and category names containing non-ASCII characters display as expected. IRB still reports files as CP850 though.

If you need me to check anything in particular just tell me.

Thanks for all your help.

mattwigham commented 11 years ago

Great! I think this will solve any weird encoding issues like this in the future, so I'll go ahead and close this issue. Thanks for helping us figure it out.

outerim commented 11 years ago

@mikimou Glad the issue is resolved. I'm guessing the issue is your system/ruby locale setting. I'm not sure if this was a compile time issue with the ruby installer or if it's based on your Windows language/keyboard settings. I'm posting this mainly for posterity and your edification so you can understand what ruby's doing in this case. From IRB try running this.

Encoding.find('locale')
Encoding.locale_charmap
Encoding.default_external

Here is an example from my system, ruby 2.0 mac osx 10.8:

irb(main):001:0> Encoding.find('locale')
=> #<Encoding:UTF-8>
irb(main):002:0> Encoding.locale_charmap
=> "UTF-8"
irb(main):003:0> Encoding.default_external
=> #<Encoding:UTF-8>

My guess is that your default_external encoding is CP850. This means that any time Ruby reads a file from the filesystem, regardless of its actual encoding (as you said, your editors were writing UTF-8), the content will be interpreted as CP850. I'm not certain what Ruby would do if your templates contained characters that fell outside the valid codes in CP850. It might throw an error or it might properly read the file as UTF-8. It might even munge the characters and interpret the file as CP850. Hard to say for certain without more testing.
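A minimal sketch of the read step, assuming default_internal is unset: as far as I can tell, Ruby keeps the file's bytes as they are and only labels the resulting string with the external encoding. Passing encoding: "CP850" to File.read stands in here for a CP850 default_external; the file contents and the store_data string are made up for illustration.

```ruby
require "tempfile"

# "í" saved as UTF-8 is the two bytes 0xC3 0xAD.
utf8_bytes = [0xC3, 0xAD].pack("C*")

file = Tempfile.new("template")
file.binmode
file.write(utf8_bytes)
file.close

# Hypothetical stand-in for a system whose default_external is CP850.
template = File.read(file.path, encoding: "CP850")
file.unlink

puts template.encoding                    # the label Ruby attached
puts template.bytes == utf8_bytes.bytes   # bytes are untouched

# Store data arriving as UTF-8 (illustrative value, "Diseño").
store_data = "Dise\u00F1o"

# Mixing the two labels fails once both sides contain non-ASCII bytes.
error = begin
  template + store_data
  nil
rescue Encoding::CompatibilityError => e
  e
end
puts error.class
```

The concatenation fails because both strings contain non-ASCII bytes and carry different encoding labels, which is consistent with the error reported at the top of this issue.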

Matt's change in 965b93a1967a10283f3cee1c31e22d1bd5aaec99 essentially ensures that the file is re-encoded (yet again?) as UTF-8. It's a reasonable change given the circumstances (i.e. we don't know what every Dugway user's locale will be and hence what encoding files will be read in).

danrydell commented 11 years ago

Right on the spot:

irb(main):001:0> Encoding.find('locale')
=> #<Encoding:CP850>
irb(main):002:0> Encoding.locale_charmap
=> "CP850"
irb(main):003:0> Encoding.default_external
=> #<Encoding:CP850>
irb(main):004:0>

I run Windows 7 in Spanish with a Spanish locale. As far as I know, the encoding can't be set independently on Windows (I think I ran into this some time ago).

 essentially ensures that the file is re-encoded (yet again?) as UTF-8

That fix might pose a problem: non-ASCII characters that are hardcoded (I mean in the templates, not pulled from Big Cartel) are not being interpreted correctly:

í => ├¡

Also, I think there might be BOM issues with the conversion.
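The í => ├¡ result can be reproduced in isolation. This is a sketch under the assumption that the template bytes were UTF-8 on disk but labeled CP850 when read, and were then converted to UTF-8; it is not Dugway's actual code.

```ruby
# "í" saved as UTF-8 is the two bytes 0xC3 0xAD.
utf8_bytes = [0xC3, 0xAD].pack("C*")

# Label the bytes as CP850, as a CP850 default_external would.
mislabeled = utf8_bytes.dup.force_encoding("CP850")

# Converting to UTF-8 now maps each byte through the CP850 table:
# 0xC3 is "├" and 0xAD is "¡" in CP850.
converted = mislabeled.encode("UTF-8")
puts converted   # ├¡

# Relabeling the same bytes as UTF-8 (no conversion) gives the intended text.
intended = utf8_bytes.dup.force_encoding("UTF-8")
puts intended    # í
```

The bytes were never wrong; only the label was, so converting from the wrong label is what produces the garbled pair.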

@outerim Thanks a lot for your explanation ;o)

outerim commented 11 years ago

@ihearithurts, I was afraid of the issue he mentions in light of your fix. That is to say, if someone is using a locale whose encoding covers only a subset of UTF-8 and uses UTF-8 characters in their templates, the initial fix will not be sufficient. It doesn't make matters worse; the damage is done when Ruby first reads the file. Perhaps we should try setting the default_external encoding to UTF-8 instead of converting whatever we happen to get, which may have already been destroyed. Another option may be to add something to the dugway shebang (e.g. -KU). I'm thinking that may be the best way to go actually...

@mikimou can you run the following test program on your system and see what the output is?

#!/usr/bin/env ruby -KU
puts Encoding.find('locale')
puts Encoding.locale_charmap
puts Encoding.default_external

You might need to change the shebang (first) line. Alternatively you could just save that content to say test.rb and run ruby -KU test.rb.

If it works as I anticipate it should set everything up to use UTF-8. If so, we should be able to change dugway's shebang and rely on all the content we read to be UTF-8 instead of having to re-encode on the fly.

danrydell commented 11 years ago

I ran it and it outputs:

CP850
CP850
UTF-8

Just in case:

I created a test.rb file containing just:

# C:/Ruby200/bin/ruby.exe -KU 
puts Encoding.find('locale') 
puts Encoding.locale_charmap 
puts Encoding.default_external

Then executed:

ruby -KU c:\Ruby200\dev\pruebas\source\test.rb

:o/

outerim commented 11 years ago

@mikimou I've pushed a new version of dugway that changes the way we deal with encodings. Instead of changing the encoding of files we read we set the default encodings when dugway starts. Dugway now assumes that all the files it will read from your system are UTF-8. The onus is therefore on the developer to make sure their tools are creating UTF-8 encoded files, which it seems you are.
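For illustration, a sketch of what setting the default encodings at startup can look like. This is an assumption about the general approach, not a copy of Dugway's code; the file name and strings are hypothetical.

```ruby
require "tempfile"

# Assumed startup step: treat everything read from disk as UTF-8,
# regardless of what the OS locale reports.
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8

# A template saved as UTF-8 containing "í" (bytes 0xC3 0xAD).
file = Tempfile.new("layout")
file.binmode
file.write([0xC3, 0xAD].pack("C*"))
file.close

# No explicit encoding: File.read falls back to default_external.
layout = File.read(file.path)
file.unlink

puts layout.encoding

# UTF-8 store data now combines with the template without an error.
page = layout + "Dise\u00F1o"
puts page
```

With this approach nothing is converted on read, so the result is only correct if the files really are UTF-8 on disk.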

Try out this new version and let me know if UTF-8 characters in the files are properly handled now.

danrydell commented 11 years ago

Templates behave correctly now, no BOM and characters are not converted, :o)

FYI irb still outputs the same info. Files seem to be CP850, as well as locale and locale_charmap. Encoding.default_external still UTF-8.

Thank you so much.

outerim commented 11 years ago

Yeah, that's expected. This won't change the way your interpreter works. More or less, we're now doing what Rails does, which largely assumes that you're doing the right thing with your templates (e.g. encoding them in UTF-8) and ignores what your OS settings say your default encoding is.

On May 23, 2013, at 10:57 AM, mikimou notifications@github.com wrote:

Templates behave correctly now, no BOM and characters are not converted, :o)

FYI irb still outputs the same info. Files seem to be CP850, as well as locale and locale_charmap. Encoding.default_external still UTF-8.

Thank you so much.
