String values render non-ASCII characters as code points

kevingriffin commented 9 years ago

When evaluating Ruby in Atom, strings with multibyte characters display as code points.

All of these examples are done with seeing_is_believing (2.1.4) and Ruby 2.1.3 on OS X 10.10.1.

Atom Example:

# encoding: utf-8
"楕円曲線暗号"  # => "\u6955\u5186\u66F2\u7DDA\u6697\u53F7"
"テスト"       # => "\u30C6\u30B9\u30C8"

This only seems to happen in Atom, and not in other editors, so it seems like this is the best place to file the issue. For reference, here's the same file with SIB run over it in TextMate:

TextMate Example:

# encoding: utf-8
"楕円曲線暗号"  # => "楕円曲線暗号"
"テスト"     # => "テスト"

JoshCheek commented 9 years ago

Hmmmm. Fucking encodings :(

What do you get when you do:

Encoding.default_external
Encoding.default_internal

JoshCheek commented 9 years ago

It looks like Atom gives me different values :/

Compared by looking at the bytes printed by this:

# encoding: utf-8
puts File.read(__FILE__).bytes  # => nil
"楕".encoding                    # => #<Encoding:UTF-8>

I don't have any good way to explain it, but figure I should document what I tried:

# encoding: utf-8

# for whatever reason, the default is US-ASCII and nil, despite our assertion
Encoding.default_external  # => #<Encoding:US-ASCII>
Encoding.default_internal  # => nil

# but the actual internal encoding is US-ASCII
ARGF.external_encoding  # => #<Encoding:US-ASCII>
ARGF.internal_encoding  # => #<Encoding:US-ASCII>

# but even the file system thinks it's supposed to be utf-8
`file #{__FILE__} -bI`  # => "text/plain; charset=utf-8\n"

# when inspecting, the encoding changes
s = "テスト"           # => "\u30C6\u30B9\u30C8"
s.encoding          # => #<Encoding:UTF-8>
s.inspect.encoding  # => #<Encoding:US-ASCII>

# if we set the encoding, this fixes it
# also, it's confused about the length of the line
# I noticed on TextMate, it looked much more "scrunched", idk if this matters
Encoding.default_internal = Encoding::UTF_8  # => #<Encoding:UTF-8>
s = "テスト"                                    # => "テスト"

Do any of these look more correct?

puts Encoding.constants
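
For anyone reproducing this outside an editor, here is a minimal sketch of the behaviour seen in the experiment above. As far as I can tell, `String#inspect` builds its result in `Encoding.default_internal`, falling back to `Encoding.default_external`, and escapes any character it cannot represent there. The helper `inspect_with_defaults` is hypothetical (it is not part of seeing_is_believing); it only swaps the process-wide defaults for the duration of one `inspect` call.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of seeing_is_believing): runs String#inspect
# with the process-wide default encodings temporarily swapped out.
def inspect_with_defaults(str, external:, internal:)
  old_external = Encoding.default_external
  old_internal = Encoding.default_internal
  old_verbose, $VERBOSE = $VERBOSE, nil # silence "setting Encoding.default_*" warnings
  Encoding.default_external = external
  Encoding.default_internal = internal
  str.inspect
ensure
  # Always restore the original defaults, even if inspect raised.
  Encoding.default_external = old_external
  Encoding.default_internal = old_internal
  $VERBOSE = old_verbose
end

s = "テスト"

# Roughly what Atom's environment amounts to: US-ASCII external, no internal.
escaped = inspect_with_defaults(s, external: Encoding::US_ASCII, internal: nil)
puts escaped           # prints "\u30C6\u30B9\u30C8" (including the quotes)
puts escaped.encoding  # prints US-ASCII

# Roughly what TextMate's environment amounts to: UTF-8 external.
readable = inspect_with_defaults(s, external: Encoding::UTF_8, internal: nil)
puts readable.encoding # prints UTF-8
```

The source string is UTF-8 in both calls; only the encoding chosen for the inspected result changes.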

kevingriffin commented 9 years ago

I wonder if it has to do with environment variables. It seems like the C function rb_enc_default_internal looks at $LANG to decide the encoding when calling String#inspect:

TextMate

ENV["LANG"]           # => "en_US.UTF-8"
"".inspect.encoding   # => #<Encoding:UTF-8>

Atom

ENV["LANG"]               # => nil
"".inspect.encoding       # => #<Encoding:US-ASCII>

JoshCheek commented 9 years ago

It totally does! And you can set env vars in the config.cson, e.g. mine currently contains:

  'seeing-is-believing':
    'ruby-command': '/Users/josh/code/dotfiles/bin/sib_ruby'
    'add-to-env':
      'SHELL': 'bash'
      'LANG':  'en_US.UTF-8'

I should default it to that if it doesn't have one set, but this will fix it in the meantime.
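
To check what a given `LANG` does to a freshly started Ruby (which is what the `add-to-env` setting influences), something like the following sketch may help. The helper `default_external_under` is hypothetical. It assumes `LC_ALL` and `LC_CTYPE` would otherwise override `LANG`, so it unsets them along with `RUBYOPT`. The encodings reported depend on the platform and on which locales are installed, so no particular output is promised here.

```ruby
require "open3"
require "rbconfig"

# Hypothetical helper: asks a child Ruby which default external encoding it
# picked up when started with the given LANG (nil means LANG is unset).
# LC_ALL and LC_CTYPE take precedence over LANG, and RUBYOPT could carry -E,
# so all three are removed from the child's environment.
def default_external_under(lang)
  env = { "LANG" => lang, "LC_ALL" => nil, "LC_CTYPE" => nil, "RUBYOPT" => nil }
  out, _status = Open3.capture2(env, RbConfig.ruby, "-e",
                                "print Encoding.default_external.name")
  out
end

[nil, "C", "en_US.UTF-8"].each do |lang|
  puts "LANG=#{lang.inspect} -> #{default_external_under(lang)}"
end
```

Comparing the `nil` row with the `en_US.UTF-8` row is the quickest way to see whether a missing `LANG` explains the escaped output on a particular machine.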

JoshCheek commented 9 years ago

The incorrect alignment is due to the font rendering the characters double-wide. IDK if there's a way to deal with that; I'm pretty ignorant about encodings. If we can detect which characters have this issue, I'm willing to accept a flag in SiB that treats chars rendered double-wide as if they actually are double-wide, in order to preserve correct alignment.
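
One possible way to detect those characters is sketched below. It is a heuristic, not the full Unicode East Asian Width table, and the names `WIDE_CHAR`, `display_width`, and `pad_to_column` are made up for illustration (nothing like them exists in SiB as far as this thread shows). Characters in the listed ranges count as two columns, everything else as one.

```ruby
# encoding: utf-8

# Approximate ranges for characters that monospace fonts usually draw two
# columns wide (CJK, kana, Hangul syllables, fullwidth forms). This is an
# assumption-laden heuristic: it ignores combining marks and emoji.
WIDE_CHAR = /[\u1100-\u115F\u2E80-\uA4CF\uAC00-\uD7A3\uF900-\uFAFF\uFE30-\uFE6F\uFF00-\uFF60\uFFE0-\uFFE6\u{20000}-\u{2FFFD}]/

# Number of terminal/editor columns the string is expected to occupy.
def display_width(str)
  str.each_char.sum { |char| char.match?(WIDE_CHAR) ? 2 : 1 }
end

# Pads with spaces until the string occupies `column` display columns.
def pad_to_column(str, column)
  str + " " * [column - display_width(str), 0].max
end

lines  = ['"楕円曲線暗号"', '"テスト"']
column = lines.map { |line| display_width(line) }.max + 2
lines.each { |line| puts "#{pad_to_column(line, column)}# => #{line}" }
```

Padding by display width instead of character count is what would keep the `# =>` markers lined up when the font draws these characters double-wide.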

JoshCheek commented 9 years ago

Hey, Kevin, how much do you know about encodings? Mostly wondering whether it's okay for me to use UTF-8 internally on all inspected data. Previously I didn't mess with encodings at all, but when they get horked, they totally mess it up, and I'm not sure that my solution is correct since I'm too ignorant to even come up with reasonable test cases.

Here's the code that does it (not yet released). It's based on this issue.

kevingriffin commented 9 years ago

The only thing I can think of is a scenario in which you had strings with two encodings, neither of which were UTF-8—encoding A and encoding B. If there was a direct mapping between a character C from A to B, but not from A to UTF-8 or UTF-8 to B, you'd probably end up stomping a valid character with scrub. Unfortunately I wasn't able to come up with actual values for A and B and C here to make a test case out of.

That said, I'm not sure where the code comes from that CommentLines operates on, but if you know it's UTF-8, then there's probably no harm—you've got to get those two strings together, and it's hard to imagine coercing the inspected value into UTF-8 is avoidable.

I'm not sure you know that in every case, though. Here's a test program that SiB fails on for me:

# encoding: shift_jis
katakana_ka = "カ"
puts katakana_ka

I get the following stack trace running SiB 2.1.4 with Ruby 2.1.3 on the file:

/Users/kevin/.rvm/gems/ruby-2.1.3/gems/parser-2.1.9/lib/parser/source/buffer.rb:98:in `encode'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/parser-2.1.9/lib/parser/source/buffer.rb:98:in `reencode_string'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/parser-2.1.9/lib/parser/source/buffer.rb:153:in `source='
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/parser_helpers.rb:18:in `initialize_parser'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/wrap_expressions.rb:21:in `initialize'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/wrap_expressions.rb:12:in `new'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/wrap_expressions.rb:12:in `call'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing.rb:52:in `program_that_will_record_expressions'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing.rb:34:in `call'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing.rb:15:in `call'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/binary/add_annotations.rb:29:in `initialize'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/binary.rb:154:in `new'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/binary.rb:154:in `printer'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/binary.rb:114:in `evaluate_program'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/binary.rb:45:in `call'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/bin/seeing_is_believing:6:in `<top (required)>'
/Users/kevin/.rvm/gems/ruby-2.1.3/bin/seeing_is_believing:23:in `load'
/Users/kevin/.rvm/gems/ruby-2.1.3/bin/seeing_is_believing:23:in `<main>'
/Users/kevin/.rvm/gems/ruby-2.1.3/bin/ruby_executable_hooks:15:in `eval'
/Users/kevin/.rvm/gems/ruby-2.1.3/bin/ruby_executable_hooks:15:in `<main>'

My guess is that your solution would have prevented this from happening, but replaced the カ character with a � in the comment. Ruby runs the program and outputs the expected character.

I'm not sure what the correct behavior here is exactly, but it seems like an ideal solution should try to defer to the original encodings of the source code and inspected value until it's known that a coercion isn't possible. Without digging a bit deeper into how SiB works, I'm not sure if this is practical.
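
A rough sketch of that "defer to the original encoding first" idea, under the assumption that the string still carries a correct encoding label: transcode when the label is something other than UTF-8, and fall back to `scrub` only when the string already claims to be UTF-8. The helper `to_utf8` is hypothetical and is not SiB's actual code.

```ruby
# encoding: utf-8

REPLACEMENT = "\uFFFD" # the replacement character shown as a question-mark box

# Hypothetical helper, not SiB's implementation: coerce a string to UTF-8,
# preferring a real transcode over replacing bytes.
def to_utf8(str)
  if str.encoding == Encoding::UTF_8
    # Already labelled UTF-8: all we can do is replace invalid byte sequences.
    str.scrub(REPLACEMENT)
  else
    # Labelled with another encoding: convert, replacing only what has no mapping.
    str.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: REPLACEMENT)
  end
end

ka_bytes = [0x83, 0x4A].pack("C*") # the two bytes of katakana "カ" in Shift_JIS

labelled    = ka_bytes.dup.force_encoding(Encoding::Shift_JIS)
mislabelled = ka_bytes.dup.force_encoding(Encoding::UTF_8)

p to_utf8(labelled) == "カ"          # => true, the character survives
p to_utf8(mislabelled) == "\uFFFDJ"  # => true, the lead byte was stomped
```

The second case is the stomping scenario: once the Shift_JIS label is lost, `scrub` can only replace the invalid byte, so the original character is unrecoverable. Note that `scrub` needs Ruby 2.1 or later.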

A bit unrelated, scrub is a Ruby 2.1 feature, which would probably lock out a lot of versions you currently support. Then again, this change is in your 3.0 branch, so you're probably aware. :)

JoshCheek / atom-seeing-is-believing

String values render non-ASCII characters as code points #8
