Closed kevingriffin closed 9 years ago
Hmmmm. Fucking encodings :(
What do you get when you do:
Encoding.default_external
Encoding.default_internal
It looks like Atom gives me different values :/
Compared by looking at the bytes printed by this:
# encoding: utf-8
puts File.read(__FILE__).bytes # => nil
"楕".encoding # => #<Encoding:UTF-8>
I don't have any good way to explain it, but figure I should document what I tried:
# encoding: utf-8
# for whatever reason, the default is US-ASCII and nil, despite our assertion
Encoding.default_external # => #<Encoding:US-ASCII>
Encoding.default_internal # => nil
# but the actual internal encoding is US-ASCII
ARGF.external_encoding # => #<Encoding:US-ASCII>
ARGF.internal_encoding # => #<Encoding:US-ASCII>
# but even the file system thinks it's supposed to be utf-8
`file #{__FILE__} -bI` # => "text/plain; charset=utf-8\n"
# when inspecting, the encoding changes
s = "テスト" # => "\u30C6\u30B9\u30C8"
s.encoding # => #<Encoding:UTF-8>
s.inspect.encoding # => #<Encoding:US-ASCII>
# if we set the encoding, this fixes it
# also, it's confused about the length of the line
# I noticed on TextMate, it looked much more "scrunched", idk if this matters
Encoding.default_internal = Encoding::UTF_8 # => #<Encoding:UTF-8>
s = "テスト" # => "テスト"
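For anyone trying to reproduce the behavior above outside the editor, here is a minimal sketch. It assumes that String#inspect picks its result encoding from Encoding.default_internal and falls back to Encoding.default_external when that is nil. The helper name inspect_under is made up for illustration and is not part of SiB.

```ruby
# Hypothetical helper (not part of SiB): temporarily swaps the process-wide
# default encodings, inspects a string, then restores the previous values.
def inspect_under(external:, internal:, string: "\u30C6\u30B9\u30C8")
  saved_ext, saved_int, saved_verbose =
    Encoding.default_external, Encoding.default_internal, $VERBOSE
  $VERBOSE = nil # setting the defaults at runtime warns in verbose mode
  Encoding.default_external = external
  Encoding.default_internal = internal
  string.inspect
ensure
  Encoding.default_external = saved_ext
  Encoding.default_internal = saved_int
  $VERBOSE = saved_verbose
end

# Mirrors the Atom situation: external is US-ASCII, internal is nil.
ascii = inspect_under(external: Encoding::US_ASCII, internal: nil)
# Mirrors the fix: internal is set to UTF-8.
utf8 = inspect_under(external: Encoding::US_ASCII, internal: Encoding::UTF_8)

puts ascii.encoding # the inspected string is tagged US-ASCII and uses \u escapes
puts utf8.encoding  # the inspected string is tagged UTF-8 and keeps the characters
```

The string is written with \u escapes so the sketch does not depend on how the file itself is saved.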
Do any of these look more correct?
puts Encoding.constants
I wonder if it has to do with environment variables. It seems like the C function rb_enc_default_internal looks at $LANG to decide the encoding when calling String#inspect:
ENV["LANG"] # => "en_US.UTF-8"
"".inspect.encoding # => #<Encoding:UTF-8>
ENV["LANG"] # => nil
"".inspect.encoding # => #<Encoding:US-ASCII>
It totally does! And you can set env vars in the config.cson, e.g. mine currently contains:
'seeing-is-believing':
'ruby-command': '/Users/josh/code/dotfiles/bin/sib_ruby'
'add-to-env':
'SHELL': 'bash'
'LANG': 'en_US.UTF-8'
I should default it to that if it doesn't have one set, but this will fix it in the meantime.
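A rough sketch of what "default it if it isn't set" could look like when spawning the child Ruby. Everything here is an assumption for illustration: the helper names env_with_default_lang and child_default_external are hypothetical, and this is not how SiB or the Atom package actually builds its environment.

```ruby
require "open3"
require "rbconfig"

# Assumed fallback, taken from the config shown above.
DEFAULT_LANG = "en_US.UTF-8"

# Hypothetical helper: returns a Hash copy of env with LANG filled in
# only when it is missing or empty. An existing LANG is left untouched.
def env_with_default_lang(env, default = DEFAULT_LANG)
  env  = env.to_h
  lang = env["LANG"]
  return env.dup unless lang.nil? || lang.empty?
  env.merge("LANG" => default)
end

# Hypothetical helper: runs a child Ruby with the given variables added to
# its environment and returns the name of its default external encoding.
# Note that LC_ALL or LC_CTYPE inherited from the parent take precedence
# over LANG, and the requested locale has to exist on the machine.
def child_default_external(env)
  out, _status = Open3.capture2(
    env, RbConfig.ruby, "-e", "print Encoding.default_external.name"
  )
  out
end
```

Something like child_default_external(env_with_default_lang(ENV)) can then be used to check what the spawned interpreter sees.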
The incorrect alignment is due to the font rendering the characters double-wide. IDK if there's a way to deal with that, I'm pretty ignorant about encodings. If we can detect which characters are going to have this issue, I'm willing to accept a flag in SiB to treat chars rendered double-wide as if they are actually double-wide, in order to preserve correct alignment.
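One possible way to detect those characters, as a sketch only: approximate the Unicode "wide" and "fullwidth" ranges with a regex and count matching characters as two columns. The range list below is a hand-picked approximation, not the full East Asian Width table, and display_width and pad_to are hypothetical names, not SiB APIs.

```ruby
# Approximate set of code points that most monospace fonts render
# double-wide (Hangul Jamo, CJK, kana, Hangul syllables, fullwidth forms).
# This is a simplification of the East Asian Width property.
WIDE_CHARS = /[\u1100-\u115F\u2E80-\u303E\u3041-\u33FF\u3400-\u4DBF\u4E00-\u9FFF\uAC00-\uD7A3\uF900-\uFAFF\uFE30-\uFE4F\uFF01-\uFF60\uFFE0-\uFFE6]/

# Number of terminal/editor columns the string is expected to occupy.
def display_width(str)
  str.encode(Encoding::UTF_8).each_char.sum { |ch| WIDE_CHARS.match?(ch) ? 2 : 1 }
end

# Pads with spaces so that annotations after the string line up by column.
def pad_to(str, columns)
  str + " " * [columns - display_width(str), 0].max
end
```

With this, pad_to would add fewer spaces after a line containing "テスト" than after an ASCII line with the same number of characters, which is the alignment adjustment described above.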
Hey, Kevin, how much do you know about encodings? Mostly wondering whether it's okay for me to use UTF-8 internally on all inspected data. Previously I didn't mess with encodings at all, but when they get horked, they totally mess it up and I'm not sure that my solution is correct since I'm too ignorant to even come up with reasonable test cases.
Here's the code that does it (not yet released). It's based on this issue.
The only thing I can think of is a scenario in which you had strings with two encodings, A and B, neither of which is UTF-8. If there was a direct mapping for a character C from A to B, but not from A to UTF-8 or from UTF-8 to B, you'd probably end up stomping a valid character with scrub. Unfortunately I wasn't able to come up with actual values for A, B, and C to make a test case out of.
That said, I'm not sure where the code that CommentLines operates on comes from, but if you know it's UTF-8, then there's probably no harm: you've got to get those two strings together, and it's hard to imagine coercing the inspected value into UTF-8 is avoidable.
I'm not sure you know that in every case, though. Here's a test program that SiB fails on for me:
# encoding: shift_jis
katakana_ka = "カ"
puts katakana_ka
I get the following stack trace running SiB 2.1.4 with Ruby 2.1.3 on the file:
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/parser-2.1.9/lib/parser/source/buffer.rb:98:in `encode'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/parser-2.1.9/lib/parser/source/buffer.rb:98:in `reencode_string'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/parser-2.1.9/lib/parser/source/buffer.rb:153:in `source='
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/parser_helpers.rb:18:in `initialize_parser'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/wrap_expressions.rb:21:in `initialize'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/wrap_expressions.rb:12:in `new'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/wrap_expressions.rb:12:in `call'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing.rb:52:in `program_that_will_record_expressions'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing.rb:34:in `call'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing.rb:15:in `call'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/binary/add_annotations.rb:29:in `initialize'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/binary.rb:154:in `new'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/binary.rb:154:in `printer'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/binary.rb:114:in `evaluate_program'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/lib/seeing_is_believing/binary.rb:45:in `call'
/Users/kevin/.rvm/gems/ruby-2.1.3/gems/seeing_is_believing-2.1.4/bin/seeing_is_believing:6:in `<top (required)>'
/Users/kevin/.rvm/gems/ruby-2.1.3/bin/seeing_is_believing:23:in `load'
/Users/kevin/.rvm/gems/ruby-2.1.3/bin/seeing_is_believing:23:in `<main>'
/Users/kevin/.rvm/gems/ruby-2.1.3/bin/ruby_executable_hooks:15:in `eval'
/Users/kevin/.rvm/gems/ruby-2.1.3/bin/ruby_executable_hooks:15:in `<main>'
My guess is that your solution would have prevented this from happening, but replaced the カ character with a � in the comment. Ruby runs the program and outputs the expected character.
I'm not sure what the correct behavior here is exactly, but it seems like an ideal solution should try to defer to the original encodings of the source code and inspected value until it's known that a coercion isn't possible. Without digging a bit deeper into how SiB works, I'm not sure if this is practical.
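To make the difference concrete, here is a small sketch of "transcode from the original encoding first, scrub only as a last resort". The helper name to_utf8_preserving is hypothetical, and this is not the code in the 3.0 branch. It uses the Shift_JIS bytes for カ (0x83 0x4A).

```ruby
# Hypothetical helper: tries a real transcode from the string's own encoding
# and only falls back to relabeling the bytes as UTF-8 and scrubbing them
# when no conversion is possible.
def to_utf8_preserving(str)
  str.encode(Encoding::UTF_8).scrub
rescue EncodingError
  str.dup.force_encoding(Encoding::UTF_8).scrub
end

# The bytes of カ in Shift_JIS, tagged with their real encoding.
sjis_ka = "\x83\x4A".b.force_encoding(Encoding::Shift_JIS)

# Treating the raw bytes as if they were already UTF-8 loses the character:
# 0x83 is not valid on its own, and 0x4A happens to be an ASCII "J".
naive = "\x83\x4A".b.force_encoding(Encoding::UTF_8).scrub

# Transcoding from the declared encoding keeps it.
preserved = to_utf8_preserving(sjis_ka)

puts naive.inspect
puts preserved.inspect
```

The fallback branch still loses information for data that genuinely cannot be converted, for example binary strings, so it only narrows the cases where a replacement character shows up.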
A bit unrelated, scrub is a Ruby 2.1 feature, which would probably lock out a lot of versions you currently support. Then again, this change is in your 3.0 branch, so you're probably aware. :)
When evaluating Ruby in Atom, strings with multibyte characters display as code points.
All of these examples are done with seeing_is_believing (2.1.4) and Ruby 2.1.3 on OS X 10.10.1.
Atom Example:
This only seems to happen in Atom, and not other editors, so it seems like the issue is best filed here. For reference, here's the same file with SIB run over it in TextMate:
TextMate Example: