JoshCheek / seeing_is_believing

Displays the results of every line of code in your file

Encoding issue #95

Closed JoshCheek closed 7 years ago

JoshCheek commented 7 years ago

Hopefully related to the binary/utf-8 issue in #92

This code: def π; end Explodes in Atom, but works correctly in the shell and TextMate2. I looked at the env vars, and LANG was set in TM but not in Atom.

When I opened Atom's console (cmd+opt+i) and ran process.env.LANG = 'en_US.UTF-8', it then worked correctly.

When I deleted it again: delete process.env.LANG, it then broke again. Stacktrace:

/Users/josh/.gem/ruby/2.3.1/gems/parser-2.3.1.4/lib/parser/source/buffer.rb:164:in `source=': invalid byte sequence in US-ASCII (EncodingError)
    from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/code.rb:26:in `initialize'
    from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary/remove_annotations.rb:15:in `new'
    from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary/remove_annotations.rb:15:in `initialize'
    from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary/remove_annotations.rb:8:in `new'
    from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary/remove_annotations.rb:8:in `call'
    from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary/engine.rb:102:in `normalized_cleaned_body'
    from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary/engine.rb:91:in `code'
    from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary/engine.rb:28:in `syntax_error?'
    from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary.rb:35:in `call'
    from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/bin/seeing_is_believing:6:in `<top (required)>'
    from /Users/josh/.gem/ruby/2.3.1/bin/seeing_is_believing:22:in `load'
    from /Users/josh/.gem/ruby/2.3.1/bin/seeing_is_believing:22:in `<main>'
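
A minimal sketch of what that error means, assuming the source arrives as UTF-8 bytes but gets labeled US-ASCII because no locale is set. The variable names are illustrative and not taken from SiB or parser:

```ruby
# "def <pi>; end" written with a \u escape so this snippet itself stays ASCII.
# Pi is two bytes in UTF-8 (0xCF 0x80); neither byte is valid US-ASCII.
source = "def \u03C0; end"

# Simulate an unset LANG: same bytes, but labeled US-ASCII.
mislabeled = source.dup.force_encoding(Encoding::US_ASCII)

puts source.valid_encoding?      # true
puts mislabeled.valid_encoding?  # false

# Relabeling (not transcoding) the bytes as UTF-8 makes them valid again.
relabeled = mislabeled.dup.force_encoding(Encoding::UTF_8)
puts relabeled == source         # true
```

Nothing about the bytes changes between the two cases, only the encoding label attached to them.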
JoshCheek commented 7 years ago

Without the env var set:

$stdin.external_encoding  # => #<Encoding:US-ASCII>

With the env var set:

$stdin.external_encoding  # => #<Encoding:UTF-8>
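
A rough way to reproduce this comparison from a single script, assuming a Unix-like system. The helper name `stdin_encoding_for` is made up for this sketch, and the result for `en_US.UTF-8` depends on that locale being installed:

```ruby
require "open3"
require "rbconfig"

# Run a child Ruby with the given LANG (nil unsets it) and report the
# encoding it assigns to $stdin. LC_ALL and LC_CTYPE are cleared so that
# LANG is the only locale variable in play.
def stdin_encoding_for(lang)
  env = { "LANG" => lang, "LC_ALL" => nil, "LC_CTYPE" => nil }
  script = "print $stdin.external_encoding.name"
  output, _status = Open3.capture2(env, RbConfig.ruby, "-e", script)
  output
end

puts "LANG unset:       #{stdin_encoding_for(nil)}"
puts "LANG=en_US.UTF-8: #{stdin_encoding_for("en_US.UTF-8")}"
```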
JoshCheek commented 7 years ago

According to opengroup, which cites IEEE Std 1003.1-2001 (might be this, and is almost certainly this, except you can't read it without signing up or something... w/e)

Name Meaning
LANG This variable shall determine the locale category for native language, local customs, and coded character set in the absence of the LC_ALL and other LC_* ( LC_COLLATE , LC_CTYPE , LC_MESSAGES , LC_MONETARY , LC_NUMERIC , LC_TIME ) environment variables. This can be used by applications to determine the language to use for error messages and instructions, collating sequences, date formats, and so on.
LC_ALL This variable shall determine the values for all locale categories. The value of the LC_ALL environment variable has precedence over any of the other environment variables starting with LC_ ( LC_COLLATE , LC_CTYPE , LC_MESSAGES , LC_MONETARY , LC_NUMERIC , LC_TIME ) and the LANG environment variable.
LC_COLLATE This variable shall determine the locale category for character collation. It determines collation information for regular expressions and sorting, including equivalence classes and multi-character collating elements, in various utilities and the strcoll() and strxfrm() functions. Additional semantics of this variable, if any, are implementation-defined.
LC_CTYPE This variable shall determine the locale category for character handling functions, such as tolower(), toupper(), and isalpha(). This environment variable determines the interpretation of sequences of bytes of text data as characters (for example, single as opposed to multi-byte characters), the classification of characters (for example, alpha, digit, graph), and the behavior of character classes. Additional semantics of this variable, if any, are implementation-defined.

There's a more extensive explanation on that site, including how to parse and make sense of the values, it's just prior to section 8.3

JoshCheek commented 7 years ago

Looks like MRI's hit this, too:

$ ruby < doc/ChangeLog-2.0.0 -e 'puts $stdin.read.split(/^(?=\S)/).select { |paragraph| paragraph["LANG"] }'
Sat Sep 29 19:40:32 2012  Hiroshi Shirosaki  <h.shirosaki@gmail.com>

    * test/ruby/test_unicode_escape.rb (TestUnicodeEscape#test_basic):
      set script encoding to work with LANG=C. It would work on both
      Windows and Unix. Refix of r37051.

Sat Sep 29 02:18:57 2012  Hiroshi Shirosaki  <h.shirosaki@gmail.com>

    * test/ruby/test_unicode_escape.rb (TestUnicodeEscape#test_basic):
      Use ruby only on Windows since the test fails on Unix with LANG=C.
      [ruby-core:47709] [Bug #7076]

Wed Aug  1 05:50:53 2012  Hiroshi Shirosaki  <h.shirosaki@gmail.com>

    * test/ruby/test_rubyoptions.rb (TestRubyOptions#test_encoding):
      Fix test_encoding failure on Windows.
      With chcp 65001, 1252 and 437, test_encoding failed. Test result
      depends on locale because LANG environment variable doesn't affect
      locale on Windows.
      [ruby-core:46872] [Bug #6813]
JoshCheek commented 7 years ago

Seems to have stemmed from https://github.com/JoshCheek/atom-seeing-is-believing/pull/24 but I would like to fix it in SiB (or at least guess the most likely answer in the event that the invoking context got it wrong), b/c this may be what is affecting #92, and it confusingly looks like a bug in SiB

JoshCheek commented 7 years ago

Hey, @avdi, I read your post, it was great! It all made a lot of sense to me except for the utf-8 issue. I tried to express it in a coherent manner, but instead I think I just destroyed my brain >.< In an abstract sense (ie conclusions instead of explanations) it's:

If the environment set everything correctly, it should work without needing to care about encodings. So if it blew up, then either SiB got out of whack somewhere, or it's an environment issue. If it's an SiB thing, overriding the defaults, then we should find / fix it. If it's an env thing, we should try to guess a few common environments, and if that fails we should explain to the user what the problem is.

I guess an example would be: if a user is actually using a different encoding, then we want to avoid transcoding it, since that can lose information (I assume there are encodings with info that is not encompassed by utf-8, though I haven't been able to find an example). So we should do everything in the user's encoding, and translate our internal strings into their encoding and write the file in their encoding.

To figure out which is going on, I've been trying to recreate your issue. I'm pretty sure I can reproduce each of your examples from the blog, but it would be helpful for me if you could describe to me the encoding issue you experienced in #92

Eg:

avdi commented 7 years ago

More later, but to be clear, I never experienced an encoding issue from within emacs. The problems were bigger than just encoding issues when I first set out to make the tests pass on windows.

That said, I encountered many, many encoding issues in the test suite itself.

On Windows, Ruby is going to believe that text data it is receiving from files and STDIN is IBM437, and if it is wrong, things will break. And unless told otherwise it will always transcode internal utf-8 to IBM437 when it writes, which WILL lose information.

There's no magic sniffing here - it can't know if, say, a cooperating process is actually piping it utf-8 unless that process somehow tells it so (or it just assumes by hardcoding an encoding).

-- Avdi Grimm http://avdi.org

On Dec 15, 2016 02:59, "Josh Cheek" notifications@github.com wrote:

Eg:

-

Did you run SiB from emacs?

What encoding do you think the file was? (eg what does Emacs say it is?)

How was it erroring? stacktrace / mojibaked / something else? If stacktrace, can you provide it? If mojibaked, can you screenshot the incorrect chars?

What was the text that caused the issue? (as screenshot in order to avoid encoding issues here, too :P)

Did you have any of the environment variables LANG, LC_ALL, LC_COLLATE, LC_CTYPE set? If so, what were their values?

If you're able to invoke Ruby the same manner that you invoked SiB, can you try invoking this and let me know what it says?

require 'pp'
r, w = IO.pipe
pp [
  [Encoding,
    Encoding.default_internal,
    Encoding.default_external,
    Encoding.locale_charmap,
    Encoding.find("external"),
    Encoding.find("internal"),
    Encoding.find("locale"),
    Encoding.find("filesystem"),
  ],
  [$stdin,  $stdin.internal_encoding,  $stdin.external_encoding],
  [$stdout, $stdout.internal_encoding, $stdout.external_encoding],
  ["pipe read",  r.internal_encoding, r.external_encoding],
  ["pipe write", w.internal_encoding, w.external_encoding],
  ["__ENCODING__", __ENCODING__],
  [String, "".encoding],
  ["ENV[LANG]", ENV["LANG"]],
  ["ENV[LC_ALL]", ENV["LC_ALL"]],
  ["ENV[LC_COLLATE]", ENV["LC_COLLATE"]],
  ["ENV[LC_CTYPE]", ENV["LC_CTYPE"]],
]

-

Also, if you have an ability to run bash on that machine, can you run this:

forEachEncoding() {
  code="$1"
  echo "$code"
  echo -n ' LANG= | '
  LANG= ruby -e "p $code"
  echo -n ' LANG=en_US.IBM437 | '
  LANG=en_US.IBM437 ruby -e "p $code"
  echo -n ' LANG=de_CH.IBM437 | '
  LANG=de_CH.IBM437 ruby -e "p $code"
  echo -n ' # encoding: IBM437 | '
  ruby -e "# encoding: IBM437
p $code"
  echo -n ' --internal-encoding IBM437 | '
  ruby --internal-encoding IBM437 -e "p $code"
  echo -n ' --external-encoding IBM437 | '
  ruby --external-encoding IBM437 -e "p $code"
  echo -n ' -Ks | '
  ruby -Ks -e "p $code"
}

forEachEncoding 'Encoding.default_internal'
forEachEncoding 'Encoding.default_external'
forEachEncoding 'Encoding.locale_charmap'
forEachEncoding 'Encoding.find("locale")'
forEachEncoding 'Encoding.find("external")'
forEachEncoding 'Encoding.find("internal")'
forEachEncoding 'Encoding.find("locale")'
forEachEncoding 'Encoding.find("filesystem")'
forEachEncoding 'Encoding.find("filesystem")'

forEachEncoding '$stdin.internal_encoding'
forEachEncoding '$stdin.external_encoding'
forEachEncoding '$stdout.internal_encoding'
forEachEncoding '$stdout.external_encoding'

forEachEncoding 'IO.pipe.first.internal_encoding'
forEachEncoding 'IO.pipe.first.external_encoding'
forEachEncoding 'IO.pipe.last.internal_encoding'
forEachEncoding 'IO.pipe.last.external_encoding'

forEachEncoding '__ENCODING__'
forEachEncoding '"".encoding'

I'm piping it through this monstrosity to try and make sense of all the info:

ruby -rpp -e 'rows_of_cols = $stdin.read.split(/^(?=\S)/).map { |row| row_name, *cols = row.lines.map(&:chomp); [row_name, cols.map { |col| col.split("|").map(&:strip) }] }; colnames = rows_of_cols.first.last.map(&:first); rownames = rows_of_cols.map(&:first); rows = rows_of_cols.map(&:last).map { |cols| cols.map(&:last) }; format = "%-#{rownames.max_by(&:length).length}s | #{[colnames, *rows].transpose.map { |cs| "%-#{cs.max_by(&:length).length}s" }.join(" | ")}\n"; printf format, "", *colnames; printf format, *format.scan(/\d+/).map { |n| "-" * n.to_i }; rownames.each_with_index { |name, y| printf format, name, *rows[y] }'

[image: screenshot 2016-12-15 01 32 27] https://cloud.githubusercontent.com/assets/77495/21215369/6475961a-c266-11e6-86bf-73a3fad45e76.png


avdi commented 7 years ago

Oh, and running Bash scripts is probably meaningless because the environment inside bash-on-windows tools is likely to skew from baseline windows enough to render answers which aren't generally applicable.

I think you're right that there's probably a more general solution to be had here. I'll be thinking about it. But I think it's also true that a lot of the assumptions most people rely on (or just don't think about) having to do with encodings in Ruby are wrong. And operating in a UNIX environment just makes those assumptions invisible.


JoshCheek commented 7 years ago

> On Windows, Ruby is going to believe that text data it is receiving from files and STDIN is IBM437, and if it is wrong, things will break. And unless told otherwise it will always transcode internal utf-8 to IBM437 when it writes, which WILL lose information.

It should be okay, the only data we insert are the markers and object inspections. Markers use the chars # =>~!., which are all used by Ruby, so should be transcodable. Inspections were obtained from the user's code, so barring an encoding issue that exists regardless of SiB, we should be okay to embed them into a comment. If not, then elide the result or mojibake it, but definitely don't explode.
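
A sketch of the "elide or mojibake it, but don't explode" idea, assuming the annotation is built by transcoding into the file's encoding. `annotation_for` is a hypothetical helper, not something that exists in SiB:

```ruby
# Hypothetical helper (not part of SiB): coerce an inspected result into the
# encoding of the file being annotated, replacing anything that cannot be
# represented instead of raising.
def annotation_for(inspected, target_encoding)
  marker = "# => ".encode(target_encoding)
  value  = inspected.encode(target_encoding,
                            invalid: :replace, undef: :replace, replace: "?")
  marker + value
end

# A quoted two-character Japanese string; neither character exists in IBM437.
inspected = "\"\u65E5\u672C\""
line = annotation_for(inspected, Encoding::IBM437)

puts line.encoding == Encoding::IBM437  # true
puts line.b == '# => "??"'.b            # true: unmappable characters became "?"
```

The marker characters are plain ASCII, so they transcode cleanly; only the user-derived part needs the replacement options.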

> There's no magic sniffing here - it can't know if, say, a cooperating process is actually piping it utf-8 unless that process somehow tells it so (or it just assumes by hardcoding an encoding).

For our event stream, we can explicitly set the pipe to UTF-8. No strings provided by the user pass directly through it, only string literals generated within SiB. User data is marshalled and Base64-encoded, so it passes through the pipe as an ASCII subset, and the encoding is preserved on the other side.
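
A small sketch of that round trip, assuming Marshal plus Base64 over an `IO.pipe`. This only illustrates the idea and is not SiB's actual event-stream code; `pack("m0")` is Ruby's built-in strict Base64:

```ruby
# A user string in a non-UTF-8 encoding: Shift_JIS bytes for two kana.
user_value = [0x82, 0xA0, 0x82, 0xA2].pack("C*").force_encoding(Encoding::Shift_JIS)

# Sender side: marshal, then Base64, so only ASCII characters cross the pipe.
wire = [Marshal.dump(user_value)].pack("m0")

reader, writer = IO.pipe
reader.set_encoding(Encoding::UTF_8)
writer.set_encoding(Encoding::UTF_8)
writer.puts wire
writer.close

# Receiver side: decode and unmarshal; bytes and encoding label both survive.
received = Marshal.load(reader.gets.chomp.unpack1("m0"))
reader.close

puts wire.ascii_only?                          # true
puts received.encoding == Encoding::Shift_JIS  # true
puts received.bytes == user_value.bytes        # true
```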

> Oh, and running Bash scripts is probably meaningless because the environment inside bash-on-windows tools is likely to skew from baseline windows enough to render answers which aren't generally applicable.

Aye, I deleted that from the issue after I realized it.

> I think it's also true that a lot of the assumptions most people rely on (or just don't think about) having to do with encodings in Ruby are wrong. And operating in a UNIX environment just makes those assumptions invisible.

Definitely. Having AppVeyor and users in that env is awesome ^_^

avdi commented 7 years ago

Well if the event stream is over TCP, rather than a pipe, the point is moot. We don't need to worry about that aspect.

The elephant in the room is the encoding of code-to-be-evaluated.

Here's what happens now, without any fixes:

  1. The user writes some code in an editor. Presumably, the editor uses a unicode format internally. Maybe they include some non-ASCII characters in their code.
  2. They either pipe the code to SiB, or write it to a file. Either way, it PROBABLY is written out by the editor to stream or disk as UTF-8, but this may not be true depending on their locale.
  3. So far, so good.
  4. Long before SiB evaluates the user-provided code, it reads it in from either file or STDIN. Windows tells it it is reading IBM437-encoded text, and it doesn't know any better, so it transcodes to internal UTF-8, mangling any non-ASCII characters on the way.
  5. The game is already up, even before we set up any subprocess communication. Any non-ASCII in the code-to-be-evaled is permanently mangled.
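
Step 4 can be sketched like this, assuming the editor wrote UTF-8 and Ruby was told the input is IBM437. The variable names are illustrative:

```ruby
# What the editor wrote: UTF-8 bytes for "def <pi>; end" (pi via \u escape).
bytes_on_disk = "def \u03C0; end".b

# Ruby is told the bytes are IBM437 and transcodes them to UTF-8 internally.
# Every byte is a valid IBM437 character, so nothing raises; the text is
# silently changed instead.
misread = bytes_on_disk.dup.force_encoding(Encoding::IBM437).encode(Encoding::UTF_8)

original = bytes_on_disk.dup.force_encoding(Encoding::UTF_8)

puts original.length      # 10 characters
puts misread.length       # 11 characters: pi became two unrelated symbols
puts misread == original  # false
```

Because no exception is raised at this point, the damage only shows up later, for example as a syntax error or as mojibake in the output.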

Here's what my definitely-inadequate hacked-up patch does:

...

  3. So far, so good.
  4. Long before SiB evaluates the user-provided code, it reads it in from either file or STDIN. We hardcode the assumption that the code was written to disk as UTF-8, and force that encoding. It "transcodes" external UTF-8 to internal UTF-8, mangling nothing.
  5. ...unless we were wrong about that code being in UTF-8 (maybe it was written to disk in Shift-JIS, for instance). In which case, text is mangled, and the game is up :-(
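
A sketch of the "assume UTF-8" approach with one extra check bolted on, assuming we would rather report a bad guess than continue with mangled text. `read_source_as_utf8` is a hypothetical name, not the code from the patch:

```ruby
# Hypothetical reader (not SiB's actual code): take raw bytes, assume UTF-8,
# and report clearly when that assumption turns out to be wrong.
def read_source_as_utf8(raw_bytes)
  source = raw_bytes.dup.force_encoding(Encoding::UTF_8)
  return source if source.valid_encoding?
  raise ArgumentError, "source is not valid UTF-8; it may be in another encoding"
end

utf8_bytes = "def \u03C0; end".b
sjis_bytes = [0x82, 0xA0].pack("C*")  # one kana character in Shift_JIS

puts read_source_as_utf8(utf8_bytes).encoding == Encoding::UTF_8  # true

begin
  read_source_as_utf8(sjis_bytes)
rescue ArgumentError => e
  puts e.message
end
```

The validity check catches many wrong guesses but not all of them: bytes from another encoding can occasionally happen to form valid UTF-8.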

I can think of a few potential scenarios for what OUGHT to happen instead, but I'm still mulling them over.


avdi commented 7 years ago

It may be the case that we don't want to give the subsidiary Ruby process any special encoding help, since running code un-tagged with an encoding comment under SiB should probably have the same outcome (even if it's a bad outcome) as just running the regular Ruby interpreter on it.

If that's the case, then it may be sufficient to just make sure that the entire chain of SiB reading in the source code and then writing it back out is done in binary mode, treating it as purely opaque data.
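
A sketch of the binary-mode idea, assuming the source lives in a file. `roundtrip_binary` is a made-up helper that only demonstrates the read and write calls:

```ruby
require "tempfile"

# Hypothetical helper: write bytes to a temp file, then read and rewrite them
# in binary mode (no decoding, no transcoding, no newline conversion) and
# return what ends up on disk.
def roundtrip_binary(bytes)
  Tempfile.create("sib_binary_demo") do |file|
    file.binmode
    file.write(bytes)
    file.flush

    copy = File.binread(file.path)   # tagged as binary, bytes untouched
    File.binwrite(file.path, copy)
    File.binread(file.path)
  end
end

# A Ruby snippet whose string literal holds Shift_JIS bytes (not valid UTF-8).
original = "x = '".b + [0x82, 0xA0].pack("C*") + "'\n".b
result   = roundtrip_binary(original)

puts result == original                   # true
puts result.encoding == Encoding::BINARY  # true
```

Anything that needs to understand the text, such as the parser, would still have to pick an encoding at some point; this only covers the pass-through part.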


avdi commented 7 years ago

If you want to play with any of this stuff yourself, MS has free VirtualBox images: https://developer.microsoft.com/en-us/microsoft-edge/tools/vms/