Distinguishing between identical-looking characters in test output

rtfeldman commented 6 years ago

As https://github.com/elm-community/elm-test/issues/217 notes:

Right now it's impossible to tell the difference between String.fromList [ '\1', '\2' ] and an empty string [ ... ] This caused some confusion when I thought the input was an empty string.

This is a problem that exists regardless of which test runner you're using, and regardless of what kind of test you're writing (fuzz or not).

Here's an example of test output that could be confusing right now:

"foo bar baz"
╷
│ Expect.equal
╵
"foo bar baz"

Those look identical! Why did Expect.equal fail?

The answer is that there are two different flavors of space being used here, but you can't tell that just from looking at the output.

This would become clear if the above output looked like this instead:

"foo bar baz"
╷
│ Expect.equal
╵
"foo" ++ "\x00A0" ++ "bar baz"

(The ++ is there because "foo\x00A0bar baz" would represent the string "fooꂺr baz" - which is not what we want.)

This seems like it would resolve https://github.com/elm-community/elm-test/issues/217 in a way that's both clear and which would not require any extra effort on the part of test authors.

Implementation notes:

We'd need to determine which characters should get escaped and which should get printed normally.
We'd also want to semantically distinguish between the different string sections (e.g. change it from a String to a List String so that test runners could, for example, do different syntax highlighting on the ++ operators.
The previous point would impact how diffing would need to be done.

Thoughts? @rhofour @mgold @drathier

rhofour commented 6 years ago

Things get a little more complicated when you consider we might see more than just a string, but I think this could still work. Consider: type Foo = Foo String String

I imagine we would want to handle things like this:

Foo "bar baz" "buz"
╷
│ Expect.equal
╵
Foo ("bar" ++ "\x00A0" ++ "bax") "buz"

I believe tricky bit there would be inserting the parenthesis correctly.

Alternatively, if we could get the upstream toString function to behave this way then we'd get this behavior for "free". I think that's where this really belongs. I've opened https://github.com/elm-lang/core/issues/930 and I referenced this issue there. However, I'm not sure how a change to toString would affect the diffing code.

drathier commented 6 years ago

The \xA0 parsing looks like a compiler bug to me, or a really strong feature request. We cannot escape unicode characters safely without using ++ and not mixing unicode literals and unicode escapes in the same string literal.

I'm 100% for splitting strings into `"foo" ++ "\xA0" ++ "bar baz" given the current compiler constraints.

The main question is, which characters do we consider printable? Do we have unicode support in the terminal? Does the terminal font support this new batch of emojis? Ascii-only seems way too restrictive.

I'd like both. Something like this, where we assume 100% unicode support, but also show a safe output with a lot of escape sequences if we're not using a small subset of unicode that we deem safe, probably just [A-Za-z0-9] and maybe a few special distinct printable ascii characters, including normal space (0x20).

I'd also like any escape sequences in one of the values to cause a escape sequence to be printed in the other value in the same format, so you can compare easily. See below, where space isn't a special character, but no-breaking-space is, so we escape both to show the difference.

(unicode: "foo" ++ "\x20" ++ "bar baz")
"foo bar baz"
╷
│ Expect.equal
╵
"foo bar baz"
(unicode: "foo" ++ "\xA0" ++ "bar baz")

Combine this with some Elm-style super helpful error messages to explain why we're printing both values two times. This should also help with problems with unicode normalization, like ö vs o-dieresis.

rtfeldman commented 6 years ago

The \xA0 parsing looks like a compiler bug to me, or a really strong feature request.

That's a good point. Given that Evan rewrote the parser for 0.19, this may not be an issue after then. 🤔

Alternatively, if we could get the upstream toString function to behave this way then we'd get this behavior for "free". I think that's where this really belongs.

For that to be true, I think there would need to be other compelling use cases for that, and I'm not sure what those would be. It seems like we have identified one case where we want this behavior: elm-test output. 😄

I believe tricky bit there would be inserting the parenthesis correctly.

Agreed! Here's an idea:

If what we're working on begins and ends with " (that is, we're rendering a plain String), no parens are necessary
Otherwise, we scan left from the character in question until we hit a (non-backslash-escaped) " character, and replace it with "), then repeat this process scanning to the right and replacing with ")

Not the most efficient, but seems like it would work. 😄

The main question is, which characters do we consider printable? Do we have unicode support in the terminal? Does the terminal font support this new batch of emojis? Ascii-only seems way too restrictive.

If the terminal font is lacking support, my understanding is that it'll still print characters that are discernable as different (so you can clearly see where the problem is)—they'll just be the wrong characters.

I think we can decouple that discussion from this one; if people report that as a problem, we can deal with it based on learning what their particular scenario is.

For our purposes, I think the answer is "do this for visually indistinguishable characters"—as in, any character for which there exists another Unicode character that cannot be visually told apart from it.

we assume 100% unicode support, but also show a safe output with a lot of escape sequences if we're not using a small subset of unicode that we deem safe, probably just [A-Za-z0-9] and maybe a few special distinct printable ascii characters, including normal space (0x20).

👍 Sounds good to me!

I'd also like any escape sequences in one of the values to cause a escape sequence to be printed in the other value in the same format, so you can compare easily.

Hm...is this better? To me, I found the example in OP easier to compare. I can quickly see "oh, there's a space there versus a...I don't know what the other thing is, but it's not a space."

In this case (error message or no), I have to do more work to reach that conclusion. ("\x20? Why the heck is that there? I never wrote that! [ read stuff ] Oh, that's a space. I see.")

rhofour commented 6 years ago

Hm...is this better? To me, I found the example in OP easier to compare. I can quickly see "oh, there's a space there versus a...I don't know what the other thing is, but it's not a space."

In this case (error message or no), I have to do more work to reach that conclusion. ("\x20? Why the heck is that there? I never wrote that! [ read stuff ] Oh, that's a space. I see.")

I like the idea of breaking things like this, but I agree it could be more confusing as-is. How about doing the same splitting, but still printing "normal" characters like this:

(unicode: "foo" ++ " " ++ "bar baz")
"foo bar baz"
╷
│ Expect.equal
╵
"foo bar baz"
(unicode: "foo" ++ "\xA0" ++ "bar baz")

That way it's still easy to compare the space vs. the unicode space, but it's also easy to see what the regular one is.

For that to be true, I think there would need to be other compelling use cases for that, and I'm not sure what those would be. It seems like we have identified one case where we want this behavior: elm-test output.

I think the same issues apply anywhere where a person is actually reading the output of toString. For example, consider a string with non-printable characters: x = "\x01\x02\x03" If you display this on the repl it will look identical to an empty string. This can be pretty surprising and confusing, especially if you're not expecting to it to contain non-printable characters.

For most types, toString returns a string you could feed to the repl to generate that type. If it's changed to output strings like "\x01\x02\x03" that's would continue to be true, but it isn't the case with the current behavior.

zwilias commented 6 years ago

The \xA0 parsing looks like a compiler bug to me, or a really strong feature request.

That's a good point. Given that Evan rewrote the parser for 0.19, this may not be an issue after then. 🤔

As far as I'm aware, this will be solved by removing octal and hexadecimal escape sequences, in favour of unicode escape sequences \u[a-fA-F0-9]{4}. So \xA0 would become \u00A0 which takes case of the parsing ambiguity.

drathier commented 6 years ago

Hm...is this better? To me, I found the example in OP easier to compare. I can quickly see "oh, there's a space there versus a...I don't know what the other thing is, but it's not a space."

I would like to print both, like in my snippet above. It's really unhelpful to get "\xD83E" ++ "\xDD13" ++ "\xD83D" ++ "\xDE0E" ++ "\xD83C" ++ "\xDF08" ++ "\xD83D" ++ "\xDE0D" instead of 🤓😎🌈😍.

So I would like something like this:

(unicode: "\xD83E" ++ "\xDD13" ++ "\xD83D" ++ "\xDE0E" ++ "\xD83C" ++ "\xDF08" ++ "\xD83D" ++ "\xDE0D")
"🤓😎🌈😍"
╷
│ Expect.equal
╵
"🤓🌈😎😍"
(unicode: "\xD83E" ++ "\xDD13" ++ "\xD83C" ++ "\xDF08" ++ "\xD83D" ++ "\xDE0E" ++ "\xD83D" ++ "\xDE0D")

(without all the ++, using \u12345 escapes, if that's supported in 0.19)

We would highlight the diff in both the string and the unicode-safe output. I'm not entirely sure the "(unicode: " ++ unicodeSafeEscape a ++ ")" is optimal, but I would like two lines like shown above.

I don't know what the other thing is, but it's not a space.

I would still like to know that it's something that looks like a space, because that's probably relevant to me. If it's a non-breaking space or an emoji tells me quite a lot about what could be the problem.

rtfeldman commented 6 years ago

this will be solved by removing octal and hexadecimal escape sequences, in favour of unicode escape sequences \u[a-fA-F0-9]{4}. So \xA0 would become \u00A0 which takes case of the parsing ambiguity.

Oh, excellent!

I would still like to know that it's something that looks like a space, because that's probably relevant to me. If it's a non-breaking space or an emoji tells me quite a lot about what could be the problem.

That's interesting. We could show hints whenever we do this, e.g.

"foo bar baz"
╷
│ Expect.equal
╵
"foo\u00A0bar baz"

Hint: "\u00A0" displays as the string " " - a nonbreaking space

We could always show the "\u____" displays as the string "__" part of the hint whenever we do that substitution, and have a few hardcoded supplemental messages for likely offenders such as "a nonbreaking space", "a zero-width space", etc.

drathier commented 6 years ago

Looks like Evan has switched to \u escapes when generating javascript just over a year ago: https://github.com/elm-lang/elm-compiler/commit/3ad0db6e8b5521cab209906631d7267c0d5ae61d

Guessing that means that we'll get \u escapes in elm soon enough.

mgold commented 6 years ago

I like @drathier's idea of displaying the unicode in addition to the normal output. I wonder if we can intelligently display this only when there are special characters present?

It also sounds like we should wait for 0.19, and gather more examples of characters that are problematic in practice.

drathier commented 6 years ago

Looks like we're waiting for 0.19 before we look into this. Adding a blocked label.

drathier commented 6 years ago

0.19 is here, we are no longer blocked. Related to https://github.com/elm-explorations/test/issues/38

elm-community / elm-test

Distinguishing between identical-looking characters in test output #225