Should the Chapel parser allow curly quotes and UTF-8 spaces?

mppf commented 5 years ago

Occasionally one ends up with a code sample that unintentionally contains non-ASCII characters. For example, when pasting examples from PowerPoint, I observed these issues:

curly single and double quotes
non-ASCII space characters

It seems to me that the Chapel parser should allow these UTF-8 character sequences. The whitespace characters should simply be treated like an ASCII space. For the quotes, we could translate them into the most similar ASCII character (i.e. straightening the quotes, like “bla” -> "bla"). Or we could add string variants that start with an open quote and end with a close quote. (But that has the drawback of producing parse errors in cases that look like they should work.)

Supposing we normalize/straighten the quotes, we might apply a similar strategy to other UTF-8 characters that look like important punctuation.
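
As a rough illustration of the normalize/straighten option, here is a minimal Python sketch of a pre-pass that could run before lexing. It is not how the Chapel compiler works; `QUOTE_MAP` and `normalize_source` are hypothetical names, and the choice of the Unicode `Zs` (space separator) category as the definition of "space-like" is an assumption.

```python
import unicodedata

# Hypothetical mapping; which characters to straighten is an assumption,
# not a list anyone in this thread has agreed on.
QUOTE_MAP = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
}


def normalize_source(text: str) -> str:
    """Straighten curly quotes and map Unicode space separators to ' '."""
    out = []
    for ch in text:
        if ch in QUOTE_MAP:
            out.append(QUOTE_MAP[ch])
        elif unicodedata.category(ch) == "Zs":
            # Zs covers U+0020 itself plus look-alikes such as U+00A0.
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)
```

Note that a blanket pass like this also rewrites characters inside string literals and comments, so a real implementation would more likely make the decision inside the lexer, per token.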

Related to #9545.

bradcray commented 5 years ago

I feel mixed about this issue. On one hand it feels like a no-brainer; on the other, I feel nervous about opening up the Pandora's box of allowing UTF-8 in the source. If we did support it, I'd be inclined to require an open double quote to be closed with a close double quote just to make things consistent (and, arguably, provide another way to have quotes within strings without resorting to escape characters and the like? speaking of which, we'd want to support open and close single quotes as well presumably).
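
To make the paired-quote idea concrete, here is a small Python sketch of how a scanner might read a literal that must open with “ and close with ”. The function name `scan_curly_string` is hypothetical, and the sketch assumes no nesting and no escape sequences.

```python
from typing import Tuple

OPEN_DQ = "\u201c"   # left double quotation mark
CLOSE_DQ = "\u201d"  # right double quotation mark


def scan_curly_string(src: str, start: int) -> Tuple[str, int]:
    """Scan a literal opening with OPEN_DQ at src[start].

    Returns the literal's body and the index just past the closing quote.
    Straight quotes inside the body need no escaping because only
    CLOSE_DQ terminates the literal.
    """
    if src[start] != OPEN_DQ:
        raise ValueError("expected an opening curly double quote")
    end = src.find(CLOSE_DQ, start + 1)
    if end == -1:
        raise ValueError("unterminated curly-quoted string")
    return src[start + 1:end], end + 1
```

Under this rule a literal opened with a curly quote and closed with a straight one is rejected as unterminated, which is the consistency being asked for, at the cost of the parse errors mentioned earlier in the thread.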

What are examples of non-ASCII space characters? Non-breaking space or linefeeds?

mppf commented 5 years ago

> On one hand it feels like a no-brainer; on the other, I feel nervous about opening up the Pandora's box of allowing UTF-8 in the source.

I thought we already planned to open that box in #9545? Maybe you are saying the situation is different for language-relevant punctuation?

> If we did support it, I'd be inclined to require an open double quote to be closed with a close double quote just to make things consistent

That would work for me. (Mainly, I just want to be able to paste code from slides and have it work without having to fiddle with quotes.) However, I personally think normalizing them to ASCII equivalents avoids some of the Pandora's box.

> What are examples of non-ASCII space characters? Non-breaking space or linefeeds?

Hmm, I'm having trouble finding a code sample in a slide that doesn't compile when pasted due to Unicode space characters (there are plenty that have curly quotes).

An example space character is the non-breaking space, U+00A0. There are lists of other Unicode space characters at http://jkorpela.fi/chars/spaces.html and https://en.wikipedia.org/wiki/Whitespace_character#Unicode

I think that the funny space characters might only end up in a code file with my editor if there are other Unicode characters in the pasted data, though... it seems to be trying to normalize some of them. Not sure.

Anyway, I don't personally think there is any particular risk to allowing the Unicode space characters (they render as spaces and are treated as spaces...)
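
For anyone who wants to check which characters would be affected, here is a short Python sketch that classifies a few code points with the standard `unicodedata` module. Treating the `Zs` general category as "renders as a space" is an assumption on my part; `is_space_separator` is a hypothetical helper name.

```python
import unicodedata


def is_space_separator(ch: str) -> bool:
    """True if ch is in Unicode general category Zs (space separator)."""
    return unicodedata.category(ch) == "Zs"


if __name__ == "__main__":
    # A few sample code points: ASCII space, no-break space, em space,
    # ideographic space, zero width space, and tab.
    for cp in (0x20, 0xA0, 0x2003, 0x3000, 0x200B, 0x09):
        ch = chr(cp)
        name = unicodedata.name(ch, "(no name)")
        print(f"U+{cp:04X} {name}: {is_space_separator(ch)}")
```

One wrinkle: the zero width space (U+200B) is a format character rather than a space separator, and tab and linefeed are control characters, so a `Zs`-only rule would not cover them.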

bradcray commented 5 years ago

> I thought we already planned to open that box in #9545? Maybe you are saying the situation is different for language-relevant punctuation?

The question has been raised, but I didn't recall our having made any commitments or stated any intentions to necessarily do it (my sense has been that the reaction has been somewhere in the tepid-to-negative range). But even if we did decide to do it and I just don't remember, doing this part now would obviously start us down that path, which may or may not have consequences related to taking on the whole thing (e.g., the expectation may grow that we support more than that now).

bradcray commented 4 years ago

An interesting case related to this theme came up in https://github.com/chapel-lang/chapel/issues/15589, in which a user had used en dashes / minus signs rather than ASCII hyphens in their -1 expressions. This example makes me think that the proposal to extend to other symbols has merit (looking at the source code, I couldn't determine why it wasn't working even though I believed it should). Yet it is also a good example of the "Pandora's box" I referred to above: I can imagine a slippery slope of cases to wonder and worry about. For example, should em dashes also be considered equivalent to - even though they don't technically serve that purpose when used properly?

If we decided not to support additional characters, improving our error messages for such cases would be useful. For example, the error message in #15589 didn't make it at all clear that the problem related to using a non-ASCII character, and it broke the UTF-8 symbol into its component bytes, making it unrecognizable to my eyes.
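
A minimal Python sketch of the kind of diagnostic this suggests: report the offending character whole, with its code point and Unicode name, instead of its component bytes. The message wording and the name `describe_non_ascii` are hypothetical and do not reflect the compiler's actual output.

```python
import unicodedata
from typing import List


def describe_non_ascii(line: str, lineno: int) -> List[str]:
    """Return one message per non-ASCII character found in a source line."""
    messages = []
    for col, ch in enumerate(line, start=1):
        if ord(ch) > 127:
            name = unicodedata.name(ch, "unnamed character")
            messages.append(
                f"line {lineno}, column {col}: non-ASCII character "
                f"{ch!r} (U+{ord(ch):04X} {name})"
            )
    return messages


if __name__ == "__main__":
    # U+2212 MINUS SIGN looks like '-' but is not the ASCII hyphen.
    for msg in describe_non_ascii("var x = \u22121;", 3):
        print(msg)
```

Columns here count characters rather than bytes, which keeps the reported position aligned with what an editor shows.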

ben-albrecht commented 3 years ago

If we choose to open Pandora's box, I think UTF-8 characters should be used very sparingly and primarily as syntactic sugar in the base language / libraries.

It would be helpful for IDEs to support tab-completing an ASCII representation of UTF-8 characters to make them easier to write, similar to how Julia's Unicode support tab-completes LaTeX-style equivalents, e.g. \dot tab-completes to •.

However, it may take some time (O(years)) for editors to support this functionality, so it could make sense for the language to support the ASCII representations directly, for example:

// All these expressions are the same thing:
var A = B.dot(C);
var A = B • C; // UTF-8 syntactic sugar
var A = B \dot C; // (human-readable) ASCII representation of the UTF-8 syntactic sugar

As tooling gets better, maybe linters / CI tools can automatically convert ASCII representations to the UTF-8 form.
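
A sketch of what such a conversion step might look like, in Python. The `SUGAR` table is hypothetical (only `\dot` comes from the example above), as is the function name `to_utf8_form`; no such tool exists as far as this thread establishes.

```python
import re

# Hypothetical table of ASCII spellings and their UTF-8 forms.
SUGAR = {"dot": "\u2022"}  # \dot -> bullet, as in the example above

# Match a backslash followed by a known name, ending at a word boundary
# so that longer identifiers such as \dotted are left alone.
_PATTERN = re.compile(r"\\(" + "|".join(map(re.escape, SUGAR)) + r")\b")


def to_utf8_form(src: str) -> str:
    """Rewrite ASCII spellings such as \\dot into their UTF-8 form."""
    return _PATTERN.sub(lambda m: SUGAR[m.group(1)], src)


if __name__ == "__main__":
    print(to_utf8_form(r"var A = B \dot C;"))
```

Like any purely textual rewrite, this would also touch matches inside string literals and comments, so a real linter would need to work on tokens rather than raw text.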

Should the Chapel parser allow curly quotes and UTF-8 spaces? #14282