Parse a minimal set of fullwidth punctuation as synonyms

jiahao commented 10 years ago

The current Unicode normalization policy (#5576, #5434) is to use NFC normalization to canonicalize identifiers. However, NFC is too conservative a choice of canonicalization: it does not prevent obfuscated code that uses, for example, full-width punctuation characters in identifiers.

Example:

julia> b＝3:5 #full-width equals
ERROR: b＝3 not defined

julia> b＝3=-1
-1

julia> [b＝3:5]
7-element Array{Int64,1}:
 -1
  0
  1
  2
  3
  4
  5

While in general we probably don't want to get into the business of building semantic knowledge of natural languages into the parser, I think at the very least we should support as synonyms the default output produced by standard input method editors. As an example, with the input method set to Pinyin - Simplified IME on OSX 10.9, typing bing1=3 on the keyboard selects the first Chinese character with phonetic spelling bing, then continues with =3 as part of the input stream. The result, when typed directly into the Julia REPL, is

julia> 丙＝3
ERROR: 丙＝3 not defined

which stems from the full-width ＝ being parsed as part of the identifier rather than the assignment operator, which is arguably what the typical user would have intended.
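
To make the "synonym" idea concrete, here is a minimal sketch of folding a small set of fullwidth punctuation to ASCII before parsing. It assumes Julia 1.x; the names `FULLWIDTH_SYNONYMS` and `fold_fullwidth` are hypothetical and are not part of Julia's parser. The sketch deliberately ignores string literals and comments, where folding may well be unwanted.

```julia
# Hypothetical pre-parse folding table; NOT what Julia's parser does.
# The fullwidth forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E
# at a fixed offset of 0xFEE0, so each entry below is ASCII + 0xFEE0.
const FULLWIDTH_SYNONYMS = Dict(
    '\uff1d' => '=',   # FULLWIDTH EQUALS SIGN
    '\uff1a' => ':',   # FULLWIDTH COLON
)

# Replace every character found in the table; leave all others untouched.
fold_fullwidth(src::AbstractString) =
    map(c -> get(FULLWIDTH_SYNONYMS, c, c), src)

# The input from the example above, written with escapes for clarity.
println(fold_fullwidth("b\uff1d3\uff1a5"))
```

Keeping the table explicit, rather than shifting the whole fullwidth block, matches the "minimal set" framing: each synonym is an individual decision.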

IainNZ commented 10 years ago

The rabbit hole really is deep with this one. What does the average CJK user do when writing code, just put the IME into English by default and maybe switch back for comments?

IainNZ commented 10 years ago

Haha, this one is quite fun. Julia 0.2:

julia> 私の名前＝"Iain"
ERROR: @私の名前＝_str not defined

Unicode macros could make for some fun looking code...

jiahao commented 10 years ago

I can't speak for everyone else, but I do switch back and forth between US ANSI and Pinyin constantly to type the proper halfwidth punctuation. I think many people don't even bother to try typing non-Roman characters into their code.

JeffBezanson commented 10 years ago

The idea here is that we only use a small number of ASCII punctuation symbols, and so if other Unicode characters are really aliases of those they should be treated the same. For example, we already treat 26 Unicode characters as whitespace. I think the fullwidth colon and equals are pretty obvious, but it does get murkier. I'm not sure what to do with the large number of quote characters in particular.

wlbksy commented 10 years ago

I switch back and forth for this. I often have trouble telling fullwidth and halfwidth apart for commas, colons, semicolons, exclamation marks, question marks, and parentheses. Periods, quotes, and other brackets seem fine because they are easy to identify. This is the first time I have realized there is a fullwidth equals sign.

wlbksy commented 10 years ago

I think the equals sign should be dealt with, since it is seldom used in strings. Let's leave the others as they are; you may need fullwidth marks in strings someday.

jiahao commented 10 years ago

Here are some Unicode normalization tables that may be useful, particularly the ones for punctuation.

stevengj commented 10 years ago

Rather than starting to add custom exceptions to NFC, my preference would be to start with NFKC (which solves the issue here of multiple input modes in Asian languages, as well as e.g. ligatures in Latin scripts or µ vs. μ) and add exceptions as needed (if a convincing real-world case arises where we really want to treat two Kompatible symbols as inequivalent). See #5434.
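
The difference between the two normal forms can be checked directly. The following is a small sketch, assuming Julia 0.7 or later, where `Unicode.normalize` lives in the `Unicode` standard library; the sample labels are only illustrative. It compares NFC and NFKC on the three cases mentioned above.

```julia
using Unicode  # standard library; provides Unicode.normalize

# Characters are written as escapes so the code points are unambiguous.
samples = [
    "fullwidth equals" => "\uff1d",  # U+FF1D FULLWIDTH EQUALS SIGN
    "fi ligature"      => "\ufb01",  # U+FB01 LATIN SMALL LIGATURE FI
    "micro sign"       => "\u00b5",  # U+00B5 MICRO SIGN
]

for (name, s) in samples
    nfc  = Unicode.normalize(s, :NFC)
    nfkc = Unicode.normalize(s, :NFKC)
    # NFC leaves these compatibility characters alone; NFKC folds them.
    println(rpad(name, 18), " NFC unchanged: ", nfc == s,
            "   NFKC result: ", repr(nfkc))
end
```

None of these characters has a canonical decomposition, which is why NFC alone does not merge them with their ASCII or Greek counterparts.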

stevengj commented 10 years ago

Since we settled on NFC, it might be useful to revisit this issue and add a limited set of custom additions to our Unicode normalization.

The µ (micro) vs. μ (mu) issue just came up again (Keno/SIUnits.jl#23) for example, and I would tend to include this exception as well simply because µ is so easy to type on MacOS (option-m).
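
One way to picture "a limited set of custom additions" is NFC followed by a short table of extra folds. This is only a sketch, assuming Julia 0.7 or later; `EXTRA_FOLDS` and `normalize_identifier` are hypothetical names and this is not a description of how the parser implements it.

```julia
using Unicode  # standard library; provides Unicode.normalize

# Hypothetical list of exceptions applied on top of NFC.
const EXTRA_FOLDS = Dict(
    '\u00b5' => '\u03bc',  # MICRO SIGN => GREEK SMALL LETTER MU
)

# Normalize to NFC first, then apply the custom folds character by character.
normalize_identifier(s::AbstractString) =
    map(c -> get(EXTRA_FOLDS, c, c), Unicode.normalize(s, :NFC))

# Both spellings of "microseconds" end up as the same identifier.
println(normalize_identifier("\u00b5s") == normalize_identifier("\u03bcs"))
```

Compared with switching to NFKC wholesale, this keeps every equivalence an explicit, reviewable entry.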

timholy commented 10 years ago

Bump. The distinction between micro vs mu is pretty annoying. It would be great to have a decision on this for 0.3.

staticfloat commented 10 years ago

@IainNZ I just had to try it out for myself, and your use case looks like it's working in 0.3!

julia> 私の名前="鯖"
"鯖"

julia> 私の名前
"鯖"

The best part about this is that TAB-completion actually works, so I can type 私, hit <TAB> and it'll autocomplete the rest. :P

StefanKarpinski commented 7 years ago

The lack of attention for several releases makes me think we can probably let this go until some indefinite time in the future.

stevengj commented 7 years ago

Not actually implemented in my PR, though now it's easy to add

JeffBezanson commented 7 years ago

Full-width punctuation characters now give "invalid character" parse errors, so I think adding this would be non-breaking. Can probably be deferred.

JuliaLang/julia #5903