Open jiahao opened 10 years ago
The rabbit hole really is deep with this one. What does the average CJK user do when writing code, just put the IME into English by default and maybe switch back for comments?
Haha, this one is quite fun. Julia 0.2:
julia> 私の名前="Iain"
ERROR: @私の名前=_str not defined
Unicode macros could make for some fun looking code...
I can't speak for everyone else, but I do switch back and forth between US ANSI and Pinyin constantly to type the proper halfwidth punctuation. I think many people don't even bother to try typing non-Roman characters into their code.
The idea here is that we only use a small number of ASCII punctuation symbols, and so if other unicode characters are really aliases of those they should be treated the same. For example we already treat 26 unicode characters as whitespace. I think the fullwidth colon and equals are pretty obvious, but it does get murkier. I'm not sure what to do with the large number of quote characters in particular.
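To make the aliasing idea concrete, here is a minimal sketch in Python (not Julia's parser) of folding punctuation aliases into ASCII before tokenizing. The table `PUNCTUATION_ALIASES` and the function `fold_punctuation_aliases` are hypothetical names, and the table holds only the two characters called obvious above. Quote characters are deliberately left out because it is unclear how they should be handled.

```python
# Hypothetical alias table: full-width punctuation -> ASCII counterpart.
# Only the fullwidth colon and equals sign are included; everything else,
# including the many Unicode quote characters, passes through unchanged.
PUNCTUATION_ALIASES = {
    "\uFF1A": ":",  # FULLWIDTH COLON
    "\uFF1D": "=",  # FULLWIDTH EQUALS SIGN
}

_ALIAS_TABLE = str.maketrans(PUNCTUATION_ALIASES)


def fold_punctuation_aliases(source: str) -> str:
    """Replace each aliased character with its ASCII equivalent."""
    return source.translate(_ALIAS_TABLE)


print(fold_punctuation_aliases("x\uFF1D3"))  # prints x=3
```

A real implementation would probably have to skip string literals and comments, since fullwidth marks can be legitimate there.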
I switch back and forth for this. I often have trouble telling fullwidth and halfwidth commas, colons, semicolons, exclamation marks, question marks, and parentheses apart. Periods, quotes, and other brackets seem fine because they are easy to identify. This is the first time I have realized there is a fullwidth equals sign.
I think the equals sign should be dealt with, since it is seldom used in strings. Let's just leave the others as they are; maybe you'll need fullwidth marks in a string someday.
Here are some Unicode normalization tables that may be useful, particularly the ones for punctuation.
Rather than starting to add custom exceptions to NFC, my preference would be to start with NFKC (which solves the issue here of multiple input modes in Asian languages, as well as e.g. ligatures in Latin scripts or µ vs. μ) and add exceptions as needed (if a convincing real-world case arises where we really want to treat two Kompatible symbols as inequivalent). See #5434.
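For anyone who wants to see the difference between the two forms, here is a small sketch using Python's standard `unicodedata` module (used as a stand-in for whatever normalization library the parser calls). The helper `compare_forms` and the sample labels are made up for illustration.

```python
import unicodedata

# Sample characters that differ from a plainer spelling only by compatibility.
samples = {
    "full-width equals": "\uFF1D",
    "fi ligature": "\uFB01",
    "micro sign": "\u00B5",
}


def compare_forms(text):
    """Return the (NFC, NFKC) normalizations of text."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFKC", text),
    )


for label, ch in samples.items():
    nfc, nfkc = compare_forms(ch)
    print(f"{label}: NFC={nfc!r} NFKC={nfkc!r}")
```

NFC leaves all three characters as they are, while NFKC folds them to `=`, `fi`, and the Greek letter μ (U+03BC) respectively.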
Since we settled on NFC, it might be useful to revisit this issue and add a limited set of custom additions to our Unicode normalization.
The µ (micro) vs. μ (mu) issue just came up again (Keno/SIUnits.jl#23) for example, and I would tend to include this exception as well simply because µ is so easy to type on MacOS (option-m).
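A rough sketch of what "NFC plus a limited set of custom additions" could look like, again in Python rather than in the actual parser. The names `EXTRA_FOLDS` and `normalize_identifier` are hypothetical, and the table contains only the micro sign exception mentioned here.

```python
import unicodedata

# Hypothetical exception table applied on top of NFC.
# U+00B5 MICRO SIGN is folded to U+03BC GREEK SMALL LETTER MU.
EXTRA_FOLDS = str.maketrans({"\u00B5": "\u03BC"})


def normalize_identifier(name: str) -> str:
    """Apply NFC, then the small custom fold table."""
    return unicodedata.normalize("NFC", name).translate(EXTRA_FOLDS)


# Both spellings of "micro-m" now compare equal.
print(normalize_identifier("\u00B5m") == normalize_identifier("\u03BCm"))  # prints True
```

Unlike switching to NFKC, this leaves ligatures and other compatibility characters alone, so each further equivalence has to be added to the table explicitly.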
Bump. The distinction between micro vs mu is pretty annoying. It would be great to have a decision on this for 0.3.
@IainNZ I just had to try it out for myself, and your use case looks like it's working in 0.3!
julia> 私の名前="鯖"
"鯖"
julia> 私の名前
"鯖"
The best part about this is that TAB-completion actually works, so I can type 私, hit <TAB> and it'll autocomplete the rest. :P
The lack of attention for several releases makes me think we can probably let this go until some indefinite time in the future.
Not actually implemented in my PR, though now it's easy to add
Full-width punctuation characters now give "invalid character" parse errors, so I think adding this would be non-breaking. Can probably be deferred.
The current Unicode normalization policy (#5576, #5434) is to employ the NFC normalization to canonicalize identifiers. However, NFC is overly conservative as a choice of canonicalization, since it does not alleviate the possibility of writing obfuscated code using, for example, full-width punctuation characters in identifiers.
Example:
While in general we probably don't want to get into the business of building in semantic knowledge of natural languages into the parser, I think at the very least we should support as synonyms the default output produced by standard input method editors. As an example, setting the input method to Pinyin - Simplified IME on OSX 10.9, typing on the keyboard
bing1=3
selects the first Chinese character with phonetic spelling bing, then continues with =3 as part of the input stream. When typed directly into the Julia REPL, the result is not an assignment. This stems from the full-width = being parsed as part of the identifier rather than as the assignment operator, which is arguably what the typical user would have intended.
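Since the full-width character can be hard to spot by eye, here is a small Python sketch that lists the full-width characters in a line of input. The function `describe_fullwidth` is a hypothetical helper, and the identifier `x` is a placeholder rather than the character from the example above.

```python
import unicodedata


def describe_fullwidth(line: str):
    """Return (index, char, Unicode name) for each full-width character.

    A character counts as full-width when its East Asian Width property
    is "F". CJK ideographs are classed "W" (wide), so they are not reported.
    """
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(line)
        if unicodedata.east_asian_width(ch) == "F"
    ]


for index, ch, name in describe_fullwidth("x\uFF1D3"):
    print(index, ch, name)  # prints 1 ＝ FULLWIDTH EQUALS SIGN
```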