JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

canonicalize unicode identifiers #5434

Closed stevengj closed 10 years ago

stevengj commented 10 years ago

As discussed on the mailing list, it is very confusing that

const μ = 3
µ + 1

throws a µ not defined exception (because unicode codepoints 0x00b5 and 0x03bc are rendered almost identically). This could easily be encountered in real usage because option-m on a Mac produces 0x00b5 ("micro sign"), which is different from 0x03bc ("Greek small letter mu").
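For illustration, the two code points can be compared directly in a current Julia session. This is only a minimal sketch: the names `micro` and `mu` are made up for the example, and the characters are written as escapes so that no font can hide the difference.

```julia
# Both characters are written as escapes so the editor's font cannot blur them.
micro = '\u00b5'   # MICRO SIGN (what option-m produces on a Mac)
mu    = '\u03bc'   # GREEK SMALL LETTER MU

# The characters compare unequal even though they usually render alike.
println(micro == mu)

# Their code points as integers.
println(UInt32(micro), " ", UInt32(mu))
```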

It would be good if Julia internally stored a table of easily confused Unicode codepoints, i.e. homoglyphs, and used them to help prevent these sorts of confusions. Three possibilities are:

My preference would be for the third option. I don't see any useful purpose being served by treating μ and µ as distinct identifiers.

johnmyleswhite commented 10 years ago

+100 for this. Any strategy for ensuring that homoglyphs are merged seems like a big improvement to me.

toivoh commented 10 years ago

+1 for canonicalizing everything.

stevengj commented 10 years ago

(We should probably also normalize the Unicode identifiers, in addition to canonicalizing homoglyphs.)

stevengj commented 10 years ago

One possible software package that we could adapt for this might be utf8proc, which is MIT-licensed and fairly compact (600 lines of code plus a 1M data file). It looks like it does Unicode normalization, but not homograph canonicalization (except for a small number of special cases?). Update: it looks like it handles homoglyphs for us.

jiahao commented 10 years ago

+1 for canonicalization and normalization.

We certainly don't want the same disambiguation issues with combining diacritics and nonprinting control characters (like the right to left specifier). The Unicode list contains quite a few characters with combining diacritics already; not sure if it's exhaustive though.

stevengj commented 10 years ago

Actually, it looks like the utf8proc library completely solves this problem, because it implements (among other things) the standard "KC" Unicode normalization which canonicalizes homoglyphs.

I just compiled the utf8proc library and called it from Julia via:

function snorm(s::ByteString, options=0)
       r = Ptr{Uint8}[C_NULL]
       e = ccall((:utf8proc_map,:libutf8proc), Int, (Ptr{Uint8},Csize_t,Ptr{Ptr{Uint8}},Cint), s, sizeof(s), r, options)
       e < 0 && error(bytestring(ccall((:utf8proc_errmsg,:libutf8proc), Ptr{Uint8}, (Int,), e)))
       return bytestring(r[1])
end

and then

julia> s = "µ"
julia> uint16(snorm(s)[1])
0x00b5
julia> uint16(snorm(s, (1<<1) | (1<<2) | (1<<3) | (1<<5) | (1<<12))[1])
0x03bc

works (the second argument is various canonicalization flags copied from the utf8proc.h header file).

Moreover, the utf8proc canonicalization functions (including Unicode-aware case-folding and diacritical-stripping) would be useful to have in Julia anyway. I vote that we just put the whole utf8proc into deps and export some version of this functionality in Base, in addition to canonicalizing identifiers.

jiahao commented 10 years ago

Awesome, thanks for doing the legwork on this.

JeffBezanson commented 10 years ago

That sounds like a really good idea to me.

pao commented 10 years ago

KC has one case that we probably don't care about but seems worth mentioning: superscript numerals will be normalized to normal numerals. (We probably don't care because why would you have superscript numerals in a numeric literal, but this seems like the sort of thing to be abused in a future International Obfuscated Julia Coding Contest.)

JeffBezanson commented 10 years ago

That's not totally ideal; is a cute variable name :)

jiahao commented 10 years ago

I've actually used χ² somewhere.

JeffBezanson commented 10 years ago

We also have to avoid normalizing out different styled letters that represent different symbols in mathematics.

nalimilan commented 10 years ago

The problem with ² is that it seems to mean ^2, so maybe it's better not to encourage it.

jiahao commented 10 years ago

@JeffBezanson may be referring to what UAX #15 calls font variants (see Fig. 2). They give as an example \mathfrak H vs \bbold H, but I suspect regular \phi vs script \varphi is the one that would come up fairly often. (Ironically, Github won't let me enter the characters...)

So it seems that we are leaning toward canonical equivalence, as opposed to full compatibility equivalence, in which case NFD may be sufficient rather than NFKC.

pao commented 10 years ago

For variable names, I don't see the superscript/subscript being as much of a problem, other than that, e.g., χ² will be the same identifier as χ2; if you are distinguishing these I might think you were mad.

JeffBezanson commented 10 years ago

Our use case is very different from something like a text formatter, which wants to know that superscript 2 is a 2. In a programming language any characters that look different should be considered different. We can perhaps be flexible about superscripts, but font variants of letters have to be supported.

mathpup commented 10 years ago

The initial issue raised involved confusion over U+00B5 MICRO SIGN and U+03BC GREEK SMALL LETTER MU. Normalization type NFD would not fix this problem since U+00B5 has only a compatibility decomposition to U+03BC and not a canonical decomposition. NFKC will fix that issue. The utility at http://unicode.org/cldr/utility/transform.jsp?a=Any-NFKC%0D%0A&b=µ is useful for this.
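The difference between the canonical and compatibility forms can be checked in a few lines. The sketch below assumes a Julia version that ships the `Unicode` standard library (the thread itself predates it); the variable names are illustrative only.

```julia
using Unicode  # standard library; provides Unicode.normalize

micro = "\u00b5"   # MICRO SIGN
mu    = "\u03bc"   # GREEK SMALL LETTER MU

# Canonical decomposition (NFD) leaves the micro sign untouched ...
@show Unicode.normalize(micro, :NFD) == micro

# ... while the compatibility form NFKC maps it to the Greek letter.
@show Unicode.normalize(micro, :NFKC) == mu

# NFKC also folds the superscript discussed earlier: χ² becomes χ2.
@show Unicode.normalize("\u03c7\u00b2", :NFKC) == "\u03c72"
```

All three comparisons should come out `true`, which is the trade-off under discussion: NFKC fixes µ versus μ but also merges χ² with χ2.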

stevengj commented 10 years ago

@JeffBezanson, I'm not convinced that "characters that look different should be considered different." One problem is that, unlike LaTeX, we cannot rely on a particular font/glyph being used to render particular codepoints. U+00B5 and U+03BC look distinct in some fonts (one is rendered italic) and not in others, for example. Moreover, even when codepoints are rendered distinctly, the difference will often be subtle (χ² versus χ2) and hence an invitation for bugs and confusion. (That's why these variants work for phishing scams, after all.)

I would prefer to simply state that identifiers are canonicalized to NFKC, so that only characters that look entirely distinct (as opposed to potentially slight font variations) are treated as distinct identifiers. It's useful to have variables named µ and π, but Julia shouldn't pretend that it is LaTeX.

StefanKarpinski commented 10 years ago

There are several different levels of distinction being discussed:

  1. "Indistinguishables". Different unnormalized but strongly equivalent forms – i.e. byte sequences that mean the same things but are represented differently, such as precomposed characters like U+0065, U+0301 vs. U+00E9.
  2. "Strong confusables". Characters like μ vs. µ and other things listed here that are semantically distinct but will often cause confusion and frustration due to very similar rendering.
  3. "Weak confusables." Character sequences that are normally easy to distinguish but might end up looking similar in some renderings, e.g. χ² vs. χ2.

These call for different approaches. To deal with "indistinguishables" it's pretty clear that we should just normalize them. At the other end of the spectrum, this is a pretty lousy way to deal with "weak confusables" – imagine using both χ² and χ2 in some code and being really confused when they are silently treated as the same identifier! For weak confusables, I suspect the best behavior is to treat them as distinct but emit a warning if two weakly confusable identifiers are used in the same file (or scope). In the middle, strong confusables are a tougher call – both automatically normalizing them to be the same (like with indistinguishables) and warning if they appear in the same file/scope (like weak confusables) are reasonable approaches. However, I tend to favor the warning.

I've intentionally avoided Unicode terms here to keep the problem statement separate from the solution. I suspect that we should first normalize source to NFD, which takes care of collapsing "indistinguishables". Then we should warn if two identifiers are the same modulo "compatibles" and "confusables". That means that using composed and uncomposed versions of è in the same source file would just silently work – they mean the same thing – but using both χ² and χ2, or ﬃ and ffi, in the same file would produce a warning and then proceed to treat them as distinct.
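A rough sketch of that two-step idea, written against today's `Unicode` standard library rather than anything that existed when this was posted. The helper name `compat_collisions` is hypothetical, and NFKC is used here as a stand-in for the "compatibles and confusables" relation, which is an assumption rather than a settled design.

```julia
using Unicode

# Hypothetical helper: after canonical normalization (NFD), report groups of
# identifiers that are still distinct but collide under compatibility
# normalization (NFKC). A parser could warn about each group it returns.
function compat_collisions(names)
    groups = Dict{String,Vector{String}}()
    canonical = unique([Unicode.normalize(n, :NFD) for n in names])
    for name in canonical
        key = Unicode.normalize(name, :NFKC)
        push!(get!(groups, key, String[]), name)
    end
    return [g for g in values(groups) if length(g) > 1]
end

# χ² and χ2 collide; the two spellings of é merge silently; x is unaffected.
ids = ["\u03c7\u00b2", "\u03c72", "e\u0301", "\u00e9", "x"]
for group in compat_collisions(ids)
    println("warning: confusable identifiers: ", join(group, ", "))
end
```

The identifiers stay distinct in this sketch; only the warning is derived from the NFKC key.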

ivarne commented 10 years ago

@StefanKarpinski Good summary, but I think you have reached the wrong conclusion.

I was once challenged to find out why 10l would compare unequal to 101 in a C program (it was more elaborate), but because of the font I could not find the bug.

My preference would definitely be to make Julia consider all possibly ambiguous characters equal, and give a warning/error if someone uses identifiers that are considered equal because of rules 2 and 3. I do not read Unicode codepoints, and I do not have a different word for ﬃ and ffi, and I can't even see the difference when I am focused on logic. To me programming is about expressing ideas, and using both ﬃ and ffi as different variables in the same scope would be the worst offence against any code style guide.

StefanKarpinski commented 10 years ago

Well, that's why it should warn. Whether it considers them the same or different is somewhat irrelevant when it causes a warning. I guess one benefit of considering such things the same rather than keeping them different is ease of implementation: if the analysis is done at the file level, you can canonicalize an entire source file and warn if two "confusable" identifiers are used in the same source file and then hand the canonicalized program off to the rest of the parsing process without worrying any further. Then again, you can do the same without considering them the same by doing the confusion warning at the same step but leaving confusable identifiers different.

stevengj commented 10 years ago

As a practical matter, it is far easier to implement and explain canonicalization to NFKC, taking advantage of the existing standard and utf8proc, than it would be to implement and document our own nonstandard normalization. (There are a lot of codepoints we'd have to argue over.)

We can also certainly issue a warning whenever a file contains identifiers that are distinct from their canonicalized versions. (But I think it would be an unfriendly practice to issue a warning instead of canonicalizing.)

JeffBezanson commented 10 years ago

It seems unfortunate to me to canonicalize distinct characters that unicode provides specifically for their use in mathematics.

Should we use a different normalization, maybe NFD, for string literals?

stevengj commented 10 years ago

I don't think string literals should be normalized at all by default, although we should provide functions to do normalization if that is desired. The user should be able to enter any Unicode string they want.

jiahao commented 10 years ago

+1 for what @stevengj said. There's something to be said for preserving user input as much as possible. (What if the user wants to implement a custom normalization, for example...)

nolta commented 10 years ago

Just to be perverse, let's say we normalize to NFKC, and Quaternions.jl gets renamed ℍ.jl. Then using ℍ would look for .julia/H/src/H.jl?
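The premise can be checked with the `Unicode` standard library in a current Julia (an assumption about the reader's setup; the variable name `dstruck` is illustrative). The sketch only shows what the normalization forms do to the character, not how package lookup would behave.

```julia
using Unicode

dstruck = "\u210d"   # DOUBLE-STRUCK CAPITAL H (ℍ)

# The canonical form NFC keeps the double-struck letter as it is ...
@show Unicode.normalize(dstruck, :NFC) == dstruck

# ... while the compatibility form NFKC folds it to a plain ASCII H.
@show Unicode.normalize(dstruck, :NFKC) == "H"
```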

StefanKarpinski commented 10 years ago

I've actually rampantly made the assumption that package names are ASCII largely because I think it's opening a whole can of worms to use non-ASCII characters in package names.

JeffBezanson commented 10 years ago

I'm much more concerned about identifier names. I don't think merging ℍ and H makes sense for us.

StefanKarpinski commented 10 years ago

@stevengj – what about the χ² vs. χ2 issue? Your proposal silently treats them as the same, which strikes me as almost as bad as the (thus far hypothetical) problems we're trying to avoid here.

StefanKarpinski commented 10 years ago

Actually, no, it's worse – at least you can look at the contents of your source file and discover that two similar looking identifiers are actually the same. If χ² and χ2 are treated as the same identifier, there's no way to figure it out short of finding the obscure appendix of the Julia manual that explains this behavior. I find that unacceptable.

vtjnash commented 10 years ago

I would like to point out that (on my Mac), even the strong confusing symbols render noticeably differently. Swapping one for the other would maintain meaning, but lose a significant amount of typographic readability.

I agree that this normalization should only apply to symbols (variable names), and I think it should only apply to Indistinguishables. Hopefully nobody tries to use X2, χ² and χ2 in their code, in much the same way as avoiding similar words (like I vs l) is a good idea.

JeffBezanson commented 10 years ago

Everyone agrees that you shouldn't use both Ill1I1 and Il1IlI as variable names, but nobody thinks a language should silently canonicalize them to the same thing.

StefanKarpinski commented 10 years ago

That seems to be what @stevengj is arguing for.

stevengj commented 10 years ago

Yes, I think Julia should canonicalize ℍ to H internally. You are free to use ℍ as a variable name if you want, you just aren't free to use it as a distinct variable from H. Why is this such a loss for the language?

Conceptually, this is quite a familiar thing. If I use a syntax-highlighting text editor, it might change the font of certain variables. No one thinks that this changes the meaning of the identifiers.

To ordinary programmers (as opposed to Unicode geeks), a µ is a μ. I shudder to think of trying to explain this distinction to my students. (In contrast, everyone understands that I and l and 1 are distinct characters even though they look similar.)

StefanKarpinski commented 10 years ago

That's not the part that's problematic. The problem is doing it silently. If you happen to have an editing environment where ℍ and H are obviously quite different, then it is completely surprising – in a way that's impossible to discover the cause of – that they are treated as the same identifier. That is not ok.

stevengj commented 10 years ago

I don't know that it's surprising. My reaction would be "Oh, it treats different fonts as the same identifier. I guess that makes sense." Because to ordinary people, ℍ and H are the "same character" in different fonts. (And if you're a Unicode nerd, you know about normalizations. But the vast majority of scientific programmers are not Unicode nerds.)

nalimilan commented 10 years ago

@stevengj I don't think that would be your reaction. You wouldn't even have considered that ℍ and H had anything to do with one another. Without a warning, you wouldn't even notice that two different identifiers are considered identical. See this potential example:

julia> ℍ = 2
[many complex lines of code]
julia> H = 1
julia> ℍ
1 # WTF?!

StefanKarpinski commented 10 years ago

Yes, exactly. That's really not ok. If there's a warning, then you know something bad is going on.

lindahua commented 10 years ago

In an editor/IDE/whatever you write your code in, you use the same font for all the code in the same window (you might change the font of course, but your changed font applies to every character in your working area). I would never expect the editor to use font A for this variable, while using font B for another. Therefore, I would expect the same name to appear exactly the same in my editor -- when they look different, they are different.

lindahua commented 10 years ago

Here is my two cents:

I've never encountered such problems in real coding practice, but I understand that this may become a concern in particular contexts. For such cases, I think a better way might be to provide tools to detect identifiers that might look strikingly similar and modify them with the code author's approval. Blindly treating two identifiers as the same thing just because they may look similar (e.g. H and ℍ) is, to me, a recipe for disastrous confusion.

That being said, if two characters always look the same and there are virtually no ways to distinguish them visually, it might be safe to canonicalize them. But we should be conservative about this.

toivoh commented 10 years ago

I wonder if Julia is really the first programming language to face such issues? I guess that many languages still stick to ASCII identifiers to be safe. I know that Java has Unicode identifiers, but my quick googling only turned up heated debates on whether to use Unicode identifiers at all.

jiahao commented 10 years ago

The Fortress programming language uses Unicode extensively, but even they have had absolutely nothing to say about normalization issues in the language specification (PDF). From what I can tell, one usually codes the symbols as ASCII identifiers rather than inputting them directly.

mathpup commented 10 years ago

The bold, italic, and sans-serif attributes in the mathematical variants of mu do not represent different fonts. In fact, each has a unique unicode code point.

On the other hand, the editor may substitute a character from a different font if the requested character is not available. To be specific, in Xcode I use Monaco in the editor. If I insert a Greek letter mu U+03BC, the editor actually uses the mu from Lucida Grande because that character is not available in Monaco.

On Sun, Jan 19, 2014 at 9:17 AM, Dahua Lin notifications@github.com wrote:

In an editor/IDE/what ever you write your code, you use the same font for all the codes in the same window (you might change the font of course, but your changed font applies to every character in your working area). I would never expect the editor use font A for this variable, while using font B for another. Therefore, I would expect the same name to appear exactly the same in my editor -- when they look different, they are different.

JeffBezanson commented 10 years ago

Yes, that's right. Unicode doesn't care about fonts; it provides differently-styled letters precisely because they are used as distinct symbols in mathematics. If it weren't for that use case (which is our use case), those characters wouldn't exist.

Many in the lisp/scheme world argue for case-insensitive identifiers because to them letter case is just a personal style choice, with the same character underneath. For example some people like to name functions in all-uppercase where they are defined and otherwise use lowercase. However, those people are wrong.

mathpup commented 10 years ago

Just to be clear, the mathematical variants of mu (bold, italic, sans-serif) are distinct Unicode code points and can be present in the same font. On the other hand, a code editor might borrow a character from another font if it is not available in the requested font. Xcode does this.

By the way, I looked more carefully at micro versus mu, and in Xcode's default font Menlo, they appear to be identical. I don't mean similar. I mean that at 288 point on the screen, overlaid on top of each other, they look identical.

stevengj commented 10 years ago

I understand that different codepoints have nothing to do with choosing different fonts. I just think most people will perceive them as different fonts of the "same character".

JeffBezanson commented 10 years ago

Whatever various standard normalizations might say, I think there is a real distinction between characters that are truly identical (like the two mus), and characters that are the same abstract letter but intended to look quite different, like H vs. double-struck H. "Same character" is of course subjective and depends on the application, but in math double-struck letters are decidedly different symbols with different meanings.

mathpup commented 10 years ago

One reasonable solution would be to restrict the set of characters in identifiers to a documented subset of Unicode. Allowing arbitrary characters in identifiers seems to be inviting problems.

IainNZ commented 10 years ago

That would need to be a fairly large subset though - what's the point of Unicode identifiers if you don't support the various languages of the world?

mathpup commented 10 years ago

I would prefer restricting identifiers to ASCII characters.
