Consider unicode identifiers support

MaximSokolov commented 8 years ago

There is already support in TextMate, particularly in language-babel. I've tested this regex and it seems to work fine (see it on Lightshow): [$_\\p{L}\\p{Nl}][$\\p{L}\\p{Nl}\\p{Mn}\\p{Mc}\\p{Nd}\\p{Pc}\\x{200C}\\x{200D}]*

function a () { }
function foo123 () { }
function $ () { }
function $$abc$$ () { }
function FOO () { }
function _foo_ () { }
function $foo_foo$ () { }

function π() {  }
function ლ_ಠ益ಠ_ლ() {}
function абв() {}
function d‿d() {} // \p{Pc}
function Oo̶O()  {} // \p{Mn}
function _ැ_() {} // \p{Mc}
function می‌خواهم() {} // \x{200C}
function _ണ്‍_() {} // \x{200D}, valid in ECMAScript 6/Unicode 8.0.0, but not in ES3
function _۴_() {} // \p{Nd}
function Ⅳ() {} // \p{Nl}


\p{L} matches any kind of letter from any language
\p{Nl} matches a number that looks like a letter, such as a Roman numeral
\p{Mn} matches a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.)
\p{Mc} matches a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages)
\p{Nd} matches a digit zero through nine in any script except ideographic scripts
\p{Pc} matches a punctuation character such as an underscore that connects words
\x{200C} zero width non-joiner
\x{200D} zero width joiner
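
As a rough sketch of how these classes combine, the pattern can be tried in plain JavaScript using Unicode property escapes (the ES2018 `u` flag). The name `looksLikeIdentifier` is made up for illustration, and `\x{200C}` / `\x{200D}` are written as `\u200C` / `\u200D` because JavaScript regexes have no `\x{...}` syntax. It ignores reserved words and is not the grammar's actual rule.

```javascript
// Hypothetical helper: approximates the regex from this issue in JavaScript syntax.
// Start:    $, _, letters (L), letter-like numbers (Nl).
// Continue: $, letters, Nl, marks (Mn, Mc), digits (Nd), connectors (Pc), ZWNJ, ZWJ.
const IDENTIFIER =
  /^[$_\p{L}\p{Nl}][$\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\u200C\u200D]*$/u;

function looksLikeIdentifier(name) {
  return IDENTIFIER.test(name);
}

console.log(looksLikeIdentifier("foo123"));   // true
console.log(looksLikeIdentifier("\u03C0"));   // true  (π, category L)
console.log(looksLikeIdentifier("d\u203Fd")); // true  (‿ is category Pc)
console.log(looksLikeIdentifier("\u2163"));   // true  (Ⅳ, category Nl)
console.log(looksLikeIdentifier("1abc"));     // false (Nd cannot start a name)
```

The underscore needs no explicit entry in the second class because it already belongs to \p{Pc}.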

Refs:
JavaScript variable name validator
Unicode Character Categories
What characters are valid for JavaScript variable names? [Stack Overflow]

Alhadis commented 7 years ago

I was abusing Cyrillic to cheat a reserved identifier, and I noticed the same thing. Note how const highlighting continues after the bogus e:


const nеw = isNew ? "new " : "";

I'm probably the one who should address this, which I'll do once:

File-Icons and its spec-runner have been fixed
Less and Sass's keywords are hooked up to our slick new CSS grammar
I've exorcised NPM from @50Wliu's computer

winstliu commented 7 years ago

My problem with this is that I don't want to overcomplicate the grammar. If [\w$] supports all of those though, then I am +100 since all that would require is adding error highlighting for function names that begin with a number.

Alhadis commented 7 years ago

Surprise: Oniguruma has a Unicode-aware \w character class, but GitHub's PCRE doesn't (since they're running it in ASCII mode for performance reasons).

While I can't see this causing breakage, I'd prefer this grammar's highlighting remain consistent wherever it's used...
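
To make the inconsistency concrete, here's a small sketch in JavaScript, whose own \w is ASCII-only and so behaves roughly like the ASCII-mode case. The `unicodeWord` class is an assumed approximation of a Unicode-aware \w (letters, marks, decimal digits, connector punctuation), not Oniguruma's exact definition.

```javascript
// "new" spelled with U+0435 (Cyrillic е) in place of the Latin e.
const name = "n\u0435w";

// ASCII-only \w: the match stops at the first non-ASCII character.
const asciiWord = /^[\w$]+/;

// Assumed approximation of a Unicode-aware \w.
const unicodeWord = /^[\p{L}\p{M}\p{Nd}\p{Pc}$]+/u;

console.log(asciiWord.exec(name)[0]);            // "n"
console.log(unicodeWord.exec(name)[0] === name); // true
```

The same source text tokenizes differently depending on which engine runs the grammar, which is the consistency concern here.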

Alhadis commented 7 years ago

On an interesting side-note, CoffeeScript's interpretation of valid identifiers differs from JavaScript's. I have the following snippets in my snippets.cson file, completely unquoted:

"Symbol Snippets":
    €:  {prefix: "C=",  body: "€"}
    ″:  {prefix: ",,",  body: "″"}
    ™:  {prefix: "TM",  body: "™"}
    ©:  {prefix: "(C)", body: "©"}
    ©2: {prefix: "(c)", body: "©"}
    ®:  {prefix: "(R)", body: "®"}
    ®2: {prefix: "(r)", body: "®"}
    ×:  {prefix: "x",   body: "×"}
    →:  {prefix: "->",  body: "→"}
    ←:  {prefix: "<-",  body: "←"}
    ⇒:  {prefix: "=>",  body: "⇒"}
    ⇐:  {prefix: "<=",  body: "⇐"}

The keys don't receive highlighting, but CoffeeScript allows them anyway. Unfortunately, it neglects to quote them for JS... at least on their site's REPL. The output on the right breaks if parsed as JavaScript.

Just a reminder to avoid CoffeeScript whenever possible. =)
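
For what it's worth, the breakage with unquoted keys is easy to reproduce in plain JavaScript. This is only a sketch: `quoteKeyIfNeeded` and `parses` are hypothetical helpers, not anything CoffeeScript provides, and the name check ignores \u escapes in source text.

```javascript
// Approximates an ECMAScript IdentifierName using the ID_Start / ID_Continue properties.
const IDENTIFIER_NAME = /^[$_\p{ID_Start}][$\u200C\u200D\p{ID_Continue}]*$/u;

// Hypothetical helper: emit a key bare when it is a valid name, quoted otherwise.
function quoteKeyIfNeeded(key) {
  return IDENTIFIER_NAME.test(key) ? key : JSON.stringify(key);
}

// Returns true if `source` parses as a JavaScript function body.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

const key = "\u20AC"; // €
console.log(parses(`return {${key}: 1};`));                   // false
console.log(parses(`return {${quoteKeyIfNeeded(key)}: 1};`)); // true
console.log(quoteKeyIfNeeded("prefix"));                      // prefix
```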

atom / language-javascript

Consider unicode identifiers support #414