eclipse-archived / ceylon

The Ceylon compiler, language module, and command line tools
http://ceylon-lang.org
Apache License 2.0

node.js doesn’t like non-BMP identifiers #2941

Open CeylonMigrationBot opened 9 years ago

CeylonMigrationBot commented 9 years ago

[@lucaswerkmeister]

[]𐐨=[];

(that’s U+10428 DESERET SMALL LETTER LONG I)

The compiler passes it through without problems, but node.js can’t handle it:

…/tmp-1.1.0.js:12

function 𐐨(){return $valinit$$2();}
         ^

SyntaxError: Unexpected token ILLEGAL
    at Module._compile (module.js:439:25)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at [eval]:1:85
    at Object.<anonymous> ([eval]-wrapper:6:22)
    at Module._compile (module.js:456:26)
    at evalScript (node.js:532:25)
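A quick way to check whether a given engine accepts such an identifier, without going through the Ceylon compiler, is to ask it to parse a tiny function declaration. This is only a sketch: `parses` is a hypothetical helper, not part of the Ceylon toolchain, and the result for the non-BMP name depends on the engine version.

```javascript
// Hypothetical helper: reports whether this engine can parse `source`.
// new Function() only compiles the text; the body is never executed.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) {
      return false;
    }
    throw e; // anything else is unexpected, so let it surface
  }
}

// The string escape \uD801\uDC28 puts the raw character U+10428 into the
// source text handed to the parser, mirroring the generated tmp-1.1.0.js.
var bmpAccepted = parses("function a(){}");
var nonBmpAccepted = parses("function \uD801\uDC28(){}");

console.log("BMP identifier accepted:", bmpAccepted);
console.log("non-BMP identifier accepted:", nonBmpAccepted);
```

On the node.js version from the stack trace above, the second probe would presumably report `false`, matching the `Unexpected token ILLEGAL` error.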

Found after a suggestion by @tombentley in #2067, though the issue is much simpler here. (When this bug is fixed, the JS model loader should be tested against #2067 as well.)

  1. New personal bug golf record, just 7 characters! :fireworks:
  2. I am in awe that the first lowercase letter of the Mormon alphabet is decimal 66600. Someone at the Unicode Consortium has a great sense of humor.

[Migrated from ceylon/ceylon-js#510]

CeylonMigrationBot commented 9 years ago

[@lucaswerkmeister] Node, Firefox, and Chromium all reject both the identifier 𐐨 (the character itself) and \uD801\uDC28 (its UTF-16 representation). I’m not sure whether this complies with the ECMAScript specification.

Section 7.6, “Identifier Names and Identifiers”, of ECMA-262 references the “Identifiers” section of chapter 5 of the Unicode standard (5.15 Identifiers, page 227), which itself points to Unicode Standard Annex #31, “Unicode Identifier and Pattern Syntax”. That document does not mention planes or encodings, but does speak of “code points”, which the Glossary defines as “Any value […] from 0 to 10FFFF₁₆”.

However, ECMA-262 7.6 speaks of “characters”, not “Unicode characters”, which per section 6 means “a single 16-bit unit of text”. Furthermore, the section explicitly names Unicode 3.0 as the reference version of the standard. It appears that while the concept of multiple Unicode planes was introduced in Unicode 2.0, blocks outside the Basic Multilingual Plane were only added starting with version 3.1. Therefore, even if an implementation could read Unicode characters that span multiple code units, it would not be required to know whether any of these characters have a Letter category, and could still reject them in an identifier.
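For reference, the relationship between U+10428 and the two 16-bit units \uD801\uDC28 is plain UTF-16 surrogate arithmetic. The sketch below works it out; `toSurrogatePair` is a name made up for this illustration.

```javascript
// Hypothetical helper: splits a supplementary-plane code point into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary-plane code point");
  }
  var offset = codePoint - 0x10000;        // 20 significant bits
  var high = 0xD800 + (offset >> 10);      // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);     // bottom 10 bits
  return [high, low];
}

var pair = toSurrogatePair(0x10428);
console.log(pair.map(function (u) {
  return u.toString(16).toUpperCase();
}).join(" "));                             // D801 DC28

// To a JavaScript string the character is two "characters" in the
// 16-bit sense quoted from ECMA-262 section 6:
console.log("\uD801\uDC28".length);        // 2
```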

The upshot of all this is that these names apparently need to be encoded with something like $u$10428.

CeylonMigrationBot commented 9 years ago

[@chochos] Hm, that $u$10428 doesn't look all that bad...

CeylonMigrationBot commented 9 years ago

[@lucaswerkmeister] Well, that’s only a single-character example. It’ll probably have to be uglier for multiple characters in order to be unambiguous between “U+1234 U+56789” and “U+12345 U+6789”. Perhaps $u$1234$56789.
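The ambiguity is easy to reproduce. In the sketch below, `hexOf` is a hypothetical helper that joins the hex values of a code point sequence with a chosen separator. Without a separator the two sequences from the comment collapse into the same name, and with `$` between them they stay distinct.

```javascript
// Hypothetical helper: hex digits of each code point, joined by `separator`.
function hexOf(codePoints, separator) {
  return codePoints.map(function (cp) {
    return cp.toString(16);
  }).join(separator);
}

var first = [0x1234, 0x56789];
var second = [0x12345, 0x6789];

// No separator: both sequences produce the same escaped name.
console.log("$u$" + hexOf(first, ""));   // $u$123456789
console.log("$u$" + hexOf(second, ""));  // $u$123456789

// With "$" between code points the names differ.
console.log("$u$" + hexOf(first, "$"));  // $u$1234$56789
console.log("$u$" + hexOf(second, "$")); // $u$12345$6789
```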

CeylonMigrationBot commented 9 years ago

[@chochos] oh of course it will be ugly. Actually maybe $u123456$u56789 would be better. Or the hex value to make it shorter.

CeylonMigrationBot commented 9 years ago

[@lucaswerkmeister] Those were supposed to be hex values. (Probably shouldn’t have used values above 10FFFF.)

And I’m not familiar with the other escapings the JS runtime has… if $u is unambiguous, sure, that actually looks pretty nice.
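A minimal sketch of what a `$u`-prefixed hex escaping could look like follows. It is not the compiler's actual implementation: `escapeIdentifier` is a hypothetical name, and the zero-padding to six hex digits is an assumption added here so that an ordinary letter such as `a` following the escape cannot be read as one more hex digit. The sketch also assumes that source identifiers never contain `$u` themselves.

```javascript
// Hypothetical sketch, not the Ceylon JS backend's real code: replaces each
// non-BMP character (a valid surrogate pair) with "$u" plus six hex digits.
function escapeIdentifier(name) {
  var out = "";
  for (var i = 0; i < name.length; i++) {
    var unit = name.charCodeAt(i);
    var isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    if (isHigh && i + 1 < name.length) {
      var next = name.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Recombine the pair into the original code point.
        var cp = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
        var hex = cp.toString(16).toUpperCase();
        out += "$u" + ("000000" + hex).slice(-6); // assumption: fixed width
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }
    out += name.charAt(i); // BMP characters and lone surrogates pass through
  }
  return out;
}

console.log(escapeIdentifier("\uD801\uDC28"));   // $u010428
console.log(escapeIdentifier("x\uD801\uDC28a")); // x$u010428a
```

With variable-length hex instead, `𐐨a` would become `$u10428a`, which reads the same as an escape for U+10428A, so some fixed width or terminator seems necessary.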

CeylonMigrationBot commented 9 years ago

[@chochos] So there's an initial implementation, but I'm sure it requires some more thorough testing: both toplevel and nested values, types, functions, parameters, aliases, etc.