chharvey / counterpoint

A robust programming language.
GNU Affero General Public License v3.0

Tokenize and Cook Unicode Identifiers #33

Closed chharvey closed 4 years ago

chharvey commented 4 years ago

Lexical grammar:

Word ::= [A-Za-z_] [A-Za-z0-9_]* | "`" [^`#x03]* "`"

Word Value:

WV(Word ::= ([A-Za-z_] [A-Za-z0-9_]* | "`" [^`#x03]* "`") - Keyword)
    := /* TO BE DETERMINED */

Identifiers may be enclosed in back-ticks (` ` U+0060 GRAVE ACCENT) to allow non-alphanumeric Unicode characters.

let `españa` = 'Spanish for “Spain”';
let `svaret_på_den_ultimata_frågan` = 42; % Swedish for “the answer to the ultimate question”

Any character except U+0060 GRAVE ACCENT and U+0003 END OF TEXT is allowed inside the delimiters of a Unicode identifier name. These forbidden characters cannot be escaped. In fact, escape sequences of any kind are not possible: `\u{24}αβγ` and `$αβγ` are two different identifier names, and `\t` and ` ` are also different (the latter of these contains a literal tab character U+0009). Although the grammar permits whitespace, Unicode identifier names should not contain it.
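As a rough illustration of the lexical rule above, here is a minimal Python sketch that translates the `Word` production into a regular expression. The names `WORD` and `match_word` are hypothetical and not part of Counterpoint; keyword exclusion and the still-undetermined Word Value are deliberately left out.

```python
import re

# Assumed regex translation of the issue's `Word` production:
#   basic form:   [A-Za-z_] [A-Za-z0-9_]*
#   Unicode form: "`" [^`#x03]* "`"
# No escape handling is attempted, because the issue says escape
# sequences do not exist inside Unicode identifiers.
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|`[^`\x03]*`")


def match_word(source: str, pos: int = 0):
    """Return the Word lexeme starting at `pos`, or None if nothing matches."""
    m = WORD.match(source, pos)
    return m.group(0) if m else None


# The backslash sequence is kept literally, so these lexemes differ.
escaped = match_word(r"`\u{24}αβγ`")
literal = match_word("`$αβγ`")
print(escaped == literal)
```

Because the pattern never interprets a backslash, the two lexemes compared at the end remain distinct, which matches the rule that `\u{24}αβγ` and `$αβγ` name different identifiers. Under the same pattern, an empty pair of back-ticks also matches.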

Unicode identifiers may also contain no characters: the token `` (two adjacent back-ticks) is a valid identifier.

Identifiers that are declared as Unicode identifiers must always be referenced as such. A ReferenceError should be thrown when attempting to reference an identifier that has not been declared (see #14), even if a declared identifier differs from it only by the back-tick delimiters.

let `foo` = 16;
foo;            % ReferenceError (foo is not declared)
let bar = 24;
`bar`;          % ReferenceError (`bar` is not declared)

This means that the identifiers foo and `foo` can refer to different values.

let foo = 8;
let `foo` = 16; % allowed
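
The declaration and reference rules above could be modeled roughly as follows. This is only a sketch: it assumes identifiers are keyed by their raw lexeme, back-ticks included, which is one possible choice while the Word Value is still to be determined. `Scope` and `UndeclaredIdentifierError` are hypothetical names; the latter stands in for the ReferenceError mentioned in the issue.

```python
class UndeclaredIdentifierError(Exception):
    """Hypothetical stand-in for the ReferenceError described in the issue."""


class Scope:
    """Toy symbol table keyed by the raw lexeme (an assumption, see lead-in)."""

    def __init__(self):
        self._values = {}

    def declare(self, lexeme, value):
        # "foo" and "`foo`" are different strings, so they get separate slots.
        self._values[lexeme] = value

    def lookup(self, lexeme):
        if lexeme not in self._values:
            raise UndeclaredIdentifierError(f"{lexeme} is not declared")
        return self._values[lexeme]


scope = Scope()
scope.declare("`foo`", 16)
try:
    scope.lookup("foo")  # declared only with back-ticks
except UndeclaredIdentifierError as err:
    print(err)

scope.declare("foo", 8)  # allowed: a separate identifier
print(scope.lookup("foo"), scope.lookup("`foo`"))
```

Keying on the raw lexeme keeps foo and `foo` apart without any extra bookkeeping. A different Word Value definition could achieve the same separation in another way.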