anachronauts / jeff65

a compiler targeting the Commodore 64 with gold-syntax
GNU General Public License v3.0

case-sensitivity for syntax #8

Open woodrowbarlow opened 6 years ago

woodrowbarlow commented 6 years ago

So, this is a little bit of a weird suggestion. What if we make our language case-insensitive outside of string literals?

I do have a pipe dream of porting to the Commodore, and you know the Commodore is weird about case. Plus the shift key is physically difficult to press in combination with other keys when you're trying to type at any kind of speed.

jdpage commented 6 years ago

In principle I'm okay with this, but I have concerns about locale handling, such as the Turkish dotted and dotless i.

jdpage commented 5 years ago

Okay, after my latest round of getting my hands dirty with Unicode for whatever reason, I'd suggest the following:

Two identifiers are considered equivalent if they yield the same sequence of code points under the NFKC_CaseFold transformation. Roughly, this performs Compatibility Decomposition, which breaks apart precomposed characters and does things like turning superscripts into regular digits, followed by Canonical Composition. It also applies Unicode case folding, which generally maps uppercase to lowercase, except where it doesn't, in such a way as to make case-insensitive comparisons as tractable as possible.
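A minimal sketch of that equivalence check in Python, for illustration only. Python's standard library has no direct NFKC_CaseFold function, so this approximates it with `unicodedata.normalize` and `str.casefold`; unlike the real transformation, it does not strip default-ignorable code points. The names `fold_key` and `same_identifier` are made up for this example.

```python
import unicodedata


def fold_key(name: str) -> str:
    """Approximate NFKC_CaseFold using only the standard library.

    Assumption: NFKC, then full case folding, then NFKC again (folding can
    produce text that is no longer normalized). The real NFKC_CaseFold also
    removes default-ignorable code points, which this sketch does not.
    """
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)


def same_identifier(a: str, b: str) -> bool:
    """Two identifiers are equivalent if their fold keys match."""
    return fold_key(a) == fold_key(b)


if __name__ == "__main__":
    print(same_identifier("foobar", "fOoBaR"))        # case differs only
    print(same_identifier("x\u00b2", "X2"))           # superscript two vs. digit
    print(same_identifier("caf\u00e9", "cafe\u0301"))  # precomposed vs. combining
    print(same_identifier("foo", "bar"))
```

The folding used here is locale-independent, so it does not apply the Turkish-specific mappings between dotted and dotless i.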

However, because defining a variable as foobar and then referring to it as fOoBaR is gross, identifiers are normalized to NFC (that's Canonical Decomposition followed by Canonical Composition) when they are defined. This is the canonical form of the identifier. If the identifier is referred to by a name which is NFKC_CaseFold-equivalent but not NFC-equivalent, a lint is emitted by default. The programmer may disable this lint or turn it into an error.
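The lint rule could be sketched like this in Python. This is an illustration rather than anything from the jeff65 code base: `fold_key`, `canonical`, and `check_reference` are hypothetical names, and `fold_key` only approximates NFKC_CaseFold with what the standard library offers.

```python
import unicodedata


def fold_key(name: str) -> str:
    # Approximation of NFKC_CaseFold (assumption: no default-ignorable removal).
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)


def canonical(name: str) -> str:
    # The canonical form stored when an identifier is defined.
    return unicodedata.normalize("NFC", name)


def check_reference(defined: str, used: str) -> str:
    """Classify a reference to `defined` spelled as `used`.

    Returns "unresolved" if the names are not equivalent at all,
    "ok" if they are NFC-equivalent, and "lint" if they match only
    after case folding / compatibility normalization.
    """
    if fold_key(defined) != fold_key(used):
        return "unresolved"
    if canonical(defined) == canonical(used):
        return "ok"
    return "lint"


if __name__ == "__main__":
    for used in ("foobar", "fOoBaR", "baz"):
        print(used, check_reference("foobar", used))
```

Whether a "lint" result is ignored, reported, or promoted to an error would then be a matter of compiler configuration.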

If a symbol is shadowed (by NFKC_CaseFold-equivalence), but the shadowing symbol's identifier is not NFC-equivalent to the shadowed symbol's identifier, and the shadowing symbol is accessed using an identifier which is NFC-equivalent to the shadowed symbol's identifier, then a warning is emitted, rather than a lint.

A clarifying example: if you define a global FOOBAR, and then a local foobar, then the global is shadowed (because they're NFKC_CaseFold-equivalent). If you reference FOOBAR, it would reference the local, rather than the global; since this might be surprising to people who are used to case-sensitive languages, a warning is emitted to let the programmer know that something is up. However, if you reference fOoBaR, it's simply a lint (as above).
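The FOOBAR example can be sketched as a small scope lookup in Python. Everything here is an assumption made for illustration: scopes are plain dicts keyed by an approximate NFKC_CaseFold key, and `define` and `resolve` are hypothetical helpers, not part of any existing compiler.

```python
import unicodedata


def fold_key(name: str) -> str:
    # Approximation of NFKC_CaseFold (assumption: no default-ignorable removal).
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)


def nfc(name: str) -> str:
    return unicodedata.normalize("NFC", name)


def define(scope: dict, name: str) -> None:
    # Store the canonical (NFC) spelling under the case-folded key.
    scope[fold_key(name)] = nfc(name)


def resolve(used: str, scopes: list):
    """Look `used` up in `scopes` (innermost first).

    Returns (canonical_name, diagnostic), where diagnostic is one of
    "ok", "lint", "warning", or "unresolved".
    """
    key, spelled = fold_key(used), nfc(used)
    for depth, scope in enumerate(scopes):
        if key not in scope:
            continue
        found = scope[key]
        if spelled == found:
            return found, "ok"
        # The spelling differs from the symbol we resolved to. If it matches
        # the exact spelling of a symbol this one shadows, warn instead of lint.
        for outer in scopes[depth + 1:]:
            if outer.get(key) == spelled:
                return found, "warning"
        return found, "lint"
    return None, "unresolved"


if __name__ == "__main__":
    global_scope, local_scope = {}, {}
    define(global_scope, "FOOBAR")
    define(local_scope, "foobar")
    for used in ("foobar", "fOoBaR", "FOOBAR"):
        print(used, resolve(used, [local_scope, global_scope]))
```

In this sketch every spelling resolves to the local foobar; only the diagnostic changes depending on how the reference was written.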

See http://unicode.org/reports/tr15/ to learn about NFC, NFD, NFKC, and NFKD; it's fairly readable. See http://www.unicode.org/versions/Unicode12.0.0/ch05.pdf#G21180 to learn about case folding; bring Advil.

This would allow a case-insensitive C64 implementation to be compatible with source files which were restricted to PETSCII-mappable characters, while allowing users of the cross-compiler to name things in whatever language they felt like.

woodrowbarlow commented 5 years ago

that is really well thought out and well researched. 👍