chharvey / counterpoint

A robust programming language.
GNU Affero General Public License v3.0

Normalize Line Endings #16

Closed chharvey closed 4 years ago

chharvey commented 4 years ago

Improve handling of line endings. See XML's end-of-line handling for an example.

Before tokenizing, every CRLF and every lone CR (that is, every match of %\u000d\u000a|\u000d%) should be replaced with a single LF, U+000A. Because the replacement must apply everywhere, including inside comments and strings, it has to run before tokenization.
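A minimal sketch of this pre-tokenization step, assuming the source text is already available as a string. The function name `normalizeLineEndings` is hypothetical and not taken from the counterpoint codebase.

```typescript
// Hypothetical helper: the name and signature are illustrative only.
// Replaces every CRLF and every lone CR with a single LF (U+000A).
function normalizeLineEndings(source: string): string {
    // `\r\n?` consumes a CR and, if one follows, its LF,
    // so it is equivalent to the alternation \u000d\u000a|\u000d.
    return source.replace(/\r\n?/g, "\n");
}

// Mixed line endings: CRLF, lone CR, and LF.
const mixed: string = "one\r\ntwo\rthree\nfour";
console.log(JSON.stringify(normalizeLineEndings(mixed)));
```

Running the replacement over the whole input, rather than per token, is what lets the lexical grammar mention only #x0A.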

This greatly simplifies the lexical grammar (#1) and resolves line ending issues inside comments (#4) and string literals (#5, #7).

This change does not affect programmers who use LF line endings. For programmers who use CRLF line endings, this step will transform the source text that is input into the compiler, but will not affect the actual source file, and line and column numbers will be preserved. Important: If programmers use any CR characters not followed by LF in source code, and their editor does not render those CR characters as newlines, then this step will change the line and column numbers the compiler reports relative to what the editor shows.
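The effect on positions can be illustrated with a small sketch. The helpers `positionOf` and `toLF` are hypothetical, count only LF as a line break, and measure columns in UTF-16 code units, which is an assumption rather than something the issue specifies.

```typescript
interface Position { line: number; column: number; }

// Hypothetical helper: 1-based line and column of `index`,
// treating only LF as a line break (as an LF-only editor or lexer would).
function positionOf(text: string, index: number): Position {
    let line = 1;
    let column = 1;
    for (let i = 0; i < index; i++) {
        if (text[i] === "\n") {
            line++;
            column = 1;
        } else {
            column++;
        }
    }
    return { line, column };
}

// Hypothetical helper: the normalization step described above.
const toLF = (s: string): string => s.replace(/\r\n?/g, "\n");

const crlf = "ab\r\ncd";
const loneCR = "ab\rcd";

// CRLF source: "c" is at line 2, column 1 both before and after normalization.
console.log(positionOf(crlf, crlf.indexOf("c")));
console.log(positionOf(toLF(crlf), toLF(crlf).indexOf("c")));

// Lone CR: "c" moves from line 1, column 4 to line 2, column 1.
console.log(positionOf(loneCR, loneCR.indexOf("c")));
console.log(positionOf(toLF(loneCR), toLF(loneCR).indexOf("c")));
```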

Note that the escape sequences \r and \u{d} will still produce a literal CR character when cooked.
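As a rough illustration of why the escapes are unaffected: normalization runs on the raw source, where `\r` is still the two characters backslash and `r`, so there is no CR for it to replace. The sketch below uses a deliberately simplified, hypothetical cooker (`cookSketch`) that handles only `\r` and `\u{...}`; it is not the String Literal Value algorithm itself.

```typescript
// Hypothetical helper: the normalization step, applied to raw source text.
function normalizeRaw(source: string): string {
    return source.replace(/\r\n?/g, "\n");
}

// Simplified, hypothetical cooker: handles only `\r` and `\u{hex}`.
// The real algorithm covers more escapes and edge cases.
function cookSketch(body: string): string {
    return body.replace(
        /\\(?:r|u\{([0-9a-fA-F]+)\})/g,
        (_match: string, hex?: string): string =>
            hex !== undefined ? String.fromCodePoint(parseInt(hex, 16)) : "\r",
    );
}

// Raw string body: x, the escape \r, y, a literal CRLF, z, the escape \u{d}.
const rawBody: string = "x\\ry\r\nz\\u{d}";
const cooked: string = cookSketch(normalizeRaw(rawBody));

// The literal CRLF became LF; both escapes still cook to U+000D.
console.log(JSON.stringify(cooked));
```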

Updated lexical grammar rules:

Whitespace ::= Whitespace? (#x20 | #x09 | #x0A)

StringLiteralChars ::=
    [^'\#x03]               StringLiteralChars?   |
    "\" StringLiteralEscape StringLiteralChars?   |
    "\u"     ([^'{#x03]     StringLiteralChars?)? |
    [-remove-] "\" #x0D ([^'#x0A#x03]  StringLiteralChars?)?

LineContinuation ::= #x0A
NonEscapeChar    ::= [^'\stnru#x0A#x03]

Remove these rules from the String Literal Value algorithm:

[-remove-] SVL(StringLiteralChars ::= "\" #x0D)
[-remove-]  is 0x0D
[-remove-] SVL(StringLiteralChars ::= "\" #x0D [^'#x0A#x03])
[-remove-]  is 0x0D followed by {@link Util.utf16Encoding|UTF16Encoding}(code point of that character)
[-remove-] SVL(StringLiteralChars ::= "\" #x0D [^'#x0A#x03] StringLiteralChars)
[-remove-]  is 0x0D followed by {@link Util.utf16Encoding|UTF16Encoding}(code point of that character) followed by SVL(StringLiteralChars)

[-remove-] SVL(LineContinuation ::= #x0D #x0A)
[-remove-]  is 0x20
chharvey commented 4 years ago

copied into v0.1.0 via 8b65dd888b4efc01a31a08651c9392fa517ccad9