chharvey / counterpoint

A robust programming language.
GNU Affero General Public License v3.0

Number Tokens, Positive and Negative #17

Closed chharvey closed 4 years ago

chharvey commented 4 years ago

Allow number tokens to be immediately preceded by a positive sign + or negative sign -.

Problem

Previously, an input such as -42 would be lexed as two tokens:

[
    { type: PUNCTUATOR , value: '-'  },
    { type: NUMBER     , value: '42' },
]

However, the cooked value of a number token must be determined before parsing, because of future constraints on signed integers (see Background below). We therefore cannot rely on a unary operator in the syntactic grammar. This issue prepares for future versions by lifting those constraints early.

Solution

Allow an optional leading sign on number tokens in the lexical grammar:

Number ::= ("+" | "-")? IntegerLiteral

Update the Mathematical Value algorithm:

MV(Number ::= "+" IntegerLiteral)
    is MV(IntegerLiteral)
MV(Number ::= "-" IntegerLiteral)
    is -1 * MV(IntegerLiteral)
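
The two rules above can be sketched in TypeScript as follows. This is only an illustration, not the project's implementation: the function name `mathematicalValue` is hypothetical, and IntegerLiteral is assumed to be a plain run of decimal digits.

```typescript
// Hypothetical helper (not from the project's source).
// Computes the mathematical value (MV) of the source text of a token matching
//     Number ::= ("+" | "-")? IntegerLiteral
// Assumption: IntegerLiteral is a non-empty run of decimal digits.
function mathematicalValue(source: string): number {
    const match = /^([+-])?([0-9]+)$/.exec(source);
    if (match === null) {
        throw new Error(`Not a Number token: ${source}`);
    }
    // An absent sign falls back to the empty string.
    const [, sign = '', digits = ''] = match;
    const magnitude: number = parseInt(digits, 10);
    // MV(Number ::= "-" IntegerLiteral) is -1 * MV(IntegerLiteral)
    // MV(Number ::= "+" IntegerLiteral) is MV(IntegerLiteral)
    return sign === '-' ? -1 * magnitude : magnitude;
}
```

With this sketch, `mathematicalValue('+42')` yields 42 and `mathematicalValue('-42')` yields -42, so the sign is resolved from the token text alone, before any parsing happens.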

Impacts

Warning: This introduces a breaking change, specifically when a binary operator + or - appears directly before a number literal.

Before this change, 8+5 would be lexed as three tokens:

[
    { type: NUMBER     , value: '8' },
    { type: PUNCTUATOR , value: '+' },
    { type: NUMBER     , value: '5' },
]

(which happens to be a well-formed expression per the syntactic grammar).

But after the change, it will be lexed as two:

[
    { type: NUMBER, value:  '8' },
    { type: NUMBER, value: '+5' },
]

(which is no longer well-formed).

Therefore, to accommodate this change, whitespace must be inserted after a binary operator + or - that precedes a number literal. The expression 8+5 must be changed to either 8+ 5 or 8 + 5.
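
The effect of whitespace can be reproduced with a toy lexer. The sketch below is an assumption-laden illustration, not the project's lexer: the names `lex`, `Token`, and `isDigit` are hypothetical, only decimal digits and single-character punctuators are handled, and a sign is absorbed into a number only when a digit follows it immediately.

```typescript
// Hypothetical toy lexer (not from the project's source).
type TokenType = 'NUMBER' | 'PUNCTUATOR';
interface Token { type: TokenType; value: string; }

function isDigit(char: string): boolean {
    return char.length === 1 && char >= '0' && char <= '9';
}

function lex(source: string): Token[] {
    const tokens: Token[] = [];
    let i = 0;
    while (i < source.length) {
        const char = source.charAt(i);
        // Whitespace separates tokens and is otherwise skipped.
        if (char === ' ' || char === '\t' || char === '\n') { i++; continue; }
        // A sign followed directly by a digit starts a signed number token.
        const signed = (char === '+' || char === '-') && isDigit(source.charAt(i + 1));
        if (signed || isDigit(char)) {
            let end = signed ? i + 1 : i;
            while (end < source.length && isDigit(source.charAt(end))) end++;
            tokens.push({ type: 'NUMBER', value: source.slice(i, end) });
            i = end;
        } else {
            // Anything else is treated as a one-character punctuator.
            tokens.push({ type: 'PUNCTUATOR', value: char });
            i++;
        }
    }
    return tokens;
}
```

Under these assumptions, `lex('8+5')` produces the two NUMBER tokens '8' and '+5', while `lex('8+ 5')` and `lex('8 + 5')` both produce NUMBER '8', PUNCTUATOR '+', NUMBER '5'.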

Background

This fix is needed so that the translator (the mechanism that sends tokens to the parser) can determine the actual value of number tokens during lexical analysis. An example of the problem follows.

In two’s complement representation, 4-bit signed integer values range from -8 to 7 (represented as 1000 and 0111, respectively), so we should be able to write the literal -8 in source code. But the lexer doesn’t see -8 as a single token; it sees two tokens: a punctuator with value -, followed by an integer with value 8. The translator will then fail to compute the mathematical value of the token 8, since it’s out of range (there’s no bit sequence that represents 8 in signed 4-bit precision).

If we allow the token to include -, then the translator will successfully compute its mathematical value as -8 and represent it as 1000 in memory.
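
The 4-bit example can be checked with a short sketch. The function `toTwosComplement` is a hypothetical name introduced here for illustration and is not part of the project; it only models the range check and bit pattern described above.

```typescript
// Hypothetical helper (not from the project's source).
// Returns the two's complement bit pattern of `value` in `bits` signed bits,
// or throws if the value is out of range. Intended for small widths (bits < 31).
function toTwosComplement(value: number, bits: number): string {
    const min = -(1 << (bits - 1));     // -8 when bits is 4
    const max = (1 << (bits - 1)) - 1;  //  7 when bits is 4
    if (Math.floor(value) !== value || value < min || value > max) {
        throw new RangeError(`${value} does not fit in ${bits} signed bits`);
    }
    // Masking keeps the low `bits` bits, which maps a negative value
    // to its two's complement pattern.
    const pattern = value & ((1 << bits) - 1);
    let digits = pattern.toString(2);
    while (digits.length < bits) digits = '0' + digits;
    return digits;
}
```

Here `toTwosComplement(-8, 4)` returns '1000' and `toTwosComplement(7, 4)` returns '0111', whereas `toTwosComplement(8, 4)` throws a RangeError. That last case corresponds to the failure described above, where the unsigned token 8 is cooked on its own without its sign.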