Lexing priority when literal matches several token definitions

JoAllg commented 2 years ago

consider the following defintion test.cf with a token rule named "Int":

token` Int '-'? digit+ ;

entrypoints Prog ;
Code. Prog ::= Instr ";" [Instr] ;

separator Instr ";" ;
_.      Instr ::= Instr ";" ;
IPUSH.  Instr  ::= "PUSH" CType Data ;
IADD.   Instr  ::= "DROP" ;

DInt.   Data         ::= Int ;
CInt.   CType        ::= "int" ;

There is no problem with this definition, i.e

bnfc -d -m -o test_haskell test.cf && make -C test_haskell
echo "DROP; PUSH int -5; PUSH int 1;" | test_haskell/Test/Test

Parse Successful!

[Abstract Syntax]

Code IADD [IPUSH CInt (DInt (Int "-5")),IPUSH CInt (DInt (Int "1"))]

[Linearized tree]

DROP;
PUSH int -5;
PUSH int 1

however if I add another token rule token Nat digit+ ;above the "Int" rule, then the positive Number 1 in my test is not recognized! The negative number -5 is fine though.

Parse              Failed...

Tokens:
1:1     "DROP"
1:5     ";"
1:7     "PUSH"
1:12    "int"
1:16    "-5"
1:18    ";"
1:20    "PUSH"
1:25    "int"
1:29    "1"
1:30    ";"
syntax error at line 1, column 29 before `1'

This does not seem to happen with other rules on top of the Nat rule, e.g. token Hex '0''x' ( digit | ["abcdefABCDEF"] )+ ;

PS: btw, why are Integer and Double defined as positive numbers? Seems strange to me or I do not know how you would use this definitions to work with negative numbers.

andreasabel commented 2 years ago

@JoAllg:

however if I add another token rule token Nat digit+ ;above the "Int" rule, then the positive Number 1 in my test is not recognized!

This is a general effect with lexical analysis: Number 1 matches both Int and Nat, so the ambiguity is resolved by priority. Since Nat comes before Int, it gets the priority for equally long matches. Number -5 only matches Int, so it is still accepted.

If you want both Nat and Int, I recommend to define Int as a category (via a grammar rule) and not as token type, e.g.:

Pos.  Int ::= Nat;
Neg.  Int ::= "-" Nat;

PS: btw, why are Integer and Double defined as positive numbers? Seems strange to me or I do not know how you would use this definitions to work with negative numbers.

See above, the - can be handled in the grammar.

It wasn't my design choice, but restricting to positive numbers seems more robust. Unary minus -x can be awkward to have if there is also binary minus (E.g. 0 - x). E.g., in GHC there is the somewhat weird asymmetry:

ghci> :t (- 1)

(- 1) :: Num a => a
ghci> :t (+ 1)

(+ 1) :: Num a => a -> a

Here, (+ 1) is the function that adds 1, but (- 1) is the literal -1.

JoAllg commented 2 years ago

thanks for the reply. I didn't realize that i itnroduced ambiguity in the definition. So even if I remove the Nat toke rule, the ambiguity will still exist with regards to the basic type Integer, right?

Also thanks for the recomendation, I wanted to reduce the introduction of new tokens. But so be it :-)

andreasabel commented 2 years ago

So even if I remove the Nat toke rule, the ambiguity will still exist with regards to the basic type Integer, right?

Yes, but the Integer definition will not be included in the generated lexer unless you mention Integer in your grammar.

The advantage of built-in Integer over self-defined Nat is that in the syntax tree you get an actual number (Haskell type Integer) where as for Nat you would get a string that you have to convert into a number (newtype Nat = Nat String).

BNFC / bnfc

Lexing priority when literal matches several token definitions #379