"unrecognized input" after upgrade #208

Open danabr opened 1 month ago

danabr commented 1 month ago


I recently upgraded from FsLexYacc 10.0 to the latest 11.3.0. After the upgrade, parsing a comment line // ä now fails with "unrecognized input". I have made no changes to the lexer or parser options, nor to the parser or lexer definitions.

Repro steps

I have managed to create a small-ish reproducer:


Parser.fsy:

%token EOF
%token <string*FSharp.Text.Lexing.Position> IDENTIFIER

%start top
%type <string> top

%%

top: EOF { "hello" }


Lexer.fsl:

{
module Lexer

open FSharp.Text.Lexing
open Parser

let lexeme lexbuf = LexBuffer<char>.LexemeString lexbuf
}


let alpha = ['a' - 'z' 'A' - 'Z']
let swe = ['ä' 'Ä' 'ö' 'Ö' 'å' 'Å' ]
let letter = alpha | swe
let ident = letter+
let newline = ('\n' | "\r\n" )

rule token = parse
| "//"           { commentline lexbuf.StartPos lexbuf }
| ident          { IDENTIFIER(lexeme lexbuf, lexbuf.StartPos) }
| newline        { token lexbuf }
| eof            { EOF }
| _              { failwith "unknown token" }

and commentline p = parse
| newline        { token lexbuf }
| eof            { EOF }
| _              { commentline p lexbuf }


Program.fs:

open Parser
open Lexer

let input = "// ä"
let lexbuf = FSharp.Text.Lexing.LexBuffer<_>.FromString input
let result = Parser.top Lexer.token lexbuf

printfn "%s" result


Project file:

<Project Sdk="Microsoft.NET.Sdk">

    <ItemGroup>
        <PackageReference Include="FsLexYacc.Runtime" Version="11.3.0" />
        <PackageReference Include="FsLexYacc" Version="11.3.0" />
    </ItemGroup>

    <ItemGroup>
        <FsLex Include="Lexer.fsl" />
        <FsYacc Include="Parser.fsy">
            <OtherFlags>--module Parser</OtherFlags>
        </FsYacc>
        <Compile Include="Parser.fs" />
        <Compile Include="Lexer.fs" />
        <Compile Include="Program.fs" />
    </ItemGroup>

</Project>

Expected behavior

Running the program above with dotnet run should print "hello".

Actual behavior

Instead, the program throws an exception with the following stack trace:

Unhandled exception. System.Exception: unrecognized input
   at FSharp.Text.Lexing.LexBuffer`1.EndOfScan() in /home/runner/work/FsLexYacc/FsLexYacc/src/FsLexYacc.Runtime/Lexing.fs:line 128
   at FSharp.Text.Lexing.UnicodeTables.scanUntilSentinel(LexBuffer`1 lexBuffer, Int32 state) in /home/runner/work/FsLexYacc/FsLexYacc/src/FsLexYacc.Runtime/Lexing.fs:line 448
   at Lexer.commentline(Position p, LexBuffer`1 lexbuf) in C:\cygwin64\home\daab\dev\FsLexYaccRepro\Lexer.fs:line 81
   at Lexer.token(LexBuffer`1 lexbuf) in C:\cygwin64\home\daab\dev\FsLexYaccRepro\Lexer.fs:line 18
   at Program.result@6.Invoke(LexBuffer`1 lexbuf)
   at FSharp.Text.Parsing.Implementation.interpret[tok,a](Tables`1 tables, FSharpFunc`2 lexer, LexBuffer`1 lexbuf, Int32 initialState) in /home/runner/work/FsLexYacc/FsLexYacc/src/FsLexYacc.Runtime/Parsing.fs:line 346
   at FSharp.Text.Parsing.Tables`1.Interpret[char](FSharpFunc`2 lexer, LexBuffer`1 lexbuf, Int32 startState) in /home/runner/work/FsLexYacc/FsLexYacc/src/FsLexYacc.Runtime/Parsing.fs:line 498
   at Parser.engine[a](FSharpFunc`2 lexer, LexBuffer`1 lexbuf, Int32 startState) in C:\cygwin64\home\daab\dev\FsLexYaccRepro\Parser.fs:line 111
   at Parser.top[a](FSharpFunc`2 lexer, LexBuffer`1 lexbuf) in C:\cygwin64\home\daab\dev\FsLexYaccRepro\Parser.fs:line 113
   at <StartupCode$FsLexYaccRepro>.$Program.main@() in C:\cygwin64\home\daab\dev\FsLexYaccRepro\Program.fs:line 6

Note that parsing the input "// a" works fine. Also, parsing works if I remove ä from swe in Lexer.fsl.
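
To compare the two inputs side by side, Program.fs could be swapped for something like the following. This is only a sketch: it assumes the Parser and Lexer modules generated from the files above, and the `inputs` list and the output format are illustrative, not part of the original reproducer.

```fsharp
open FSharp.Text.Lexing

// The two inputs mentioned in this report: plain ASCII and one with 'ä'.
let inputs = [ "// a"; "// ä" ]

for input in inputs do
    // A fresh lexbuf per input so no lexer state carries over between runs.
    let lexbuf = LexBuffer<char>.FromString input
    try
        let result = Parser.top Lexer.token lexbuf
        printfn "%A -> %s" input result
    with ex ->
        // Print the exception instead of crashing, so both inputs are exercised.
        printfn "%A -> %s: %s" input (ex.GetType().Name) ex.Message
```

Going by the behavior described above, the first input should yield "hello" and the second should surface the "unrecognized input" exception. Catching the exception keeps one failing input from hiding the result of the other.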

danabr commented 1 month ago

Bisection indicates that the regression was introduced with 48ec571 (break out core domain logic and generation into core libraries (#144), 2021-01-27).