Activating case insensitive option crash the lexer generator

lkluc commented 3 years ago

Description

When trying to generate the lexer using the command line, I get a crash with an error message :

$ dotnet ../packages/FsLexYacc.10.2.0/build/fslex/netcoreapp3.1/fslex.dll Lexer_fail_option_i.txt -i --unicode -o Lexer.fs
compiling to dfas (can take a while...)
FSLEX: error FSL000: System.Collections.Generic.KeyNotFoundException: The given key '4294967040' was not present in the dictionary.
   at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
   at FsLexYacc.FsLex.AST.CompileRegexp@187(FSharpMap`2 macros, NfaNodeMap nfaNodeMap, Regexp re, NfaNode dest) in /Users/sergey/github/FsLexYacc/src/FsLex/fslexast.fs:line 195
   at FsLexYacc.FsLex.AST.trs@190-1.Invoke(Regexp re) in /Users/sergey/github/FsLexYacc/src/FsLex/fslexast.fs:line 190
   at Microsoft.FSharp.Primitives.Basics.List.mapToFreshConsTail[a,b](FSharpList`1 cons, FSharpFunc`2 f, FSharpList`1 x)
   at Microsoft.FSharp.Primitives.Basics.List.map[T,TResult](FSharpFunc`2 mapping, FSharpList`1 x)
   at FsLexYacc.FsLex.AST.CompileRegexp@187(FSharpMap`2 macros, NfaNodeMap nfaNodeMap, Regexp re, NfaNode dest) in /Users/sergey/github/FsLexYacc/src/FsLex/fslexast.fs:line 190
   at FsLexYacc.FsLex.AST.trs@262.Invoke(Int32 n, Tuple`2 x) in /Users/sergey/github/FsLexYacc/src/FsLex/fslexast.fs:line 262
   at Microsoft.FSharp.Primitives.Basics.List.mapiToFreshConsTail[a,b](FSharpList`1 cons, FSharpFunc`3 f, FSharpList`1 x, Int32 i)
   at Microsoft.FSharp.Primitives.Basics.List.mapi[T,TResult](FSharpFunc`2 f, FSharpList`1 x)
   at Microsoft.FSharp.Collections.ListModule.MapIndexed[T,TResult](FSharpFunc`2 mapping, FSharpList`1 list)
   at FsLexYacc.FsLex.AST.LexerStateToNfa(FSharpMap`2 macros, FSharpList`1 clauses) in /Users/sergey/github/FsLexYacc/src/FsLex/fslexast.fs:line 262
   at FsLexYacc.FsLex.AST.Compile@391-1.Invoke(Tuple`2 tupledArg) in /Users/sergey/github/FsLexYacc/src/FsLex/fslexast.fs:line 392
   at FsLexYacc.FsLex.Driver.main() in /Users/sergey/github/FsLexYacc/src/FsLex/fslex.fs:line 90

If I remove the "-i" option, it passes :

$ dotnet ../packages/FsLexYacc.10.2.0/build/fslex/netcoreapp3.1/fslex.dll Lexer_fail_option_i.txt --unicode -o Lexer.fs
compiling to dfas (can take a while...)
7 states
writing output

When trying to remove the "--unicode" option it says I should use it :

$ dotnet ../packages/FsLexYacc.10.2.0/build/fslex/netcoreapp3.1/fslex.dll Lexer_fail_option_i.txt -i -o Lexer.fs
compiling to dfas (can take a while...)
FSLEX: error FSL000: the Unicode character '178' may not be used unless --unicode is specified

This is strange because I don't see any unicode characters in my text, even "file" says so :

$ file Lexer_fail_option_i.txt
Lexer_fail_option_i.txt: ASCII text

Thanks!

Repro steps

Use this file to reproduce :

Lexer_fail_option_i.txt

Known workarounds

It seems to work if I don't activate the case insensitive option.

Related information

Windows 10
Version 10.2.0

jimfoye commented 2 years ago

This happens because DecodeChar() gets called with a Unicode category, which is not present in the encoded chars dictionary. I believe this line in LexerStateToNfa()

if ctx.caseInsensitive && c <> Eof then

should be

if ctx.caseInsensitive && c <> Eof && not (IsUnicodeCategory c) then

Note that IsUnicodeCategory() is otherwise unused.

If someone can confirm, I'll submit a PR.

gdziadkiewicz commented 2 years ago

@jimfoye I took a look and it looks like you are right. The only thing that bothers me is the question of how should unicode categories like lowercase letters or uppercase letters behave in case insensitive mode. I will check how ocamllex handles that tomorrow and come back here with the findings.

gdziadkiewicz commented 2 years ago

I checked the docs and it looks like ocamllex does not provide the case-insensitive option.

jimfoye commented 2 years ago

@dsyme or anyone else, thoughts?

dsyme commented 2 years ago

The proposed fix looks reasonable. However I didn't do the implementation of case insensitivity.

jimfoye commented 2 years ago

Ah right, I think @gdziadkiewicz is who I need to ask.

gdziadkiewicz commented 2 years ago

@jimfoye I see 3 solutions for the problem:

Make Unicode and case-insensitive options mutually exclusive.
Do the fix as you proposed (it means excluding Unicode categories from the case-insensitive effect).
Define case-insensitive transformations for Unicode categories.

Which of them fits your use case the most? Do you expect the Unicode lowercase letters category to match only lowercase letters in case-insensitive mode or all letters both uppercase and lowercase? I don't plan to use both Unicode and case-insensitive options at the same time so as far as I'm concerned you can go with any of those.

jimfoye commented 2 years ago

Thanks for the feedback! I need to defer to @lkluc and others to comment on their use cases, as really I was just jumping in with some debugging skills to figure out why the crash was happening. I may not be the best person to submit a PR for whatever is decided upon.

lkluc commented 2 years ago

Hello all, thanks for following up !

I would have to double check, it's been a long time, but I think my current use case does not really needs to have unicode support enabled BUT, I am french Canadian and we are using characters with accents (such as "é", "à", "ï"...).

We do usually try to avoid those characters in program code and data as it is often sources of problems... (as you can see :) ) but I easily imagine us (or someone else out there) to have the need to recognize that "é" is the same as "É".

My use case is dealing with math equations, so as I like my programs to be as close as fool proof as possible, even if it is not supposed to happen to received such characters in variable names, my program deals with real life objects and there could be an ID of something real passing through with such characters, and I don't want my program to crash if it could otherwise behave correctly (the data comes from another department, specs could evolve, intern mistakes... ;) )

I think "option 1" is to avoid at all cost because it would mean the parser would no longer accept any Unicode string, right ?

Since I don't have a unicode character in my symbols/tokens of the lexer, I guess if "option 3" is hard to implement, "option 2" would be okay.

So this means I could still have my Unicode being recognize (even in variable names) my case insensitive option activated (so that "max", "Max", "mAx" would be all the same token) but I could not have "équation" being the same token as "Équation" (unless specifying it explicitly in the lexer...) I am right ? If so, option 2 seems qui reasonable to me !

Thanks !

gdziadkiewicz commented 2 years ago

Hi, I found time to fix and motivation to fix this. I did a check of the Unicode categories and it looks like it's simpler than I expected. I plan to go with option 3 and the mapping of the Unicode categories will transform each of the three cased categories into all three of the cased categories.

While this is not hard to do currently there are no working Unicode tests and I decided to change that first by resurrecting the legacy Unicode tests (and test2 while at it, and adding the new Expecto test projects to the pipeline). So I need to first finish #157 and get it merged. After that expect a PR that will add additional Unicode+caseInsensitive tests and a fix for this issue.

gdziadkiewicz commented 2 years ago

While adding provided code as a test case I noticed that this issue contains two independent bugs:

-i does not work with provided fsl and reports use of Unicode which is totally unexpected.
-i does not work with --unicode

The second one is due to Unicode Categories not being handled properly as explained in the discussion and will be fixed in #158 .

For the first one, I will provide more details and hopefully a fix today.

gdziadkiewicz commented 2 years ago

Case-insensitivity implementation assumes that upper and lower case letters are always contained together in the code page. This is not true for example for 0xFF for which the upper case letter is 0x178: Screenshot 2022-01-28 011116 It affects | _ rule from the attached test case and will probably also affect | [^ charset] rules and literal use of the problematic letters.

fsprojects / FsLexYacc