Closed lkluc closed 2 years ago
This happens because DecodeChar() gets called with a Unicode category, which is not present in the encoded chars dictionary. I believe this line in LexerStateToNfa()
if ctx.caseInsensitive && c <> Eof then
should be
if ctx.caseInsensitive && c <> Eof && not (IsUnicodeCategory c) then
Note that IsUnicodeCategory() is otherwise unused.
If someone can confirm, I'll submit a PR.
@jimfoye I took a look and it looks like you are right. The only thing that bothers me is the question of how should unicode categories like lowercase letters or uppercase letters behave in case insensitive mode. I will check how ocamllex handles that tomorrow and come back here with the findings.
I checked the docs and it looks like ocamllex does not provide the case-insensitive option.
@dsyme or anyone else, thoughts?
The proposed fix looks reasonable. However I didn't do the implementation of case insensitivity.
Ah right, I think @gdziadkiewicz is who I need to ask.
@jimfoye I see 3 solutions for the problem:
Which of them fits your use case the most? Do you expect the Unicode lowercase letters category to match only lowercase letters in case-insensitive mode or all letters both uppercase and lowercase? I don't plan to use both Unicode and case-insensitive options at the same time so as far as I'm concerned you can go with any of those.
Thanks for the feedback! I need to defer to @lkluc and others to comment on their use cases, as really I was just jumping in with some debugging skills to figure out why the crash was happening. I may not be the best person to submit a PR for whatever is decided upon.
Hello all, thanks for following up !
I would have to double check, it's been a long time, but I think my current use case does not really needs to have unicode support enabled BUT, I am french Canadian and we are using characters with accents (such as "é", "à", "ï"...).
We do usually try to avoid those characters in program code and data as it is often sources of problems... (as you can see :) ) but I easily imagine us (or someone else out there) to have the need to recognize that "é" is the same as "É".
My use case is dealing with math equations, so as I like my programs to be as close as fool proof as possible, even if it is not supposed to happen to received such characters in variable names, my program deals with real life objects and there could be an ID of something real passing through with such characters, and I don't want my program to crash if it could otherwise behave correctly (the data comes from another department, specs could evolve, intern mistakes... ;) )
I think "option 1" is to avoid at all cost because it would mean the parser would no longer accept any Unicode string, right ?
Since I don't have a unicode character in my symbols/tokens of the lexer, I guess if "option 3" is hard to implement, "option 2" would be okay.
So this means I could still have my Unicode being recognize (even in variable names) my case insensitive option activated (so that "max", "Max", "mAx" would be all the same token) but I could not have "équation" being the same token as "Équation" (unless specifying it explicitly in the lexer...) I am right ? If so, option 2 seems qui reasonable to me !
Thanks !
Hi, I found time to fix and motivation to fix this. I did a check of the Unicode categories and it looks like it's simpler than I expected. I plan to go with option 3 and the mapping of the Unicode categories will transform each of the three cased categories into all three of the cased categories.
While this is not hard to do currently there are no working Unicode tests and I decided to change that first by resurrecting the legacy Unicode tests (and test2 while at it, and adding the new Expecto test projects to the pipeline). So I need to first finish #157 and get it merged. After that expect a PR that will add additional Unicode+caseInsensitive tests and a fix for this issue.
While adding provided code as a test case I noticed that this issue contains two independent bugs:
-i
does not work with provided fsl and reports use of Unicode which is totally unexpected.-i
does not work with --unicode
The second one is due to Unicode Categories not being handled properly as explained in the discussion and will be fixed in #158 .
For the first one, I will provide more details and hopefully a fix today.
Case-insensitivity implementation assumes that upper and lower case letters are always contained together in the code page. This is not true for example for 0xFF for which the upper case letter is 0x178:
It affects
| _
rule from the attached test case and will probably also affect | [^ charset]
rules and literal use of the problematic letters.
Description
When trying to generate the lexer using the command line, I get a crash with an error message :
If I remove the "-i" option, it passes :
When trying to remove the "--unicode" option it says I should use it :
This is strange because I don't see any unicode characters in my text, even "file" says so :
Thanks!
Repro steps
Use this file to reproduce :
Lexer_fail_option_i.txt
Known workarounds
It seems to work if I don't activate the case insensitive option.
Related information