Closed YannSuisini closed 2 months ago
The problem comes from the handling of ranges in character classes. In your grammar:
REGEX -> ([0-9a-zA-Z_+-|?{}*\[\]]+) ;
^
There is a problem with the dash (-), it is interpreted as a range, this is probably incorrect given the semantics of the name.
Here this means that every character from +
(U+002B) to |
(U+007C) included, probably not what you want.
The hanging part comes from the DFA minimization that can be expensive when large spans like this one intersects with others. I don't know exactly why it used to work in the version 3 series.
Now, to fix this, there is another bug that prevent you to simply escape the -
. So for the time being you have to resort to the ugly following, that nonetheless works:
REGEX -> ([0-9a-zA-Z_+|?{}*\[\]] | '-')+ ;
To fix this ticket:
\\-
in character classes => not a rangeThanls for the trick ! I firstly escaped all regex specific chars, but I had other error compilers . I ended escaping only the [ and ] to get it working (at least on 3.5)
The problem comes from the handling of ranges in character classes. In your grammar:
REGEX -> ([0-9a-zA-Z_+-|?{}*\[\]]+) ; ^
There is a problem with the dash (-), it is interpreted as a range, this is probably incorrect given the semantics of the name. Here this means that every character from
+
(U+002B) to|
(U+007C) included, probably not what you want.The hanging part comes from the DFA minimization that can be expensive when large spans like this one intersects with others. I don't know exactly why it used to work in the version 3 series.
Now, to fix this, there is another bug that prevent you to simply escape the
-
. So for the time being you have to resort to the ugly following, that nonetheless works:REGEX -> ([0-9a-zA-Z_+|?{}*\[\]] | '-')+ ;
I also got this issue before independently, which my initial educated guess led me to powerset construction, due to the curious observation that the execution time will approach O(2^n) if you have a lot of ranges in your regex and will all of a sudden drop if you replace the regex with a simpler one.
This problem is prominent if you want to implement identifier token with Unicode (UAX #31 to be precise), and then I replaced it with ASCII and Unicode/CJK respectively.
And despite I think I'm ultimately right with my gut feeling, I think it is a separate issue.
Thank you for your feedback. This is now fixed on master with the new (fixed) normalisation process on NFAs. I will make a release shortly today.
Hi ,
I updated my Himecc from v3.5.1 to last stable version v4.2.4. But my grammar cannot be compiled with this version , it's blocking forever. Poolit.zip
Regards,