cenotelie / hime

Apache License 2.0
27 stars 4 forks source link

Himecc is blocking forever during compilation (v4.2.4) #103

Closed YannSuisini closed 2 months ago

YannSuisini commented 12 months ago

Hi ,

I updated my Himecc from v3.5.1 to last stable version v4.2.4. But my grammar cannot be compiled with this version , it's blocking forever. Poolit.zip

Regards,

woutersl commented 12 months ago

The problem comes from the handling of ranges in character classes. In your grammar:

REGEX -> ([0-9a-zA-Z_+-|?{}*\[\]]+) ;
                      ^

There is a problem with the dash (-), it is interpreted as a range, this is probably incorrect given the semantics of the name. Here this means that every character from + (U+002B) to | (U+007C) included, probably not what you want.

The hanging part comes from the DFA minimization that can be expensive when large spans like this one intersects with others. I don't know exactly why it used to work in the version 3 series.

Now, to fix this, there is another bug that prevent you to simply escape the -. So for the time being you have to resort to the ugly following, that nonetheless works:

REGEX -> ([0-9a-zA-Z_+|?{}*\[\]] | '-')+ ;
woutersl commented 12 months ago

To fix this ticket:

YannSuisini commented 12 months ago

Thanls for the trick ! I firstly escaped all regex specific chars, but I had other error compilers . I ended escaping only the [ and ] to get it working (at least on 3.5)

stevefan1999-personal commented 11 months ago

The problem comes from the handling of ranges in character classes. In your grammar:

REGEX -> ([0-9a-zA-Z_+-|?{}*\[\]]+) ;
                      ^

There is a problem with the dash (-), it is interpreted as a range, this is probably incorrect given the semantics of the name. Here this means that every character from + (U+002B) to | (U+007C) included, probably not what you want.

The hanging part comes from the DFA minimization that can be expensive when large spans like this one intersects with others. I don't know exactly why it used to work in the version 3 series.

Now, to fix this, there is another bug that prevent you to simply escape the -. So for the time being you have to resort to the ugly following, that nonetheless works:

REGEX -> ([0-9a-zA-Z_+|?{}*\[\]] | '-')+ ;

I also got this issue before independently, which my initial educated guess led me to powerset construction, due to the curious observation that the execution time will approach O(2^n) if you have a lot of ranges in your regex and will all of a sudden drop if you replace the regex with a simpler one.

This problem is prominent if you want to implement identifier token with Unicode (UAX #31 to be precise), and then I replaced it with ASCII and Unicode/CJK respectively.

And despite I think I'm ultimately right with my gut feeling, I think it is a separate issue.

woutersl commented 2 months ago

Thank you for your feedback. This is now fixed on master with the new (fixed) normalisation process on NFAs. I will make a release shortly today.