commonmark / cmark

CommonMark parsing and rendering library and program in C

re2c: Disable UTF-8 #540

Closed nwellnhof closed 3 months ago

nwellnhof commented 3 months ago

The regexes don't require UTF-8 features and work in ASCII mode as well. Disabling UTF-8 reduces the size of the code generated by re2c by a couple of KBs.

I regenerated the regex code with re2c 3.0 because that's what I have on Ubuntu 22.04. I also had to add a (void) marker line to suppress an unused-variable warning. Feel free to regenerate with your version of re2c.
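
For readers unfamiliar with the idiom, here is a minimal sketch of how casting a variable to void silences an unused-variable warning. The function scan_spaces and its body are hypothetical and are not taken from the cmark sources; only the variable name marker comes from the comment above.

```c
#include <stddef.h>

/* Hypothetical scanner, for illustration only. A `marker` pointer is
   declared because generated code may need it for backtracking, but
   this particular rule never reads it. */
size_t scan_spaces(const unsigned char *p) {
    const unsigned char *start = p;
    const unsigned char *marker = NULL;

    /* Evaluate `marker` and discard the result. Compilers treat this as
       a use, so -Wunused-variable is not reported. */
    (void)marker;

    /* Count leading spaces. */
    while (*p == ' ') {
        p++;
    }
    return (size_t)(p - start);
}
```

The cast has no runtime effect, so it is safe to keep even when a different re2c version does end up using the variable.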

nwellnhof commented 3 months ago

There's still quite a bit of bloat in the re2c-generated code, but that's hard to fix. The main issue is that re2c seems to handle {m} style quantifiers by creating m copies of the subregex. This approach is taken by regex engines like RE2 (unrelated to re2c) as well, but it isn't well-suited to ahead-of-time compilation.
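
To illustrate why copying the subregex inflates code size, here is a hand-written sketch for the hypothetical pattern [0-9]{4}. It is not actual re2c output, and the function names are made up for this example. The first function mirrors the "m copies" expansion, where each repetition gets its own check; the second is a hand-written alternative that uses a counter, which a pure DFA generator has no direct way to express.

```c
#include <stddef.h>

static int is_digit(unsigned char c) {
    return c >= '0' && c <= '9';
}

/* "m copies" style: one separate check per repetition of [0-9].
   Code size grows linearly with m (here m = 4).
   Returns the number of bytes matched, or 0 on failure. */
size_t match_digits4_unrolled(const unsigned char *p) {
    if (!is_digit(p[0])) return 0; /* copy 1 */
    if (!is_digit(p[1])) return 0; /* copy 2 */
    if (!is_digit(p[2])) return 0; /* copy 3 */
    if (!is_digit(p[3])) return 0; /* copy 4 */
    return 4;
}

/* Counter style: a single check inside a loop, so code size does not
   depend on m. Returns m on success, or 0 on failure. */
size_t match_digits_counted(const unsigned char *p, size_t m) {
    for (size_t i = 0; i < m; i++) {
        if (!is_digit(p[i])) return 0;
    }
    return m;
}
```

Both functions stop at the first non-digit, so they never read past the terminating NUL of a C string. For a four-byte subregex the difference is small, but with larger subregexes or larger m the unrolled form multiplies the generated states.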