PrismJS / prism

Lightweight, robust, elegant syntax highlighting.
https://prismjs.com
MIT License

Use Regex Optimizer #2397

Open idleberg opened 4 years ago

idleberg commented 4 years ago

Motivation

Since I contributed the NSIS component in 2014, its RegEx patterns have been optimized by other contributors. I'm not entirely sure about the motivation for these changes (better performance? smaller code?), but either way, these optimizations make the component harder to maintain. For the most part, I kept adding new syntax, but over time my motivation steadily decreased.

Description

Libraries such as regexp-tree provide an API to optimize patterns. I wonder whether you have considered using this (or a similar library) to create optimized builds of the language components. The result would be a codebase that is both maintainable and optimized (for whatever reason, see above).
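For illustration, here is a minimal sketch of how regexp-tree's optimize API could be wired into such a build step (based on its README; the exact transforms and output may vary by version):

const { optimize } = require('regexp-tree');

// A deliberately redundant pattern, as a grammar author might write it.
const original = /[0-9a-zA-Z_][0-9a-zA-Z_]*/;

// optimize() returns a transform result; toRegExp() yields the optimized regex.
console.log(optimize(original).toRegExp()); // something like /\w+/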

RunDevelopment commented 4 years ago

Most of the transformations may not optimize (as in making the pattern faster) but rather minify it (e.g. [0-9] -> [\d] -> \d). This is very useful for making patterns more understandable because these trivial optimizations remove almost all redundancy and generally produce patterns that do the same thing with less code. (Some transformations (quantifiersMerge, disjunctionRemoveDuplicates, and groupSingleCharsToCharClass) may make your patterns faster because they can "fix" polynomial and exponential backtracking.)
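To make that concrete, here are a few illustrative before/after pairs (my own examples, not taken from any library's docs):

// Minification: same language, less code.
// /[0-9]/      ->  /\d/
// /(?:a|b|c)/  ->  /[abc]/  (groupSingleCharsToCharClass)

// Backtracking fixes: merging removes the ambiguity that lets the
// engine try many different ways to split a match.
// /\w+\w*/     ->  /\w+/    (quantifiersMerge; the original can backtrack polynomially)
// /(?:a|a)+/   ->  /a+/     (disjunctionRemoveDuplicates; the original backtracks exponentially on failure)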

I don't think that the build process should do these optimizations. I think that the programmer shouldn't add the redundancies in the first place!

With this in mind, I wrote this little ESLint plugin: eslint-plugin-clean-regex. The main focus is to detect redundancies, potential errors, and plain errors. (Almost) all rules that detect redundancies also provide a fixer, so you can use it as an optimizer.
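Assuming the usual ESLint plugin conventions, wiring it up as an optimizer would look something like this (.eslintrc sketch; the recommended config name is an assumption until v1.0 is out):

{
  "plugins": ["clean-regex"],
  "extends": ["plugin:clean-regex/recommended"]
}

Running eslint --fix would then apply all the fixers automatically.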

It's not quite ready yet, but I plan to release v1.0 in the next few weeks.

idleberg commented 4 years ago

To give an example that illustrates the complexity of some of these patterns:

File(BufSize|Close|ErrorText|Open|Read|ReadByte|ReadUTF16LE|ReadWord|WriteUTF16LE|Seek|Write|WriteByte|WriteWord)?

There is even more potential for optimizing them (see FileRead*), but that would make them even more complex to maintain, and more prone to errors.

(Almost) all rules that detect redundancies also provide a fixer, so you can use it as an optimizer.

This sounds good to me, but I guess making good use of this also requires adding a linter to the project. Personally, I like using husky with a pre-commit hook that runs the linter.
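For what it's worth, that setup could look something like this in package.json (a husky v4-style sketch; the exact lint command is just an example):

{
  "husky": {
    "hooks": {
      "pre-commit": "eslint --fix components/"
    }
  }
}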

RunDevelopment commented 4 years ago

There is even more potential for optimizing them (see FileRead*), but that would make them even more complex to maintain, and more prone to errors.

I should point out that factoring out shared prefixes/suffixes can actually degrade matching performance in some cases. I implemented something like this for #1826 and found that the resulting patterns were slower. All I did was factor out the prefixes/suffixes of adjacent alternations, so File(?:BufSize|Close|ErrorText|Open|Read|ReadByte|ReadUTF16LE|ReadWord|WriteUTF16LE|Seek|Write|WriteByte|WriteWord)? would have been transformed into File(?:BufSize|Close|ErrorText|Open|Read(?:Byte|Word|UTF16LE)?|Seek|Write(?:Byte|Word|UTF16LE)?)?.

Note 1: There is some reordering because the algorithm assumed that the order of alternatives doesn't matter. This isn't true in general, but keyword lists are usually wrapped in \bs.

Note 2: BufSize|Close weren't factored out to (?:BufSiz|Clos)e because the suffix is too small. I had a metric to determine whether a prefix/suffix should be factored out; small prefixes/suffixes shared by few alternatives weren't. The actual metric was more complex to allow for cases like a(?:bc|bd) -> ab[cd], FooBar|FooBaz -> FooBa[rz], and (?:aa|ab|ac|...|az) -> a(?:a|b|c|...|z) (one shared prefix/suffix for all alternatives).

Note 3: The purely prefix-and-suffix-based algorithm wasn't smart enough to factor out Read|Write. I also re-implemented the whole thing via DFA minimization, which did factor out Read|Write, but it's really hard to convert a minimal DFA into a minimal regex. (It's easy to get a regex but hard to get a minimal one; basically, you might have to revert some of the factoring to get a regex at all.)

The end result of #1826 was that I spent about two weeks implementing the factoring, only to make the patterns slower. My best guess as to why they were slower is that browsers heavily optimize the simple alternation case, and all the factoring out prevents the browser from optimizing.

Factoring out prefixes/suffixes isn't a bad idea per se (if the prefix/suffix is common enough), but it won't necessarily improve performance.
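For anyone who wants to re-test this, a rough Node micro-benchmark along these lines would do (hypothetical keyword set; results vary by engine version and input):

// Flat alternation (longest keywords first) vs. factored prefixes.
const flat = /\b(?:FileReadByte|FileReadWord|FileRead|FileWrite|FileSeek)\b/g;
const factored = /\bFile(?:Read(?:Byte|Word)?|Write|Seek)\b/g;

const text = 'FileReadByte FileSeek none FileWrite '.repeat(100000);

for (const [name, re] of [['flat', flat], ['factored', factored]]) {
  const start = process.hrtime.bigint();
  const count = [...text.matchAll(re)].length;
  console.log(name, count, Number(process.hrtime.bigint() - start) / 1e6, 'ms');
}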

Personally, I like using husky with a pre-commit hook that runs the linter.

I'm actually against a pre-commit hook. It might stop some people from contributing because Prism literally doesn't let them commit.

I'd rather have them commit whatever they like and then have the tests/CI tell them to fix things.

joshgoebel commented 3 years ago

The end result of #1826 was that I spent about two weeks implementing the factoring, only to make the patterns slower. My best guess as to why they were slower is that browsers heavily optimize the simple alternation case, and all the factoring out prevents the browser from optimizing.

So sorry to hear that it was a waste of time, but also so glad to hear this news. We're definitely moving towards "easier to read and easier to maintain" with Highlight.js... i.e., long lists of keywords in an array that is just compiled to a simple (a|b|c|d|e|...) at run-time, the first time a grammar is "compiled"... so I'm glad to hear there aren't huge losses to this quite simple approach.
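Here's a sketch of that run-time compilation idea (not Highlight.js's actual compiler, just the shape of it):

// Join a keyword list into one alternation, longest-first so the regex
// can't stop at a shorter prefix of another keyword.
// Assumes the keywords contain no regex metacharacters.
function compileKeywords(words) {
  const sorted = [...words].sort((a, b) => b.length - a.length);
  return new RegExp('\\b(?:' + sorted.join('|') + ')\\b');
}

compileKeywords(['FileRead', 'FileReadByte', 'FileSeek']);
// -> /\b(?:FileReadByte|FileRead|FileSeek)\b/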

We still have lots of legacy grammar code with heavily "optimized" regexes written this way, and they're a nightmare to make any sense of... I should maybe look into using your refa stuff to see if it could actually "unwind" regexes like this back into simple a, b, c, or d patterns. :-)

RunDevelopment commented 3 years ago

I should maybe look into using your refa stuff to see if it could actually "unwind" regexes like this back into simple a, b, c, or d patterns. :-)

Yes, refa can do that. I actually use it for these kinds of tasks sometimes. Using the toNFA function, you do it like this:

[...toNFA(/a|b|cd?/).words()].map(w => Words.fromUTF16ToString(w)).forEach(s => console.log(s));
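For /a|b|cd?/ this prints every word the regex accepts: a, b, c, and cd.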

Keep in mind that refa can't handle assertions at all right now. For regexes like /\b(?:foo|bar)\b/, you have to pass it /(?:foo|bar)/ instead. Also, words will iterate over all strings that the input regex can accept, so be sure to remove the i flag.


I also want to say that some time has passed since #1826, and I know that V8 has improved its regex engine during that time. You might want to re-test the results I got back then.

You can also use refa to (attempt to) minimize regexes:

const dfa = toDFA(/a|b|cd?/);
dfa.minimize();
const minRegex = toLiteral(dfa);

(Some functions are from here.)

joshgoebel commented 3 years ago

I think I agree with idleberg - at least regarding his specific example here. We (speaking for HLJS) want the most readable and easiest-to-maintain code for grammars (clean diffs, etc.). If someone did a benchmark and showed that the "optimized" version was truly HUGELY faster, then we'd choose to do that either at build time or at "compile time" (inside the parser)... we wouldn't force that maintenance onto the maintainers of the individual grammar source files.

This type of optimization problem smells very much like a machine problem to solve, not a human one.

Yes, refa can do that. I actually use it for these kinds of tasks sometimes.

OMG, thanks for that snippet, I'm totally trying it later.

RunDevelopment commented 3 years ago

I think I agree with idleberg - at least regarding his specific example here.

Me too. I just wanted to argue against letting an automated process do this.

joshgoebel commented 3 years ago

Yeah, I think I agree with you in the smaller cases, though I'd have to see specific examples. I have weird ideas; perhaps something like: I'll use [ ] for a literal space when I ONLY want a space, 'cause I think it makes it more visible? Good? Bad? I dunno, but it works for me, lol. :)

RunDevelopment commented 3 years ago

I'll use [ ] for a literal space

OMG... IMO [ ] is the regex equivalent of var a = (0) in JS - just unnecessary brackets. You might even argue that it's less readable since you have to mentally parse a character class.

joshgoebel commented 3 years ago

LOL. Spaces just disappear otherwise, if you ask me. :) I use \s when I can, except when I need to rule out newlines...

RunDevelopment commented 3 years ago

Really? What do you think about spaces in strings then? ;) Also, when I need \s without newlines, I use [ \t].

joshgoebel commented 3 years ago

Also, when I need \s without newlines, I use [ \t].

That's probably more accurate (and we do that sometimes), but we don't think about tabs much. And no one seems to report issues, so most of the time it must not matter, or people are using spaces.

Really? What do you think about spaces in strings then? ;)

Well, duh, if it's obvious it's a space:

/(SELECT|DELETE|QUERY) FROM/

But here the spaces can "get lost" too easily. Is there a space? Is that one space or two? Is that space intentional? :) I feel like I kind of dislike raw spaces in regexes for the most part, for some reason.

/# (\d+) #/

VS

/#[ ](\d+)[ ]#/

No, it's not the best example, but it's what I had handy... Also, just my take, you don't have to agree. :)