firasdib / Regex101

This repository is currently only used for issue tracking for www.regex101.com

Regex simplifier tool #1206

Closed kkava closed 4 years ago

kkava commented 4 years ago

Feature

Would be great to have a simplify button to generate an equivalent regex that is optimized for speed, and maybe separately for readability. See for example https://www.reddit.com/r/regex/comments/9wnsnt/simplifying_ridiculous_regex/

firasdib commented 4 years ago

The idea is good, but it's very hard to know what you're actually optimizing for.

I guess if you supplied enough input data, with positive and negative entries, you could generate something. But I don't think this is an easy task at all.

radekivan commented 4 years ago

https://github.com/firasdib/Regex101.wiki.git

asasine commented 4 years ago

Since regular expressions are conceptually a form of deterministic finite automaton, you could compile them and apply DFA minimization algorithms. This seems like something the language interpreter or compiler should perform.

firasdib commented 4 years ago

@asasine That doesn't hold true for modern regex engines, unfortunately. PCRE (and other modern engines) include a lot of complex features, such as backreferences, subroutines, recursion, etc.

verdy-p commented 4 years ago

I agree, not all regexps can be converted to a DFA. In general, regexps are parsed with an NFA (i.e. using a variable number of active states instead of just one). Converting such an NFA to a DFA with a single integer state has combinatorial complexity, and the resulting DFA for even a modest NFA may not fit in memory at all: a small regexp of a few bytes could generate a DFA requiring petabytes or much more.

Consider the case of a regexp with bounded repetition counts like "A{1,1000000}B" (using "greedy" repetition). It is trivial to parse very efficiently with an NFA of a few nodes, whose "non-deterministic" state consists of only 3 boolean state variables plus a single 32-bit integer counting the repetitions. It is impossible to realize with a DFA whose single integer state would have to hold a little more than 1,000,000 bits (yes, more than a million bits just for the "deterministic" state!), and that is before storing the transition graph, whose number of activable nodes explodes combinatorially to something larger than could ever be built. So this is simply impossible.
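
To make the counter idea concrete, here is a minimal hand-written sketch (hypothetical code, not taken from any engine) matching `^A{1,1000000}B$` with a single integer counter instead of a million expanded states:

```python
def match_counted(text: str, lo: int = 1, hi: int = 1_000_000) -> bool:
    """Match ^A{lo,hi}B$ with one integer counter as the only extra state."""
    count = i = 0
    while i < len(text) and text[i] == 'A' and count < hi:   # greedy repetition
        count += 1
        i += 1
    # Accept iff the repetition count is in range and exactly 'B' remains.
    return count >= lo and i == len(text) - 1 and text[i] == 'B'

print(match_counted('A' * 500 + 'B'))  # True
print(match_counted('B'))              # False: at least one 'A' required
```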

Compare that to the NFA approach, which requires just a vector of 3 decision nodes plus some traverse-only nodes for actions on counters (never more nodes than the length of the "A{1,1000000}B" regexp itself), plus 35 bits for the "active state" object: this fits in less than 100 bytes and can be "compiled" very fast. The matcher running on this small compiled NFA then enjoys excellent data locality, giving high performance and a response time that depends only on the size of the input text, completely independent of the regexp size. If the regexp uses "non-greedy" repetition (or is not anchored with "^...$") and you want the matcher to return all possible matches, you also need a stack storing up to one million states for backtracking. (Unless the NFA nodes include a way to roll back the actions taken in the traverse-only non-decision nodes, such as reverting the increment of the counter, in which case the stack is not needed because those nodes can be traversed in the reverse direction.) Even then this costs only a few megabytes for the maximum stack depth: still perfectly doable with an NFA, still undoable with a DFA.

Regexps containing backreferences, subroutines, or recursion are NOT representable with a simple DFA at all, because you cannot predict the number of deterministic states. In fact, in addition to the set of active states in the NFA graph (active nodes can be represented in a bitset, one bit per activable node, but counters are additional integer variables stored alongside), the regexp engine collects these state variables into an object that is pushed on a stack (with an implementation-dependent maximum depth).

When you compile a regexp, the engine parses the syntax and builds a representation of the NFA using memory proportional to the length of the regexp, plus the memory for each state object in that graph (including counters), plus the memory for the stack and a new state object allocated for each allowed stack depth.
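
As an illustration of that proportionality, here is a textbook Thompson-style construction (a hedged sketch, not any engine's actual code; the input is postfix for brevity, with `.` for concatenation): each literal or operator adds O(1) states, so the compiled NFA grows linearly with the pattern.

```python
class State:
    def __init__(self):
        self.edges = {}   # char -> successor state
        self.eps = []     # epsilon (free) transitions

def compile_postfix(postfix: str):
    """Thompson construction over a postfix pattern ('.'=concat, '|', '*')."""
    stack = []
    for c in postfix:
        if c == '.':                       # concatenation: run e1 then e2
            s2, e2 = stack.pop()
            s1, e1 = stack.pop()
            e1.eps.append(s2)
            stack.append((s1, e2))
        elif c == '|':                     # alternation: fork, then join
            s2, e2 = stack.pop()
            s1, e1 = stack.pop()
            s, e = State(), State()
            s.eps += [s1, s2]
            e1.eps.append(e)
            e2.eps.append(e)
            stack.append((s, e))
        elif c == '*':                     # Kleene star: loop around the fragment
            s1, e1 = stack.pop()
            s, e = State(), State()
            s.eps += [s1, e]
            e1.eps += [s1, e]
            stack.append((s, e))
        else:                              # literal: two states, one edge
            s, e = State(), State()
            s.edges[c] = e
            stack.append((s, e))
    return stack.pop()

def eps_closure(states):
    todo, seen = list(states), set(states)
    while todo:
        for nxt in todo.pop().eps:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

def nfa_match(fragment, text: str) -> bool:
    start, end = fragment
    active = eps_closure({start})
    for ch in text:
        active = eps_closure({st.edges[ch] for st in active if ch in st.edges})
    return end in active

nfa = compile_postfix('ab|*c.')                       # i.e. (a|b)*c
print(nfa_match(nfa, 'ababc'), nfa_match(nfa, 'bx'))  # True False
```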

All this is well known: in the general case, regexp languages are NOT simple LL(n), LR(n) or LALR(n) languages, which can use deterministic algorithms with a single (hopefully small) integer state variable. But all LL(n), LR(n) and LALR(n) languages can be represented as regexps. Regexps are much more powerful than most programming languages, whose syntax is often (not always) compiled with a LALR(n) parser. LALR languages are generally favored by programmers because they look simpler to program than regexps, but at the growing cost of debuggability and of a very complex analysis to assert that the program is correct and returns what is expected; and once you start "optimizing" these LALR-based languages, the optimizations rest on unverifiable assumptions drawn from limited use cases.

It is possible to create much more powerful (and more efficient) programs with regexps than with classic programming languages, including functional ones. You could program an entire modern application in a regexp-based NFA language, and in general such a program would be easier to write, easier to test for coverage, and have fewer bugs, even if it might run a bit slower than a program compiled from a deterministic language. (Even that performance claim is limited and unproven: once the program must handle many more cases and more voluminous data, the NFA approach starts winning thanks to the better data locality of its implementation, lower resource usage and better power usage. An NFA is also easier to integrate into hardware than a giant DFA machine requiring billions of transistors, ultimately costs less for higher performance, and scales very naturally when you need more performance, without the exponential complexity of DFA approaches, which do not scale.)


Character classes are also not necessarily compiled into a small bitset if the set of characters is much larger than ASCII, for example when it refers to the Unicode character set. It is even worse when character classes refer to Unicode character properties, because the set of properties evolves as new characters are encoded, and most character properties may also change over time for the same encoded characters. For this reason, character classes are implemented as collections of static and dynamic rules, the dynamic rules being evaluated by a call to a functional hook rather than by static data alone: the generated "data" structure will contain pointers to functions (or an index into a builtin table of functions)!
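
A hedged sketch of that hybrid representation (the class name and layout are assumptions, not regex101 internals): static codepoint ranges for the stable part, plus callables standing in for the "pointers to functions" that evaluate dynamic Unicode properties.

```python
import unicodedata

class CharClass:
    def __init__(self, ranges, predicates):
        self.ranges = ranges          # static part: list of (lo, hi) codepoints
        self.predicates = predicates  # dynamic part: functions called per char

    def matches(self, ch: str) -> bool:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in self.ranges):
            return True
        return any(pred(ch) for pred in self.predicates)

# Roughly [0-9\p{Lu}]: one static ASCII range plus a dynamic property test.
digits_or_upper = CharClass(
    ranges=[(ord('0'), ord('9'))],
    predicates=[lambda c: unicodedata.category(c) == 'Lu'],
)
print(digits_or_upper.matches('7'),   # True  (static range)
      digits_or_upper.matches('Ω'),   # True  (dynamic Lu property)
      digits_or_upper.matches('x'))   # False
```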

Compiling complex substitutions containing text transforms like \L, \U, \E (and possibly other transforms added by extensions of the regexp syntax, including the ability to embed code written in another language such as Javascript, compiled separately or at the time the regexp itself is compiled, provided the engine supports that language) likewise requires representing these transformation functions. They take parameters identifying the captured groups in the source text (as substrings, or as sets of start/end positions) and return a newly computed text to the regexp engine.
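
For instance, Python's `re` has no `\U...\E` in replacement templates (PCRE and Perl do), but emulating it shows why the template must become a function over the match rather than staying a plain string; a minimal sketch:

```python
import re

def replacement(m: re.Match) -> str:
    # In spirit the PCRE/Perl replacement "\U$1\E-$2": uppercase group 1 only.
    return m.group(1).upper() + '-' + m.group(2)

print(re.sub(r'(\w+)\s+(\w+)', replacement, 'hello world'))  # HELLO-world
```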

If you want an example of a fully working programming language based on regexps, look at PCCTS, which is itself easily implemented in C, C++, C#, Java, Javascript...

Start forgetting what you learned with LALR languages. Consider that regexps and NFA are languages of the future that will easily solve much more complex problems (note that NFA programs are also easily massively parallelizable...).

Examples of applications based on NFAs are those used by "Big Data" companies for handling massive amounts of data with very complex (and often fuzzy) relations. All modern database systems include support for regexps (possibly extended by other mechanisms, like RDF collections of assertions modeled as triplets, geographic/geometric extensions for working with multidimensional data, metacubes, "hash stores", or AI decision trees and sets of AI declarations, all of which also have a simple representation as an NFA-implemented regexp).


However, the possibilities for automatic simplification are limited:

It is quite complex to do, there is in fact no real performance benefit, and the reduction in regexp size is quite small.

In some cases, factoring out common leading branches may even produce a longer regexp: remember that such factorization must not create new capture groups, so the factored leading part has to go into a non-capturing group! Likewise, using character classes to detect common alternate branches would require rewriting them (you would have to compute and represent new character classes with subtracted characters).

Finally, once you've detected common leading branches, it is not easy to merge the trailing parts even when they are supposed to be syntactically equivalent, because they can contain capture groups that you cannot merge without affecting their numbering (which would break other features like recursion and subroutines, or the final result of matches that expect specific values in specific capture group numbers).
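
A toy sketch of leading-branch factorization over purely literal alternatives (a hypothetical helper, not regex101 code); note that the factored prefix must not become a new capturing group, and that capture groups inside the tails would make the merge unsafe for the numbering reasons above:

```python
import os

def factor_prefix(branches: list[str]) -> str:
    """Factor the common leading literal out of alternation branches.
    Only safe when the branches are plain literals with no metacharacters."""
    prefix = os.path.commonprefix(branches)
    if len(prefix) < 2:          # not worth factoring
        return '|'.join(branches)
    tails = [b[len(prefix):] for b in branches]
    # The prefix stays outside any group; the tails get a NON-capturing group.
    return prefix + '(?:' + '|'.join(tails) + ')'

print(factor_prefix(['foobar', 'foobaz', 'food']))  # foo(?:bar|baz|d)
print(factor_prefix(['foo', 'food']))               # foo(?:|d)  (longer than foo|food!)
```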

To improve the performance of regexp engines, the best work has been done on representing the compiled NFA so that each node is fast to traverse, and on compacting the structure to maximize data locality. The goal is a main matcher loop written as tight code with a small set of local variables; a minimum of conditional branches, so that branch prediction stays effective; as few dynamic table lookups as possible, since those hurt branch prediction and data locality in memory caches; and maximum use of fast CPU registers to avoid external memory accesses. This is where the designer of a regexp engine makes tradeoffs between simplicity/clarity of implementation and performance, guided by comparative measurements from unit tests.
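
One common shape of that tight loop (an illustrative sketch only, with a hand-built toy NFA for `a+b`): the set of active states lives in a single machine word and advances with pure bit operations, with no allocation in the hot path.

```python
# trans[char][state] = bitmask of successor states; toy NFA for a+b:
#   state 0 --a--> state 1, state 1 --a--> state 1, state 1 --b--> state 2
NSTATES = 3
trans = {
    'a': [0b010, 0b010, 0b000],
    'b': [0b000, 0b100, 0b000],
}

def match_bitset(text: str) -> bool:
    active = 0b001                      # bit i set <=> NFA state i is active
    for ch in text:
        row = trans.get(ch)
        if row is None:                 # character with no transitions at all
            return False
        nxt = 0
        for i in range(NSTATES):        # advance every active state at once
            if active >> i & 1:
                nxt |= row[i]
        active = nxt
        if not active:                  # dead: no state survived this character
            return False
    return bool(active & 0b100)         # accept iff state 2 is active

print(match_bitset('aaab'), match_bitset('ab'), match_bitset('ba'))  # True True False
```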

kkava commented 4 years ago

A universal simplification tool isn't needed. Just something simple that shows you basic 'tricks' that a pro would apply. The simplifier could be as simple as a regex itself!

firasdib commented 4 years ago

@kkava If it can't be generalized, it's just a search and replace, which is meaningless. The basic "tricks" a pro would apply aren't as easy to fit into code as they are for a pro to come up with :).

I'll close this for now, even though it would be a great feature, as I just don't see how this would be feasible.

verdy-p commented 4 years ago

I do agree. However, there are some minor use cases. If the regexp is optimized not alone but together with its usage (you know which capture groups a substitution pattern needs, or the application knows which capture groups it uses for searches), then all unnecessary capture groups can be removed by adding the "?:" flag after their opening parenthesis, and the regexp can then be compiled by a faster engine. Going further, a regexp engine whose compile() API accepts a hint about the needed capture groups could compile the regexp, find common branches in a more optimal NFA, and regenerate a new regexp from it. So basically, this can only be done safely by a true regexp compiler that fully parses the expression.
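
A toy sketch of that group-removal helper (hypothetical, not regex101's implementation): it converts capturing groups the caller does not need into non-capturing ones, skipping escapes and character classes. Note that surviving groups get renumbered, so backreferences and substitution templates would have to be remapped too.

```python
def drop_unneeded_groups(pattern: str, needed: set[int]) -> str:
    """Turn capturing groups not listed in `needed` into non-capturing ones.
    Handles escapes and character classes, but not every PCRE construct.
    NOTE: surviving groups are renumbered, so backreferences and templates
    that use group numbers must be remapped as well."""
    out, i, group, in_class = [], 0, 0, False
    while i < len(pattern):
        c = pattern[i]
        if c == '\\' and i + 1 < len(pattern):       # copy escapes verbatim
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if c == '[':
            in_class = True
        elif c == ']':
            in_class = False
        elif c == '(' and not in_class and not pattern.startswith('(?', i):
            group += 1                               # a plain capturing group
            if group not in needed:
                out.append('(?:')
                i += 1
                continue
        out.append(c)
        i += 1
    return ''.join(out)

print(drop_unneeded_groups(r'(\d+)-(\w+)-(\s*)', needed={2}))
# -> (?:\d+)-(\w+)-(?:\s*)
```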

Conversely, reformatting a complex regexp with a "pretty print" that indents it (using the /x format) is possible with little effort, and you can just as easily reduce an /x presentation by removing unprotected whitespace/newlines and comments. This is very easy to do before (or after) parentheses (with their attached ?* flags), pipes | for alternation, and quantifiers (?, +, *, {m,n}, {m,}, {,n}, with their optional greediness flag ? or +).
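
The reduction direction is indeed almost mechanical; a sketch (an assumed helper, simplified, ignoring a few /x corner cases) that strips unescaped whitespace and `#` comments outside character classes:

```python
def unverbose(pattern: str) -> str:
    """Collapse an /x pattern: drop unescaped whitespace and # comments
    outside character classes. Simplified; real /x has more corner cases."""
    out, i, in_class = [], 0, False
    while i < len(pattern):
        c = pattern[i]
        if c == '\\' and i + 1 < len(pattern):   # escapes survive untouched
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if c == '[':
            in_class = True
        elif c == ']':
            in_class = False
        if not in_class:
            if c.isspace():                       # insignificant whitespace
                i += 1
                continue
            if c == '#':                          # comment runs to end of line
                while i < len(pattern) and pattern[i] != '\n':
                    i += 1
                continue
        out.append(c)
        i += 1
    return ''.join(out)

verbose = r"""
    (\d{4}) -   # year
    (\d{2})     # month
"""
print(unverbose(verbose))  # (\d{4})-(\d{2})
```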


The regexp editor could also include helpers for adding or removing capturing groups (i.e. the pairs of parentheses; changing one to a non-capturing group is easy, just add "?:" after the leading "(").

More complex helpers could propose:

All these "optimizations" are safe provided they do not change the set of capturing groups and their ordering (if they are implicitly ordered and not named). But actually these optimizations have no real effect on speed for matching, as they can be done simply by the regexp compiler. The only real optimization is in the generated internal representation of the NFA and how the matcher works with this NFA and its internal representation of active states to accelerate the transitions, using as little memory as possible for data locality, avoiding unnecessary reloads of data into registers, avoiding to use the native stack and slow subfunction calls, and reducing the number of jumps with good prediction for frequent cases, minimizing table lookups by using tables that are as small as possible, and an efficient implementation for matching character classes and for the set of active states in the NFA (using bitsets may be slow, but using allocated vectors of integers may be even slower: the matcher must absolutely avoid using allocators, except occasionnally for increasing the stack size needed for rollbacks to previous states if the regexp has to manage non-greedy quantifiers or if matches are not "committed")