dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License
10.19k stars 1.57k forks source link

Numeric and named subroutines in regex #48978

Open FMorschel opened 2 years ago

FMorschel commented 2 years ago

A common real-world use is to match a balanced set of parentheses. But for example, inside a mathematical equation, we can have multiple nested sets of parentheses.

So in this case, a subroutine would be a really nice feature, we could call it and be more specific in what we want. Pearl compatible Regular Expressions already have that.

Say we have (\(.*\)). This gets us a set of parentheses and anything inside it. We could do something like (?<G>\([0-9]\+([0-9]|(?&G))\)) which would recursively get (_digit_+(_digit_+(_digit_+_digit_))) for example.

lrhn commented 2 years ago

Dart is matching JavaScript RegExps because that makes it possible to compile to efficient JavaScript and keep the same semantics. Unless JavaScript adds RegExp-procedures to their implementation, I don't think Dart will either.

Fair warning: Having named procedures that can be recursive means that the language recognized is no longer regular. It can instead be context-free, like the languages of normal "language grammars". The regexp implementation effectively becomes a push down automaton instead of a finite state machine.

(That's not a reason not to do it anyway, just a warning that it might not be as efficient as you'd hope. Also, the current RegExp languages are not regular, or even context free, because of unlimited back-references. The language recognized by RegExp(r"^([ab]*)\1$") is not context free.)