BeRo1985 / flre

FLRE - Fast Light Regular Expressions - A fast light regular expression library
GNU Lesser General Public License v2.1
94 stars 23 forks source link

Multiple RE match? #1

Closed zunzster closed 8 years ago

zunzster commented 9 years ago

Hi Benjamin

I'm considering trying out BRRE and now FLRE since a fast thread safe native Delphi library based on RE2 sounded awesome. It looks like a very nice piece of work!

Given that your code is based on Cox's RE2, I wondered if you had done any work on supporting the "single string against multiple regexs" in linear time use-case which is referred to here http://fulmicoton.com/posts/multiregexp/

By way of background, I'm looking at building a REST Web service framework in Delphi and I'll need to match the inbound URL against a collection of URI templates to determine the handler for it.

e.g.

/api/customers -> handler 1 /api/customers/{:id} -> handler 2 /api/orders -> handler 3 /api/orders/{:id} -> handler 4

It's trivial enough to generate an appropriate regex from each of the URI templates and then to iterate through them (sorted appropriately such that longest wins) and trying to matching each in turn.

You can see that a multiregexp that tries to match against them all at once and determine the longest match would save lots of repeated matching work for the prefixes. You'd want to have it return which regex 'won' and be able to extract the relevant match groups that contain the URL parameters.

Is this something FLRE can do?

Thanks in advance, Paul.

BeRo1985 commented 9 years ago

The code is not based on the RE2 code itself, but rather the basic ideas are based on them of RE2.

Whether multiple regexp at once would be possible, I need to check it first, because FLRE even more optimizations used than RE2, eg approximate heuristic regular expression prefix matching based on SBNDMQ2 as prefilter (and a bit more other stuff), which I've actually designed and implemented primary for approximate heuristic single regular expression pattern matching.

If so, then I would probably implement it this way:

regexp1|(NumberedGroupReset)regexp2|(NumberedGroupReset)regexp3|(NumberedGroupReset)regexp4|...

Even for that, I need to check it first, if it would be possible with the current FLRE code structure, since FLRE abuses non-capture-but-through-internally-captured groups for simulating support for back-references without the need of a slow backtracking algorithm, where numbered group resets would be possibly problematic.

zunzster commented 9 years ago

Ah, right - that makes sense.

Essentially, compiling them as alternates with additional 'reset group' byte codes to return only the groups for the winning alternative sub-rule from the match VM. A clean and simple approach.

And my apologies, I didn't mean 'cribbed' from re2 but rather inspired by Cox's design/articles.

The optimisations you've made sound interesting too - I wasn't aware of SBNDMQ2 but I've found a nice paper about the various techniques - http://arxiv.org/pdf/1012.2547.pdf

BeRo1985 commented 9 years ago

FLRE do support multi regexp patterns now, see TFLRE.Create(const ARegularExpressions:array of ansistring;const AFlags:TFLREFlags=[]); constructor

But it it still a bit experimental yet, because after the DFA pass must runned a NFA verification pass at the moment yet, since the DFA finds only the single whole multi-matches boundary, and the followed NFA pass the individual patterns then. But I'll see in the next days, if I can get the individual pattern boolean detection combined into the DFA pass itself.

At least it seems that it is workingin this current DFA+NFA-combination solution already, but subsequent tests are still needed.

The result capture array (for Match, MatchNext and MatchAll) has then the followed format, the first pair is the normal match start&length pair, and the followed (second, third, etc.) has the followed format, start is the "regexp pattern index" , and length is the found boolean value (0 for pattern not found and 1 for pattern found).

zunzster commented 9 years ago

Wow - that was quick! I'll have to get started on my URL matcher prototype and see how it rolls.

BeRo1985 commented 9 years ago

And here some benchmarks results against regexp engines from the C/C++ world => http://vserver.rosseaux.net/stuff/flre.html and from the Pascal world => https://gist.github.com/BeRo1985/9288f21c9216b98b562e

data-man commented 9 years ago

Yet another Delphi library: https://github.com/shukomiya/skregexp

Can you update your benchmarks?

BeRo1985 commented 9 years ago

sorry, but skregexp is also much slower than FLRE, and seems to be buggy due to a different match count at a test regexp in comparsion to FLRE, PCRE, RE2, etc. see:

https://gist.github.com/BeRo1985/38b6d43e08922eb2c66f#file-flrevsskregexpw-txt