Support of regex in StringAndChars

almondtools / stringsearchalgorithms

String matching algorithms for searching a single or multiple strings in large texts

http://stringsearchalgorithms.amygdalum.net/

GNU Lesser General Public License v3.0

44 stars 4 forks source link

Support of regex in StringAndChars #1

Closed ericruel closed 8 years ago

ericruel commented 8 years ago

Hello

Do you have an idea when the support of regex would be possible I'm very interested in something similar to https://code.google.com/p/esmre/ but implemented in java

we have a lot of regex to run in tons of texts, and it's hard to parse a regex to find the relevent words to be able to identify regex that could match in a text, but generally, it could reduce a lot the number of regex to execute

almondtools commented 8 years ago

I recently worked on a regular expression solution, called com.almondtools.stringsandchars.patternsearch.GPGlushkov. Yet it is not documented, and It's only tested with some simple examples. It was released with the last maven release and you can run it like other StringSearchAlgorithms. Note that not all regex features are supported (no lookaheads, no lookbehinds, no backreferences, yet no submatch extraction).

Some more documentation will be published in the next days.

And note that GPGlushkov is a typo, the correct name would be BPGlushkov (bit parallel glushkov). Later releases will correct this typo (making them incompatible with former versions).

ericruel commented 8 years ago

good to know

I did some tests and I think we can get better performances.

would it be difficult if we have multiple regex like \bperl\b \bvisual.?basic\b \b(cat|dog)s?\b

identify needed words for each regex perl visual and basic cat or dog

and search for those words with the AhoCorasick algorithm... which is really fast :) after running the quick search in a text, we know which regexes that could match and apply the entire regex on the text... even the complex ones with lookahead and all the list of \b\s\W\D\w... etc

also, instread of returning the regex, it would be interesting to return an object attached. let's say we have an id and more info attached to each regex. fastMatch.add(regex1, object1); fastMatch.add(regex2, object2); List<ObjectCustom> result = fastMatch.findAll(text);

I know that maybe this layer should not be in StringsAndChars but rather in another project.

FYI, we have 25K+ regex to apply and I tried to bing them all together (\bperl\b)|(\bvisual.?basic\b)|(\b(cat|dog)s?\b) but at the end, the regex is so complex that it does not perform at all

What do you think?

almondtools commented 8 years ago

Please correct me if I misunderstand your suggestions:

You suggest first to search for anchor expressions and then start regular expression search anchored at these matches? This approach is mentioned in Flexible Pattern Matching in String. Eventually I will implement such an algorithm in the next step, but the implementation is harder then the one of the existing BPGlushkov.

About your suggestion to attach regexes to more detailed information. This is not the scope of StringAndChars. Yet it reminds me of another project of mine rexlex (actually it was the predecessor of StringsAndChars). Look at this piece of code:

Map<String, TokenType> patternToTypes = new HashMap();
patternToTypes.put("a", A); //any match for 'a' will return token type A
patternToTypes.put("b", B); //any match for 'b' will return token type B

DynamicLexer<TestToken> lexer = new DynamicLexer<TestToken>(patternToTypes, REMAINDER, factory); // nonmatched strings will return REMAINDER
Iterator<MyToken> tokens = lexer.lex("abc");
MyToken a = tokens.next(); // == new MyToken("a", A)
MyToken b = tokens.next(); // == new MyToken("b", B)
MyToken c = tokens.next(); // == new MyToken("c", REMAINDER)

A and B correspond to your object1 and object2. In contrast to your solution there is no search done but a parsing of the text. This approach should be quite fast (even for multiple regexes) because it uses a DFA which reads each character of the text just once (no second pass, no backtracking). This approach is probably faster than BPGlushkov but also restricted to non-overlapping simple patterns.

However, did you test your patterns with BPGlushkov? If it was to slow, can you provide a performance test? This would be quite helpful for the development of faster algorithms.

ericruel commented 8 years ago

i'll take a look at rexlex, seems interresting

you can take a look at esmre un python.. maybe you'll understand better what I mean it uses Aho's and Corasick's algo too.

but when it find the essence of a regex, it's limited to only one token so if the regex is (cat|dog) the object is always returned as a potential regex to apply also, if the regex is (cat|dog)s, the object is return as soon as there is a "s" in the text since it's the only part that is always needed in the regex.

maybe it's not the better idea, but if we can find the potential regex to apply with a DFA, and after that use any standard regex library like jregex to take advantage of all the features of the more complex regex(grouping, lookahead...) it would be nice... it's kind of hybrid solulion it's sure would work well on regex with real chars in it like "visual.basic|vb" but not on regex like (\d{3})\s*\d{3}-\d{4} for example

there are some libraries that convert regex to DFA, but the syntax is limited, that's why maybe a hybrid solution could be interesting.

I see cases where I could use that we try to match skills in texts (e.g. python, perl, vb|visual basic), based on a list of regex and each regex has an id... and for each texts, we keep the list of associated skills

also, the nlp team has often to apply a tons of regex on a text either to prepare a text (remove noise) and extracting the information... and often, each regex does not match that often, but overall, it helps to improve the quality in the long tail... all those regex has a huge cost on performances.

that's why a solution like yours (and esmre lookalike) would be interesting since the cost of adding new regex would not be that expensive.

I can try to give you some benchmarks, the the final regex i tried to apply with BPGlushkov was something like \Wperl\W|\Wjaval\W|\Wvisual.?basicl\W|\Wvb\W|.... with 20,000+ skills and the compilation of the regex took really long time... long enough to not wanted to wait the end and stopped the execution...

thanks

almondtools commented 8 years ago

The hybrid approach you mention is already part of rexlex. Yet the performance of this approach did not really convince me. Probably the compilation of the regex with rexlex is bad, too if applied to 20,000+ skills.

After thinking about BPGlushkov: You were probably right to cancel execution, Navarro and Raffinot mention that bit parallel algorithms perform bad with large patterns. They suggest using a filtration approach (similar to your suggestions) to find sets of regexes.

So I understand your problem and I will consider esmre in time.

Thank you for your ideas

almondtools commented 8 years ago

Some further questions:

are posix regexes (no lookahead/lookbehinds, no backreferences) sufficient. At this time you use \b (which is implicitly a lookahead and lookbehind). Do you think a solution without \b is thinkable?
is your skill list public available?

ericruel commented 8 years ago

it's sure that the ultimate solution with all PCRE features would be great... since the regex are created by a team, and we already use jregex that has a full support... it avoid to have to know that the syntax may be different depending on the library used in the background

yes, the list is private, sorry...they are generally very simple... like \bvisual\W?basic\b... but the most complex are like (?<!pass)\W+drug\W+test(ing)?s?

a very simple way, but probably not the best would be to split the regex on tokens everything that is not a sequence of caracters, and removing the lookbehind and stuff like that. and if every tokens are found... even without considering the order, the entire regex can be applied with a full support of the syntax

On the other hand, for the nlp team that try to find another kind of information where the list is not known, they do a lot of regex... very complex and not necessarily optimized like (\bacquisition of\b|\bincluding an?|\bdirector of|experience in|\bfor an?|coordinate with|required to join (?:an?|the)|\bis an?|, an?|\bnew\b|\b(locat|bas)(?:ion|ing|ed) at\b|\b(located|based|client|location) in\b|\bfo?unded by\b|\bmission\b|\bdirected by( the| ab?)?\b|\bfeatured on\b|\b(ranked|named)[^.]{1,40}\bby\b|\b(skilled|experienced|proficiency|proficient) (with|in)\b|\btraining for(?: an)?\b|\bleader in\b|\b(family|(?<!manage a |versee a |..lead a )group|team) of\b|\bsubsidiary of( the)?|\bparent company, |\bis affiliated with |\bhome of( the|an?)? |\bclean record on |\b[Ww]ell known|\bcollaborate[sd]? with |\bassists |\bnot(?: an)? employment with |\bshare in |\bmember of |\bdivision of |\bits |\bleader in |\baccountable for |(?: suburb of| metro(?:politan)?| west(?:ern)?(?: central)?| east(?:ern)?(?: central)?| north(?:ern)?(?: central)?| south(?:ern)?(?: central)?| central) )

But maybe this one can be address with a rexlex or re2j and the list of tokens to find if the regex has chances to match would be complex for the algo we are currently talking. But note that we have a lot of regex like this.

almondtools commented 8 years ago

My vision would be a generic algorithm that can be configured with an instance of PatternMatcher, similar to this one:

interface PatternMatcher {
  List<String> getPrefixes(int min, int max);
  List<StringMatch> match(CharProvider chars, boolean longest);
}

This interface can be implemented using jregex (or any other pattern matching framework):

match(CharProvider chars, boolean longest) does the regex matching
getPrefixes is harder since it must collect all legal prefixes of your regex (an upper bound is ok)

The Approach:

Before searching we query the patternMatcher for prefixes, e.g. getPrefixes(2, bestPrefixLength)
We search with multi string matching (e.g. AhoCorasick) for such a prefix and mark its occurence, this would be probably configurable, too
Then we check the occurence by calling match(chars, true)

This is a modification to the esmre algorithm such that it is not limited to single tokens:

getPrefixes(2,4) on cats|dogs will return ["cats","dogs"]
getPrefixes(2,4) on visual.?basic will return ["visu"]

Yet I have restricted the algorithm to prefixes, the more general one would also allow good suffixes or infixes, but a PCRE is not easily processed backwards, which would be necessary if we allow getFactors (the new getPrefixes) to return the most unique factor of some length.

Do you think this could match to your problem?

ericruel commented 8 years ago

I don't understand the need of the min and max length... is the longest not always best?

    PatternMatcher stringSearch = PatternMatcher();
    Skill skill1 = new Skill(1,"Visual Basic", "\\b(visual\\W?basic|vb)\\b",Pattern.CASE_INSENSITIVE);
    Skill skill2 = new Skill(1,"Drug Test", "\\b(?<!pass)\\W+drug\\W+test(ing)?s?\\b",Pattern.CASE_INSENSITIVE);

    //tokens should be identified at this time

    boolean tooComplexAndWillbeAppliedEveryTimes = stringSearch.add(skill1.getRegex(), skill1);
    stringSearch.add(skill2.getRegex(), skill2);
    CharProvider text = new StringCharProvider("I code in VISUAL-BASIC and i don't pass drug tests", 0);

    //find tokens from regex and identify relevant regex, and apply them to validate the entire regex

    StringFinder finder = stringSearch.createFinder(text, MatchOption.LONGEST_MATCH);
    List<Skill> all = finder.findAll();

the list should contains only skill1

tooComplexAndWillbeAppliedEveryTimes is only to be able to know the behavior and maybe change the thresholds, or explode de regex in multiple more simple regex maybe it can be a method like stringSearch.getTooComplexRegexes()

note that esmre does not apply the entire regex... it only returns the list of objects... if at least that part can be done, it would be great.

one thing that can be done better than esmre is that like I said, for a regex like cat|dog, since there is a pipe, the object is returned to be applied every time... so if we you can support a list of token for a regex, it would address that... also, if the regex is (cat|dog)s... it will match if the text contains a "s"... so. every time.. , that would be nice if there is a kind of algo like cat or dog is more deterministic than simply "s" but maybe a threshold should be present, that if there is more than X tokens for a regex... just return it anyway to be applied every time... let's say a regex with all US states (AK|AL|AR|AS|AZ|CA|CO...), maybe it can be faster in some cases.

and also, maybe for each regex, the list of tokens can be represented by a list of list (cat|dog) =>[["cat"], ["dog"]] visual.?basic|vb => [["visual", "basic"], ["vb"]]

and when all the tokens from a sublist are met, the regex is potential, it will address the case of searching only for visual and not for basic...

but I understand that it's probably not that easy, if it was easy, I would try to represent the data like with some threshold on the number of tokens

almondtools commented 8 years ago

I don't understand the need of the min and max length... is the longest not always best?

You are right by assuming that the longest is the best. But having 25k patterns will perhaps result in patterns that may match really small words. Their prefix can only be as long as their shortest match. IN the edge case, it is possible to add patterns which could match 0 or 1 character. One single pattern (that was maybe accidently inserted) will lead inevitably to performance degradation (exponential if 0, and superlinear if 1).

note that esmre does not apply the entire regex... it only returns the list of objects... if at least that part can be done, it would be great.

You mean the skills attached to the pattern, correct?

one thing that can be done better than esmre is that like I said, for a regex like cat|dog, since there is a pipe, the object is returned to be applied every time...

The above approach should solve this problem.

ericruel commented 8 years ago

If I understand well, it's sure that a regex like \w+s or for matching a phone number should not even included in the list of tokens to be match against the text with StringsAndChars and that case, yes, a minimum limit is interresting, but, I suggest to keep these regex on the side with the TooComplexRegexes and should be appended to the list of potential regex found be executed with a jregex or something like that... the goal is to limit the number the regex execute as few as possible... but there is no magic solution.

for the second question, yes... the regex (cat|dog)s may match a text because there is "s" somethere in the text, it returns the associated object... for us, it's a Skill that contains also the regex, which allows us to reapply the entire regex on the text... but the execution of the list of the regex is not done in esmre... it's our code that does that. maybe it's ok to let your code only returns the potential regexes, since in some cases, we may want to get the content of grouping, do a replacement by something else, or loop on each occurrences... if we let the PattermMatcher search for the keyword "s" apply the regex to validate that it really matches, and after that, I need to replace "dogs" by "animal"... it would not be efficient since the regex would be apply twice...

ericruel commented 8 years ago

maybe the PatternMatcher should returns a result with methods like result.getTooComplex() result.getNoQuickSearchAvailable() and result.getAllPotentials() that returns the 2 aboves + the real potentials

ericruel commented 8 years ago

I took rapidly a look to your code and I see that you parse the regex by yourself... is there a reason why ? I understand that it may be hard to parse a regex and support the full syntax I just found project that may be interesting for you for parsing the regex https://github.com/bkiers/PCREParser

let me know what do you think about it

almondtools commented 8 years ago

Maybe the min length can solved be better, I will rethink it.

I do not see the need of tooComplexRegex etc. with my approach:

if each PatternMatcher (regex) can determine its valid prefixes of length l
and each PatternMatcher can validate a given input (starting with a valid prefix) to its pattern (regex)

then there is no regex that could be too complex. This holds for posix regexes, but with some modifications also for PCREs, for example:

\\b(?<!pass)\\W+drug\\W+test(ing)?s?\\b

For the prefix this pattern is simplified. The simplification has following steps:

remove all lookaheads/lookbehinds => \\W+drug\\W+test(ing)?s?
remove all leading dont-cares => drug\\W+test(ing)?s?

The pattern matcher transforms it to its valid prefixes (e.g. with max length 4) which are: ["drug"]. Now Aho-Corasick runs and reports every occurence of "drug" to your matchers match method. The match method, knowing that the prefix was part of the simplified pattern, will first determine the position of the whole pattern (by reverse processing the skipped dont-cares and lookbehinds) and then check this pattern with the complete PCRE whether it is valid. For example match:

is called with chars ...pass drug| test... (the pipe is the current position)
sets position before the prefix ...pass |drug test... (the prefix is known to the matcher)
then reverse processes the dont-cares ("\\W") ...pass| drug test...
then reverse processes the lookbehind ...|pass drug test... (going back the length of the lookbehind)
the beginning of the full pattern is determined, now some heuristics should be used to determine the end of window to search in
then calls find on the window and fails (since "pass" is invalidating the match)

Now this is not really simple, yet with some heuristics a workable solution should be feasible.

So the JRegexPatternMatcher implementation will have to implement following tasks:

simplify the regex by removing lookaheads/lookbehinds/back references and leading dont-cares
determining the valid prefixes of the simplified regex (this affords to parse and interprete the regex, e.g. with PCREParser)
at matching time a found prefix must be extended to a window where to search the pattern
at matching time given a window the pattern must be found/validated

I took rapidly a look to your code and I see that you parse the regex by yourself... is there a reason why ?

Yes there are several reasons:

Requirements Mismatch: My focus is on efficient search and that is only possible with a DFA (requiring a posix regular expression and not a PCRE), I am not aware of a parser for posix regular expressions in java (and I did not search for them becaus of the other reasons)
Performance - using ANTLR for such a simple task is really overhead
Flexibility - a third party parser is not changeable to the features I need or want to add
Additional Dependency - maven dependencies can cause problems if brought in with different versions (if your application uses Version 2, but my framework Version 1, this would lead to bugs that I do not want to analyze).

ericruel commented 8 years ago

the tooComplexRegex was only an easy exit in some cases... (these|those)\s?(cat|dog)s? what would you search for? I guess that since there is no valid prefix, the regex will be apply every time? which is acceptable, and if the user don't want that, he should create more simpler regex maybe.... or I thought it could be useful when the regex returns soo many prefix or no prefix... mostly for debugging if we want to improve the performances and rewrite the regex... but if there is no need, it's better.

After reflexion,I'm not sure that those parts should be done in the library at matching time a found prefix must be extended to a window where to search the pattern at matching time given a window the pattern must be found/validated

it's makes sense only if we only what to know if it matches or not, but what if we want to replace, loop on every occurrences, get the content of the first, second group.... outside the library, we have to apply the regex again...

almondtools commented 8 years ago

(these|those)\s?(cat|dog)s? what would you search for? I guess that since there is no valid prefix, the regex will be apply every time?

The prefix approach is a bit more clever. You may have notices that PatternMatcher may return multiple prefixes. So given (these|those)\s?(cat|dog)s?

the length 2-prefix is ["th"]
the length 4-prefix is ["thes","thos"]
the length 6-prefix is ["these ","those ", "these\t","those\t","these\r","those\r", "these\n","those\n"]
...

AhoCorasick can match multiple strings and so we attach the same pattern to different prefixes. And if some patterns share the same prefix, even more then one pattern matcher can be attached to a prefix.

or I thought it could be useful when the regex returns soo many prefix or no prefix... mostly for debugging if we want to improve the performances and rewrite the regex... but if there is no need, it's better.

I agree, but this need not be in the framework. Checking whether a regex is to complex can be done outside. Your advantage: You will not be dependent on a third party point of view on what regex is to complex and what information to return when so.

After reflexion,I'm not sure that those parts should be done in the library at matching time a found prefix must be extended to a window where to search the pattern at matching time given a window the pattern must be found/validated

I agree with you - the implementation of this part is not part of the library. But the callbacks to the match function are.

If you do not need this the problem gets far simpler:

just compute the regex best factors (prefixes in my case, all path tokens in yours) in your program
search it with Aho-Corasick (to map Matches back to their regexes a simple HashMap would be sufficient)
and extend the matches again in your program

it's makes sense only if we only what to know if it matches or not, but what if we want to replace, loop on every occurrences, get the content of the first, second group.... outside the library, we have to apply the regex again...

I disagree:

given a pattern without matching groups (submatches) there should be no problem to replace. Replacing a regex is more expensive than StringBuilder.replace(int from, int to, String replacement)
given a pattern with submatches, where a submatch should be replaced ... is not easy if the default StringMatch class is used. Yet I plan to return PatternMatch extends StringMatch in case of regex searches. A PatternMatch does not only contain the start and end of the match but also each submatch (with start and end). So even submatch replacement could be done on the fly.

ericruel commented 8 years ago

sure that for a single replacement, it can be done, but I mean, what if we want to do loop on every occurrence of a match in a text...

it's not clear yet for me the form all of this will take, but I trust you.. anyway, I think you want to do more than I need, so I won't complain ;)

almondtools commented 8 years ago

Currently I am preliminarily done. I fear that most work will have be done on your own. The search algorithm that can be used is MultiFactorRE. Some documentation can be found here http://almondtools.github.io/stringsandchars/.

This algorithm composes two sub algorithms:

a string search algorithm to search for small but constant factors of the pattern
a factor extender that can extend a match of a this algorithm to a complete match (or none if none exists)

To get your requirements done:

For string search you can use AhoCorasick.Factory
For a factor extender you can choose GlushkovPrefixExtender or GlushkovFactorExtender (both operating on posix regex syntax) or you implement your own subclass of FactorExtender (api doc yet not released but in source code), supporting PCREs.

The main tasks in implementing would be:

implement getBestFactor(int max): you will have to analyze your regex and then return the best factors
implement extendFactor (each result of getBestFactor) should be supported