apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

REGEX Pattern Search, character classes with quantifiers do not work [LUCENE-9718] #10757

Open asfimport opened 3 years ago

asfimport commented 3 years ago

Character classes with a quantifier do not work: no error is given and no results are returned. For example, \d{2} or \d{2,3}, as commonly written in most languages that support regular expressions, silently fails to match. A user workaround is to write them out in full, such as \d\d or [0-9][0-9], or to use a bracketed class such as [0-9]{2,3}.
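
The workaround above can also be applied mechanically before a pattern is handed to the search layer. The following is a minimal sketch, assuming only the `\d` shorthand needs rewriting. The class and method names (`DigitClassRewriter`, `expandDigitShorthand`) are hypothetical and not part of Lucene, and the code does not try to cover every corner of the regex syntax.

```java
// Hypothetical helper (not part of Lucene): rewrites the \d shorthand into an
// explicit bracketed class so that a trailing quantifier such as {2,3} applies
// to [0-9] instead of to the shorthand.
class DigitClassRewriter {

    static String expandDigitShorthand(String regex) {
        StringBuilder out = new StringBuilder();
        boolean inClass = false; // true while scanning inside [...]
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                // Consume the escaped character together with its backslash,
                // so an escaped backslash followed by 'd' is left alone.
                char next = regex.charAt(++i);
                if (next == 'd') {
                    // Inside an existing class only the range is needed.
                    out.append(inClass ? "0-9" : "[0-9]");
                } else {
                    out.append(c).append(next);
                }
            } else {
                if (c == '[') {
                    inClass = true;
                } else if (c == ']') {
                    inClass = false;
                }
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal "\\d{2,3}" is the pattern \d{2,3}
        System.out.println(expandDigitShorthand("\\d{2,3}"));
    }
}
```

The rewritten pattern can then be passed to whatever builds the query. Whether a rewrite like this is appropriate depends on the application, since it changes the pattern the user typed.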

 

This inconsistency or limitation is not documented, so users waste time figuring it out themselves. I believe it should be clearly documented, and an effort to fix the inconsistency would improve pattern searching.


Migrated from LUCENE-9718 by Brian Feldman (@bgfeldm), updated Feb 02 2021

asfimport commented 3 years ago

Brian Feldman (@bgfeldm) (migrated from JIRA)

/**
 * Lucene/Automaton regex check.
 *
 * @param regex      pattern to compile
 * @param checkValue string that must match the whole pattern
 * @return true if matched
 */
public boolean luceneRegexCheck(String regex, String checkValue) {
   // Equivalent check with the dk.brics.automaton library:
   //import dk.brics.automaton.RegExp;
   //import dk.brics.automaton.RunAutomaton;
   //RegExp re = new RegExp(regex);
   //RunAutomaton ra = new RunAutomaton(re.toAutomaton());
   //return ra.run(checkValue);

   // Lucene version; needs RegExp and CharacterRunAutomaton
   // from org.apache.lucene.util.automaton
   CharacterRunAutomaton automaton = new CharacterRunAutomaton(new RegExp(regex).toAutomaton());
   return automaton.run(checkValue);
}

@Test
void REGEXTEST() { 
   String regex = "[0-9]{2,3}";
   String regexMatches = "11";

   // Lucene Automaton Regex
   assertTrue(luceneRegexCheck(regex, regexMatches), "Lucene Regex Failed to Match");
}

@Test
void REGEXTEST2() {
   String regex = "\\d{2,3}";
   String regexMatches = "11";

   // Lucene Automaton Regex
   assertTrue(luceneRegexCheck(regex, regexMatches), "Lucene Regex Failed to Match");
}
asfimport commented 3 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I guess we have come to expect PCRE in every implementation, but this is not that. By the way, I don't think even Java is totally compatible with Perl. So numeric quantifiers in curly braces should not be expected to work: this is not a PCRE implementation.

Further, the supported syntax is clearly documented in RegExp's javadocs, and there is a pointer there from RegExpQuery:

 * <p>The supported syntax is documented in the {@link RegExp} class. Note this might be different

Did you try raising the issue on one of the mailing lists before opening this issue? That's usually best.

asfimport commented 3 years ago

Brian Feldman (@bgfeldm) (migrated from JIRA)

1) At the user level, upstream in Solr or Elasticsearch, there is limited documentation. When a search system returns neither an error nor results, some users might simply believe that no matches exist, not that their syntax is unsupported. I did not realize it was an issue until I played around with it.

2) Besides documenting the behavior, the code can be improved. Only the initial parsing code would need updating; the logic for running the automaton is not affected. And since there is already code to support the character classes, the parsing code should logically be completed to support trailing quantifiers, finishing the implementation for character classes.
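
To illustrate point 2, here is a toy sketch of the idea that a quantifier can be handled purely at parse time once a shorthand class is treated as just another atom. This is not Lucene's `RegExp` parser and makes no claim about how that parser is structured. The class `ToyRegex` and everything in it are hypothetical, and it supports only literal characters, `\d`, `[a-b]` ranges, and `{n}` / `{n,m}` quantifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Toy illustration only; NOT Lucene's RegExp parser.
class ToyRegex {

    // One parsed unit: a character test plus its repetition bounds.
    private static final class Piece {
        final IntPredicate atom;
        final int min;
        final int max;

        Piece(IntPredicate atom, int min, int max) {
            this.atom = atom;
            this.min = min;
            this.max = max;
        }
    }

    private final List<Piece> pieces = new ArrayList<>();

    ToyRegex(String src) {
        int i = 0;
        while (i < src.length()) {
            // Step 1: parse an atom. \d is just one more kind of atom.
            IntPredicate atom;
            char c = src.charAt(i);
            if (c == '\\' && i + 1 < src.length() && src.charAt(i + 1) == 'd') {
                atom = ch -> ch >= '0' && ch <= '9';
                i += 2;
            } else if (c == '[') {
                // Only the simple form [a-b] is supported in this toy.
                char lo = src.charAt(i + 1);
                char hi = src.charAt(i + 3);
                atom = ch -> ch >= lo && ch <= hi;
                i += 5;
            } else {
                atom = ch -> ch == c;
                i++;
            }
            // Step 2: parse an optional quantifier. The code is the same no
            // matter which branch produced the atom.
            int min = 1;
            int max = 1;
            if (i < src.length() && src.charAt(i) == '{') {
                int close = src.indexOf('}', i);
                String[] bounds = src.substring(i + 1, close).split(",");
                min = Integer.parseInt(bounds[0]);
                max = bounds.length > 1 ? Integer.parseInt(bounds[1]) : min;
                i = close + 1;
            }
            pieces.add(new Piece(atom, min, max));
        }
    }

    // True if the whole input matches the whole pattern.
    boolean matches(String s) {
        return match(0, s, 0);
    }

    private boolean match(int p, String s, int pos) {
        if (p == pieces.size()) {
            return pos == s.length();
        }
        Piece piece = pieces.get(p);
        // Count how many consecutive characters the atom accepts, up to max.
        int n = 0;
        while (n < piece.max && pos + n < s.length()
                && piece.atom.test(s.charAt(pos + n))) {
            n++;
        }
        // Try the longest repetition first, backtracking down to min.
        for (int k = n; k >= piece.min; k--) {
            if (match(p + 1, s, pos + k)) {
                return true;
            }
        }
        return false;
    }
}
```

In this toy, `new ToyRegex("\\d{2,3}")` and `new ToyRegex("[0-9]{2,3}")` accept the same inputs, because the quantifier step never looks at how the atom was written. The matching step is likewise unchanged, which mirrors the claim that only parsing would need to change.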

asfimport commented 3 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Thanks Brian, contributions in those areas would be welcome!