Wildcards, ORs etc inside Phrase queries [LUCENE-1486]

asfimport commented 15 years ago

An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.

The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include:

    checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
    checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
    checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.

    checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a phrase is bad
    checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
    checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported

Code plus Junit test to follow...

Migrated from LUCENE-1486 by Mark Harwood (@markharwood), 13 votes, resolved Mar 16 2014 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch (versions: 7), Lucene-1486 non default field.patch, TestComplexPhraseQuery.java Linked issues:

- 2641
- 2898
- SOLR-1604
- 6269
- SOLR-7466
- SOLR-1604

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Junit test

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

QueryParser extension

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Added tests for range queries and plain PhraseQueries

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Fixed bug with plain phrase query, added support for range queries

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

(Added 2.9 fix version in addition to 2.4.1).

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Added support for "Nots" in phrase queries e.g. "-not interested"

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

More tests for Nots

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Updated to cater for phrase clauses that produce no matches

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Updated Junit test to test for phrases with clauses that produce no matches

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

What do you think about this for 2.9 Mark H?

The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax.

That leads me to think we might want to push to 3.0? Or have you moved beyond that with all of these updates?

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Perhaps "hacky" was too strong a word. I think it's a reasonable approach to handling the complexity involved in this logic.

A colleague of mine has this running in production on a big installation with lots of users

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Is there some reason not to include this in QueryParser instead? Ie, it accepts a superset of QueryParser's current syntax?

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

The primary reason (and perhaps not a particularly good one) was I didn't want to wade around in the Javacc syntax of the .jj file that generates the QueryParser and the required extensions could be made in a subclass.

Also there is invariably a performance hit for supporting things like wildcards in phrase queries so rather than adding another "off by default" flag in the main parser and conditional logic to test if "wildcards etc in phrases" are allowed, the subclass could be seen as a specialised extension that is to be used by those that understand the trade-offs between functionality and performance.

I can sympathise with the purist approach of having all parser syntax defined in Javacc though.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Should this go in contrib rather than core? That seems to have been the approach so far, any reason to vary it up here?

Well, actually, looks like I see the multi field parser in core. Makes sense to put subclasses there I guess.

You think this is ready to commit Mark? If so, I should be able to review it (unless you want to commit it yourself).

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Reformatted to lucene formatting, removed author tag, removed a couple unused fields, changed to patch format

Tests don't pass because it doesn't work quite correctly with the new constant-score multi-term queries yet.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Hey Mark, this doesn't work correctly with the new constant score mode. I'm hesitant to put something in core that only works with boolean expansion.

I'm not sure what needs to be done (I started and realized my interest wasn't high enough). Could you update this? Otherwise I'm tempted to push off to 3.0...

Unless another brave soul steps up, of course. Or I may jump back in - my brain is fickle.

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Added fix for ConstantScoreQuery changes

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

The fix was relatively straight-forward from what I could see. Just temporarily unset the QueryParser's ConstantScoreRewrite mode when performing the pass that is just evaluating query elements inside phrase queries. These clauses need to resolve to traditional BooleanQuery-full-of-termQueries in order that they can be inspected and rewritten as Span equivalents for complex phrases.
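The fix described above is essentially a save/disable/restore pattern around the inner parsing pass. A minimal sketch follows; `RewriteModeSketch` and its members are hypothetical stand-ins for illustration, not Lucene's actual QueryParser API:

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a parser that has a "constant score rewrite" switch. */
class RewriteModeSketch {
    private boolean constantScoreRewrite = true;

    boolean isConstantScoreRewrite() {
        return constantScoreRewrite;
    }

    /**
     * Runs the inner pass with constant-score rewriting switched off, then
     * restores whatever mode the caller had configured, even if the pass throws.
     */
    <T> T withBooleanExpansion(Supplier<T> innerPass) {
        boolean saved = constantScoreRewrite;
        constantScoreRewrite = false; // inner clauses must expand to plain term clauses
        try {
            return innerPass.get();
        } finally {
            constantScoreRewrite = saved;
        }
    }
}
```

The `finally` block matters here: if the inner pass fails, the parser should not be left stuck in the temporary mode.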

Should do the job.

Cheers Mark (Been far too busy with other things and missing getting my hands dirty here with Lucene!)

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Figured that's all it would take. I was just feeling a bit too lazy to try and understand the whole class after I put it up in front of me for a few seconds :) Figured I'd try and pawn off a piece. I made some adjustments to the patch last time, but they were basically cosmetic.

Looks like I didn't escape much work this time though - I'll review and commit shortly.

Thanks a lot.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Whoops - almost let some 1.5 slip by: throw new IllegalArgumentException(pe.getMessage(), pe) is not in 1.4.
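For reference, one 1.4-compatible way to keep the cause attached is `Throwable.initCause`, which has been available since Java 1.4. This is a sketch of the general idiom, not necessarily what the final patch does; the class and method names are made up for illustration:

```java
/** Wraps an exception without the (String, Throwable) constructor added in Java 5. */
class CauseChainingSketch {
    static IllegalArgumentException wrap(Exception pe) {
        IllegalArgumentException iae = new IllegalArgumentException(pe.getMessage());
        iae.initCause(pe); // attach the original exception as the cause
        return iae;
    }
}
```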

Last patch. I'll commit later today.

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Hi Mark, Mind if I try committing this patch? I've just switched from PC to Mac and my dev environment is all changed (Subclipse vs TortoiseSvn etc) and I wouldn't mind checking my config and commit rights still work in this new environment. If anyone has any mac/subclipse-related "gotchas" I should be aware of, do let me know.

Cheers Mark

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Please, by all means ! :)

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Committed in 791579 - http://svn.apache.org/viewvc?rev=791579&view=rev

asfimport commented 15 years ago

Adriano Crestani (migrated from JIRA)

Hi,

I'm trying to understand what kind of syntax this query parser supports. I read the code and it does not say much. Is there any documentation (wiki, javadoc, etc) that specifies the syntax? Because it's not clear for me.

Thanks in advance, Adriano Crestani Campos

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

You might check the test class - it has a few basic examples. It's not much different from what's posted in the summary:

Just experiment.

    checkMatches("\"john smith\"", "1"); // Simple multi-term still works
    checkMatches("\"j* smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
    checkMatches("\"(jo* -john) smith\"", "2"); // boolean logic works
    checkMatches("\"jo* smith\"~2", "1,2,3"); // position logic works.
    checkMatches("\"jo* [sma TO smZ]\" ", "1,2"); // range queries supported
    checkMatches("\"john\"", "1,3"); // Simple single-term still works
    checkMatches("\"(john OR johathon) smith\"", "1,2"); // boolean logic with brackets works.
    checkMatches("\"(jo* -john) smyth~\"", "2"); // boolean logic with brackets works.
    // checkMatches("\"john -percival\"", "1"); // not logic doesn't work currently :(.
    checkMatches("\"john nosuchword*\"", ""); // phrases with clauses producing empty sets
    checkBadQuery("\"jo* id:1 smith\""); // mixing fields in a phrase is bad
    checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad

asfimport commented 15 years ago

Adriano Crestani (migrated from JIRA)

Thanks for the quick response Mark!

OK, I'm trying now to figure out what is supported reading the junits only, and I ran into some issues:

What do you mean on the last check by phrase inside phrase, I don't see any phrase inside a phrase (I'm not sure either what it would be, because there is no open and close phrase delimiter), all I see is a phrase <"jo*">, followed by a term <smith> and an empty phrase <" ">. And the check passes because the query parser throws an exception complaining about the empty phrase, which seems not to be supported. I just changed the empty phrase to a valid phrase and the query works (failing the test case). But as I said, I'm not sure what exactly you were trying to do there; could you give me more explanation about that?

I'm also getting a java.util.ConcurrentModificationException when I type an escaped double quote inside a phrase. So, I suppose it's not supported, but shouldn't it throw a better exception?

I also have an issue with the parse exceptions, if it comes from inside a phrase, it does not tell the correct position in the query string. I think it considers the beginning of the phrase as the beginning of the query and it only prints the phrase that contains the problem.
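To illustrate the position problem: if the phrase body is parsed on its own, any error offset is relative to the phrase and has to be shifted by the phrase's start position in the full query. The sketch below uses the JDK's `java.text.ParseException` as a stand-in (Lucene's ParseException is a different class), and `parsePhraseBody` is a hypothetical inner check invented for the example:

```java
import java.text.ParseException;

class PhraseOffsetSketch {
    /** Hypothetical inner check: rejects a field prefix such as id:1 inside a phrase. */
    static void parsePhraseBody(String body) throws ParseException {
        int colon = body.indexOf(':');
        if (colon >= 0) {
            throw new ParseException("field not allowed inside phrase", colon);
        }
    }

    /** Parses the phrase body found at [start, end) of the full query, fixing up offsets. */
    static void parsePhraseAt(String query, int start, int end) throws ParseException {
        try {
            parsePhraseBody(query.substring(start, end));
        } catch (ParseException inner) {
            // Shift the offset from phrase-relative to query-relative and
            // report the whole query, not just the phrase.
            ParseException outer = new ParseException(
                inner.getMessage() + " in query: " + query,
                start + inner.getErrorOffset());
            outer.initCause(inner);
            throw outer;
        }
    }
}
```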

I'm attaching some changes I did in the TestComplexPhraseQuery junit that shows these problems I'm getting, I think it's easier to understand if you read and run it.

Sorry for so many questions, but I'm just trying to understand what exactly this query parser supports or not.

Thanks, Adriano Crestani Campos

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

You may have to wait for the author, Mark Harwood to respond. I just reviewed the issue. A couple points though:

What do you mean on the last check by phrase inside phrase, I don't see any phrase inside a phrase (I'm not sure either what it would be, because there is no open and close phrase delimiter), all I see is a phrase <"jo*">, followed by a term <smith> and an empty phrase <" ">

Its kind of a phrase within a phrase (though the "smith" phrase could be turned into a term query) - unescaped: "jo* "smith"" - the full thing is phrase one, and smith is the inner phrase (though yes, only a term in the phrase).

If Mark Harwood doesn't have time to answer soon, I'll dig in more and respond to your other questions/comments.

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

Looking at the problems Adriano is seeing it almost seems like this was a bit prematurely committed? It seems like a lot of queries you could enter here are not really supported and might throw strange exceptions.

Maybe it should live in contrib for now (with experimental warnings)?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I originally thought it might live in contrib as well (see above), but I'm personally fine with it being in core.

It seems like a lot of queries you could enter here are not really supported and might throw strange exceptions.

A lot of queries? I think Adriano is just having trouble with phrases inside phrases, which is unsupported. Other things that are not supported might throw exceptions too, but I think that's to be expected? I see what Adriano was talking about now - technically the first 2 quotes would match, and then the second two - I think Mark H was just demonstrating that you shouldn't try that query though - a user might think they are quoting smith, but for the example, it doesn't matter. I think he was just trying to show that you shouldn't try and "nest" phrases - even though they wouldn't be interpreted that way anyway.

It only supports a limited subset of the Lucene query language - perhaps we could improve the exceptions being thrown, but the exceptions the queryparser throws often leave just as much to be desired. I don't think its experimental because of that.

Personally, I think the class does what it intends - allows a limited subset of the Lucene query language in phrases. Though of course it could be improved.

I'll let Mark H respond though. I also don't mind seeing it moved to contrib, but I'm not sure anything glaring points to it being moved at the moment. It lives up to its limited contract I think.

asfimport commented 15 years ago

Adriano Crestani (migrated from JIRA)

I see what Adriano was talking about now - technically the first 2 quotes would match, and then the second two - I think Mark H was just demonstrating that you shouldn't try that query though - a user might think they are quoting smith, but for the example, it doesn't matter. I think he was just trying to show that you shouldn't try and "nest" phrases - even though they wouldn't be interpreted that way anyway.

Well, if you guessed his intention correctly, the comment is misleading: "phrases inside phrases is bad". But lets wait for his response.

Other things that are not supported might throw exceptions too

I think a user would expect a ParseException. Probably every query parser user catches ParseException and shows a nice message to the final user. Now, if the query parser starts throwing random exceptions to say the syntax is invalid, every piece of software that uses the Lucene query parser is gonna start crashing. For me it's as if a compiler started throwing segmentation faults every time you forget a } in the code.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I think a user would expect a ParseException. Probably every query parser user catches ParseException and shows a nice message to the final user. Now, if the query parser starts throwing random exceptions to say the syntax is invalid, every piece of software that uses the Lucene query parser is gonna start crashing. For me it's as if a compiler started throwing segmentation faults every time you forget a } in the code.

That's a fair point - addressable though - we can likely catch and rethrow in the worst case.
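A rough sketch of the "catch and rethrow" idea: unchecked exceptions escaping the parse step are converted into the checked exception callers already handle. It uses `java.text.ParseException` as a stand-in for Lucene's own ParseException, and the `ParseStep` interface is hypothetical:

```java
import java.text.ParseException;

class RethrowSketch {
    /** Hypothetical parse step that may fail with an arbitrary unchecked exception. */
    interface ParseStep {
        String run(String query);
    }

    /** Converts any RuntimeException from the step into a checked ParseException. */
    static String parse(String query, ParseStep step) throws ParseException {
        try {
            return step.run(query);
        } catch (RuntimeException e) {
            ParseException pe = new ParseException("Cannot parse '" + query + "': " + e, 0);
            pe.initCause(e); // keep the original failure for debugging
            throw pe;
        }
    }
}
```

This keeps application code that only catches the parse exception from crashing, at the cost of a less specific error message.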

I'll admit, the ... non exactness ... of this parser troubled me at first - one of the reasons I liked contrib as a landing spot early on. I took it for what it is in the end I suppose. I think the shortfalls brought up so far can be addressed to a large degree though.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Well, if you guessed his intention correctly, the comment is misleading: "phrases inside phrases is bad". But lets wait for his response.

I think that's a bit of a judgement call. We know that the way the query is parsed, you cannot really ever do "phrases inside phrases". However, a user of this parser might think that, like the other syntax, perhaps you can use "phrases inside phrases" - and if you thought that, the example given is likely how you'd imagine it to work. The outside phrase, and then the inside phrase. I certainly agree some comments would clear it up, but I think it's a useful example.

asfimport commented 15 years ago

Adriano Crestani (migrated from JIRA)

I'll admit, the ... non exactness ... of this parser troubled me at first - one of the reasons I liked contrib as a landing spot early on. I took it for what it is in the end I suppose. I think the shortfalls brought up so far can be addressed to a large degree though.

I think contrib would be a good place for now, until it gets more stable and better documented.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I think contrib would be a good place for now, until it gets more stable and better documented.

If Mark H thinks it should be moved, I won't disagree. But I still don't see a convincing reason. It could use some more documentation, but so could quite a few other classes in core. It's something of a subjective call, and more importantly, it can be addressed now.

I'm not yet convinced it's unstable - the only major issue I see so far is the exception issue - but that wouldn't seem to prompt a move to contrib, but an update to address the concern. Moving to contrib is always an option, but I don't think it's the default move based on what's been brought up. The standard move would be to address whatever issues are brought up ... so far I am just seeing the exception issue as a large one, and I think that is fairly easily addressable.

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

It only supports a limited subset of the Lucene query language - perhaps we could improve the exceptions being thrown, but the exceptions the queryparser throws often leave just as much to be desired. I don't think its experimental because of that.

Because it only supports a limited subset of the language, I feel like we could have taken a different approach here? Why not add the features that are supported and make sense to the main query parser?

The documentation does not tell me what is supported and what is not currently. And looking through the code some methods now throw RuntimeExceptions, because the overridden methods themselves don't throw anything. These things feel a bit unfinished.

I'm not saying these issues are not fixable. But maybe we should rethink the design. My biggest concern is that this new parser doesn't seem to have a well-defined syntax. So since it doesn't check if a query is actually valid or not, it might be hard to maintain. E.g. if you add new language features to the main QP, it's currently not defined what will happen if you use them with this one.

That's why I'm proposing to move it to contrib and mark it as experimental. Then we have more time to decide if the approach of adding the new features to the main QP makes more sense.

asfimport commented 15 years ago

Luis Alves (migrated from JIRA)

I share the same opinion as Michael: the implementation has a lot of undefined/undocumented behaviors, simply because it reuses the queryparser to parse the text inside a phrase. All the Lucene syntax needs to be accounted for in this design, but that does not seem to be the case.

Problems like the ones Adriano described: phrase inside a phrase, position reporting for errors.

I also have a lot of concerns about having the full Lucene syntax inside phrases, and trying to restrict this by throwing exceptions for particular cases does not seem the best design.

Here is an example with OR, AND, parentheses and a proximity search:

    "(( jakarta OR green) AND (blue AND orange) AND black~0.5) apache"~10

What should a user expect from this query without looking at the code? I'm not sure. Does it even make sense to support this complex syntax? In my opinion, no.

I think we should define the subset of the language we want to support inside phrases, with well-defined behavior. If Mark describes all the syntax he wants to support inside phrases, I actually don't mind implementing a new parser for this.

My view is, contrib is probably a better place to have this code, until we figure out an implementation that does not impose as many restrictions on changes to the original queryparser and describes a well-defined syntax to be applied inside phrases.

asfimport commented 15 years ago

Luis Alves (migrated from JIRA)

I added 2 testcases that return doc 3. These queries do not make much sense, I added it just to prove the point that we need more information describing the use case for complex phrase qp. We also should define a subset of the supported syntax we want to support inside phrases, with well defined behaviors.

    checkMatches("\"(goos~0.5 AND (mike OR smith) AND NOT ( percival AND john) ) vacation\"~3","3"); // proximity with fuzzy, OR, AND, NOT
    checkMatches("\"(goos~0.5 AND (mike OR smith) AND ( percival AND john) ) vacation\"~3","3"); // proximity with fuzzy, OR, AND

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

I'll try and catch up with some of the issues raised here:

What do you mean on the last check by phrase inside phrase, I don't see any phrase inside a phrase

Correct, the "inner phrase" example was a term not a phrase. This is perhaps a better example:

    checkBadQuery("\"jo* \"percival smith\" \""); //phrases inside phrases is bad

I'm trying now to figure out what is supported

The Junit is currently the main form of documentation - unlike the XMLQueryParser (which has a DTD) there is no syntax to formally capture the logic. Here is a basic summary of the syntax supported and how it differs from normal non-phrase use of the same operators:

- Wildcard/fuzzy/range clauses can be used to define a phrase element (as opposed to simply single terms)
- Brackets are used to group/define the acceptable variations for a given phrase element e.g. "(john OR jonathon) smith"
- "AND" is irrelevant - there is effectively an implied "AND_NEXT_TO" binding all phrase elements

To move this forward I would suggest we consider following one of these options:

1) Keep in core and improve error reporting and documentation
2) Move into "contrib" as experimental
3) Retain in core but simplify it to support only the simplest syntax (as in my Britney\~ example)
4) Re-engineer the QueryParser.jj to support a formally defined syntax for acceptable "within phrase" operators e.g. *, \~, ( )

I think 1) is achievable if we carefully define where the existing parser breaks (e.g. ANDs and nested brackets). 2) is unnecessary if we can achieve 1). 3) would be a shame if we lost useful features for some very convoluted edge cases. 4) is beyond my JavaCC skills.
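To make the syntax summary in this comment concrete: each phrase element can be thought of as a set of acceptable terms, and the implied "AND_NEXT_TO" means consecutive elements must match at adjacent positions. The following is a simplified model written for illustration (whitespace tokenization, no slop, hypothetical class and method names), not the parser's actual implementation:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PhraseElementsSketch {
    /** Each phrase element is a set of acceptable terms; elements must match adjacent positions. */
    static boolean matches(String doc, List<Set<String>> elements) {
        String[] tokens = doc.split("\\s+");
        for (int start = 0; start + elements.size() <= tokens.length; start++) {
            boolean ok = true;
            for (int i = 0; i < elements.size() && ok; i++) {
                ok = elements.get(i).contains(tokens[start + i]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    /** Models a bracketed group such as (john OR johathon) as a set of alternatives. */
    static Set<String> anyOf(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }
}
```

With this model, a phrase like "(john OR johathon) smith" is the two-element list `[anyOf("john", "johathon"), anyOf("smith")]`; wildcard, fuzzy and range clauses would each expand to a larger set of alternatives for their position.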

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

My first thought is, if we can address some of the issues brought up, there is no reason to keep this out of core IMHO.

My second thought is, I have a feeling a lot of this concern stems from the fact that these guys (or one of them) has to duplicate this thing with the QueryParser code in contrib. That could be reason enough to move it to contrib. But it doesn't solve the issue longer term when the old QueryParser is removed. It would need to be replaced then, or dropped from contrib.

With the new info from Mark H, how hard would it be to create a new imp for the new parser that did a lot of this, in a more defined way? It seems you basically just want to be able to use multiterm queries and group/or things, right? We could even relax a little if we have to. This hasn't been released, so there is still a lot of wiggle room I think. But there does have to be a resolution with this and the new parser at some point either way.

asfimport commented 15 years ago

Adriano Crestani (migrated from JIRA)

Hi Mark H.,

Thanks for the response, some comments inline:

Correct, the "inner phrase" example was a term not a phrase. This is perhaps a better example:

checkBadQuery("\"jo* \"percival smith\" \""); //phrases inside phrases is bad

I think you did not get what I meant, even with your new example, there is no inner phrase, it is: a phrase <"jo* ">, followed by a term <percival>, followed by another term <smith>, and an empty phrase <" ">. So, with your change, the junit passes, but for the wrong reason. It gets an exception complaining about the empty phrase and not because there is an inner phrase (I still don't see how you can type an inner phrase with the current syntax). I think it's not a big deal, but I'm just trying to understand and raise a probable wrong test. I expect you understood what I mean, let me know if I did not make it clear.

The Junit is currently the main form of documentation

But not the ideal, because the source code (junit code) is not released in the binary release. So, the ideal place should be in the javadocs.

Wildcard/fuzzy/range clauses can be used to define a phrase element (as opposed to simply single terms)

Brackets are used to group/define the acceptable variations for a given phrase element e.g. "(john OR jonathon) smith"

"AND" is irrelevant - there is effectively an implied "AND_NEXT_TO" binding all phrase elements

Thanks, now it's clearer for me what is supported or not. I have some questions:

I understand this AND_NEXT_TO implicit operator between the queries inside the phrase. However, what happens if the user does not type any explicit boolean operator between two terms inside parentheses: "(query parser) lucene"? Is the operator between 'query' and 'parser' the implicit AND_NEXT_TO or the default boolean operator (usually OR)?

What happens if I type "(query AND parser) lucene". In my point of view it is: "(query AND parser) AND_NEXT_TO lucene". Which means for me: find any document that contains the term 'query' and the term 'parser' in the position x, and the term 'lucene' in the position x+1. Is this the expected behaviour?

1) Keep in core and improve error reporting and documentation 2) Move into "contrib" as experimental 3) Retain in core but simplify it to support only the simplest syntax (as in my Britney\~ example) 4) Re-engineer the QueryParser.jj to support a formally defined syntax for acceptable "within phrase" operators e.g. *, \~, ( )

1 is good, but I would prefer 4 too. Documentation and throwing the right exception are necessary. I just don't feel comfortable with the complex phrase query parser relying on the main query parser syntax; any change to the main one could easily break the complex phrase QP. Anyway, 4 may be done in future :)

Mark M.:

With the new info from Mark H, how hard would it be to create a new imp for the new parser that did a lot of this, in a more defined way? It seems you basically just want to be able to use multiterm queries and group/or things, right? We could even relax a little if we have to. This hasn't been released, so there is still a lot of wiggle room I think. But there does have to be a resolution with this and the new parser at some point either way.

Yes, I am working on the new query parser code. I started recently to read and understand how the ComplexPhraseQP works, so I could reproduce the behaviour using the new QP framework. I first tried to look at this QP as a user and could not figure out what exactly I can or cannot do with it. I think now we are hitting a big problem, which is related to documentation. That is why I started raising these questions, because others could also have the same issues in future.

So, yes, I can start coding some equivalent QP using the new QP framework, I'm just questioning and trying to understand everything before I start any coding. I don't wanna code anything that will throw ConcurrentModificationExceptions, that's why I'm raising these issues now, before I start moving it to the new QP.

Best Regards, Adriano Crestani Campos

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

I think the best thing to do here is to define exactly what syntax is supposed to be supported (which Mark H. did in his latest comment), and then implement the new syntax with the new queryparser. It will enforce correct syntax and give meaningful exceptions if a query was entered that is not supported.

I think we can still reuse big portions of Mark's patch: we should be able to write a new QueryBuilder that produces the new ComplexPhraseQuery.

Adriano/Luis: how long would it take to implement? Can we contain it for 2.9?

This would mean that these new features would go into contrib in 2.9 as part of the new query parser framework, and then be moved to core in 3.0. Also from 3.0 these new features would then be part of Lucene's main query syntax. Would this make sense?

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

Reopening this issue; we haven't made a final decision on how we want to go forward yet, but in any case there's remaining work here.

asfimport commented 15 years ago

Luis Alves (migrated from JIRA)

Hi Mark H

I would like to propose an option 5: re-engineer the QueryParser.jj to support a formally defined syntax for acceptable "within phrase" operators e.g. *, \~, ( ), using the new QP implementation. (I can write the new javacc QP for this.) This implies that the code will be in contrib in 2.9 and be part of core in 3.0.

I also want to propose to change the complexphrase to use single quotes, this way we can have both implementation for phrases.

Here is a summary:

The complexqueryparser would support all Lucene syntax even for phrases, and we could add single-quoted text to identify complex phrases:

1) Wildcard/fuzzy/range clauses can be used to define a phrase element (as opposed to simply single terms)
2) Brackets are used to group/define the acceptable variations for a given phrase element e.g. "(john OR jonathon) smith"
3) supported operators: OR, *, \~, ( ), ?
4) disallow fields, proximity, boosting and operators on single quoted phrases (I'm making an assumption here, Mark H please comment)
5) single quotes need to be escaped, double quotes will be treated as regular punctuation characters inside single quoted strings

Mark H, can you please elaborate more on these other operators "+" "-" "^" "AND" "&&" "||" "NOT" "!" ":" "[" "]" "{" "}".

Example: A query with single quoted (complexphrase) followed by a term and a normal phrase:

query: '(john OR jonathon) smith\~0.3 order*' order:sell "stock market"

asfimport commented 15 years ago

Luis Alves (migrated from JIRA)

Mark H -

Question 1)

I added a doc 5 and 6

    ...
    DocData docsContent[] = { new DocData("john smith", "1"),
        new DocData("johathon smith", "2"),
        new DocData("john percival smith goes on  a b c vacation", "3"),
        new DocData("jackson waits tom", "4"),
        new DocData("johathon smith john", "5"),
        new DocData("johathon mary gomes smith", "6"),
        };
    ...

For the test

    checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

would document 5 be returned, or should just doc 2 be returned? I'm assuming position is always important and doc 5 is supposed to be returned. Is this the correct behavior?

Question 2) Should these 2 queries behave the same when we fix the problem?

    // checkMatches("\"john -percival\"", "1"); // not logic doesn't work
    // checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

Question 3) For the query

    checkMatches("\"jo* smith\"~2", "1,2,3,5"); // position logic works.

doc 6 is also returned, so this feature does not seem to be working.
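One way to reason about this question is to count the positions between the two phrase elements in each document. The sketch below assumes that slop means the maximum number of unmatched positions allowed between two in-order elements; that is an assumption about the intended semantics, not a statement of what the parser actually does, and the class is hypothetical:

```java
class SlopSketch {
    /**
     * Smallest number of tokens between a token starting with the given prefix
     * and a later token equal to the given term, or -1 if there is no such pair.
     */
    static int minGap(String doc, String prefix, String term) {
        String[] t = doc.split("\\s+");
        int best = -1;
        for (int i = 0; i < t.length; i++) {
            if (!t[i].startsWith(prefix)) {
                continue;
            }
            for (int j = i + 1; j < t.length; j++) {
                if (t[j].equals(term)) {
                    int gap = j - i - 1;
                    if (best < 0 || gap < best) {
                        best = gap;
                    }
                }
            }
        }
        return best;
    }

    /** True if some in-order pair is separated by at most slop tokens. */
    static boolean matchesWithSlop(String doc, String prefix, String term, int slop) {
        int gap = minGap(doc, prefix, term);
        return gap >= 0 && gap <= slop;
    }
}
```

Under that assumption, "johathon mary gomes smith" has exactly two intervening terms between the jo* match and smith, so it would fall within a slop of 2; whether that is the intended behaviour is the open question here.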

Question 4) The usage of AND and AND_NEXT_TO is confusing to me. The query

    checkMatches("\"(jo* AND mary) smith\"", "1,2,5"); // boolean logic with

returns 1,2,5 and not 6, but I was only expecting 6 to be returned; it seems like the AND is converted into an OR. What is the behavior you want to implement?

asfimport commented 15 years ago

Luis Alves (migrated from JIRA)

Sorry for all the emails, I'm still new to JIRA and only now I realized that for every edit I do, an email is sent.

But now that I found the preview button, it won't happen again. :)

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

I think it's not a big deal, but I'm just trying to understand and raise a probable wrong test.

Granted, the test fails for a reason other than the one for which I wanted it to fail. We can probably strike the test and leave a note saying phrase-within-a-phrase just does not make sense and is not supported.

Is the operator between 'query' and 'parser' the implicit AND_NEXT_TO or the default boolean operator (usually OR)?

In brackets it's an OR - the brackets are used to suggest that the current phrase element at position X is composed of some choices that are evaluated as a subclause in the same way that in normal query logic sub-clauses are defined in brackets e.g. +a +(b OR c). There seems to be a reasonable logic to this.

Ideally the ComplexPhraseQueryParser should explicitly turn this setting on while evaluating the bracketed innards of phrases just in case the base class has AND as the default.

Mark H, can you please elaborate more on these other operators "+" "-" "^" "AND" "&&" "||" "NOT" "!" ":" "[" "]" "{" "}".

OK I'll try and deal with them one by one but these are not necessarily definitive answers or guarantees of correctly implemented support

- OR, ||, +, AND, && ..... ignored. The implicit operator is AND_NEXT_TO apart from in bracketed sections where all elements at this level are ORed
- ^ ..... boosts are carried through from TermQuerys to SpanTermQuerys
- NOT, ! ..... creates SpanNotQueries
- [] {} ..... range queries are supported as are wildcards *, fuzzies \~, ?
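To make the operator semantics above concrete, here is a minimal toy model in plain Java. It is only a sketch and not the parser's actual code: it assumes whitespace tokenization, treats a trailing `*` as a prefix wildcard, models a bracketed element as "any alternative, minus excluded terms", and joins elements with strict adjacency (AND_NEXT_TO, slop 0). The class and method names (`ToyComplexPhrase`, `anyOfExcept`, `matchingIds`) are hypothetical, and the documents are the six from Luis's test data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Toy model of the phrase semantics discussed in this thread; not Lucene code. */
public class ToyComplexPhrase {

    /** Matches a token literally, or as a prefix when the pattern ends in '*'. */
    public static Predicate<String> term(String pattern) {
        if (pattern.endsWith("*")) {
            String prefix = pattern.substring(0, pattern.length() - 1);
            return token -> token.startsWith(prefix);
        }
        return token -> token.equals(pattern);
    }

    /** Bracketed element: any alternative may match (OR), unless an excluded term also matches. */
    public static Predicate<String> anyOfExcept(List<String> alternatives, List<String> excluded) {
        return token -> alternatives.stream().anyMatch(a -> term(a).test(token))
                && excluded.stream().noneMatch(e -> term(e).test(token));
    }

    /** AND_NEXT_TO with slop 0: element i must match the token at position start + i. */
    public static boolean matches(String doc, List<Predicate<String>> elements) {
        String[] tokens = doc.trim().split("\\s+");
        for (int start = 0; start + elements.size() <= tokens.length; start++) {
            boolean all = true;
            for (int i = 0; i < elements.size() && all; i++) {
                all = elements.get(i).test(tokens[start + i]);
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    /** Returns the ids of all documents that match the phrase, in insertion order. */
    public static List<String> matchingIds(Map<String, String> docs, List<Predicate<String>> elements) {
        List<String> ids = new ArrayList<>();
        docs.forEach((id, text) -> {
            if (matches(text, elements)) {
                ids.add(id);
            }
        });
        return ids;
    }

    /** The six test documents quoted earlier in this thread. */
    public static Map<String, String> sampleDocs() {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "john smith");
        docs.put("2", "johathon smith");
        docs.put("3", "john percival smith goes on  a b c vacation");
        docs.put("4", "jackson waits tom");
        docs.put("5", "johathon smith john");
        docs.put("6", "johathon mary gomes smith");
        return docs;
    }

    public static void main(String[] args) {
        // Models the phrase "(jo* -john) smith"
        List<Predicate<String>> phrase = List.of(
                anyOfExcept(List.of("jo*"), List.of("john")),
                term("smith"));
        System.out.println(matchingIds(sampleDocs(), phrase)); // prints [2, 5]
    }
}
```

Under these assumptions docs 2 and 5 match, which agrees with the expectation that doc 5 should match when position is relevant and no slop is given.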

query: '(john OR jonathon) smith\~0.3 order*' order:sell "stock market"

I'll post the XML query syntax equivalent of what should be parsed here shortly (just seen your next comment come in)

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

for test checkMatches("\"(jo* -john) smyth\"", "2"); would document 5 be returned or just doc 2 should be returned,

I presume you mean smith not smyth here otherwise nothing would match? If so, doc 5 should match and position is relevant (subject to slop factors).

Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches("\"john -percival\"", "1"); // not logic doesn't work // checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

I suppose there's an open question as to whether the second example is legal (the brackets are unnecessary)

Question 3) checkMatches("\"jo* smith\"\~2", "1,2,3,5"); // position logic works. doc 6 is also returned, so this feature does not seem to be working.

That looks like a bug related to slop factor?

Question 4) The usage of AND and AND_NEXT_TO is confusing to me the query checkMatches("\"(jo* AND mary) smith\"", "1,2,5"); // boolean logic with

ANDs are ignored and turned into ORs (see earlier comments) but maybe a query parse error should be thrown to emphasise this.

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

query: '(john OR jonathon) smith\~0.3 order*' order:sell "stock market"

Would be parsed as follows (shown as equivalent XMLQueryParser syntax)


<BooleanQuery>
  <Clause occurs="should">
    <SpanNear>
      <SpanOr>
        <SpanOrTerms>john jonathon</SpanOrTerms>
      </SpanOr>
      <SpanOr>
        <SpanOrTerms>smith smyth</SpanOrTerms>
      </SpanOr>
      <SpanOr>
        <SpanOrTerms>order orders</SpanOrTerms>
      </SpanOr>
    </SpanNear>
  </Clause>
  <Clause occurs="should">
    <TermQuery fieldName="order">sell</TermQuery>
  </Clause>
  <Clause occurs="should">
    <UserQuery>"stock market"</UserQuery>
  </Clause>
</BooleanQuery>

asfimport commented 15 years ago

Adriano Crestani (migrated from JIRA)

I propose doing this using the new QP implementation. (I can write the new javacc QP for this) (this implies that the code will be in contrib in 2.9 and be part of core in 3.0)

That would be good!

Granted, the test fails for a reason other than the one for which I wanted it to fail. We can probably strike the test and leave a note saying phrase-within-a-phrase just does not make sense and is not supported.

Cool, I agree to remove it. But I still don't see how a user can type a phrase inside a phrase with the current syntax definition. Can you give me an example?

In brackets it's an OR - the brackets are used to suggest that the current phrase element at position X is composed of some choices that are evaluated as a subclause in the same way that in normal query logic sub-clauses are defined in brackets e.g. +a +(b OR c). There seems to be a reasonable logic to this.

Ideally the ComplexPhraseQueryParser should explicitly turn this setting on while evaluating the bracketed innards of phrases just in case the base class has AND as the default.

If we use the JavaCC implementation Luis suggested, we would already have a query parser that throws ParseExceptions whenever the user types an AND inside a phrase.

OR,||,+, AND, && ..... ignored

So we should throw an exception if any of these is found inside a phrase. It could confuse the user if we just ignore them.

Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches("\"john -percival\"", "1"); // not logic doesn't work // checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

I suppose there's an open question as to if the second example is legal (the brackets are unnecessary)

Yes, the second is unnecessary, but I don't think it's illegal. The user could type <(smith)> outside the phrase, it makes sense to support it inside also.

Question 3) checkMatches("\"jo* smith\"\~2", "1,2,3,5"); // position logic works. doc 6 is also returned, so this feature does not seem to be working.

That looks like a bug related to slop factor?

I have not checked yet, but I think it's working fine. The slop is the number of position moves the terms inside the phrase are allowed to make while still matching the query. It matches doc 6 because the term <smith> sits two positions to the right of where the phrase expects it in "johathon mary gomes smith". Two moves = slop 2 :)
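The slop arithmetic above can be checked with a small sketch in plain Java. This is a toy, in-order check for a two-element phrase, not Lucene's sloppy-phrase or span scoring: it assumes whitespace tokenization, a trailing `*` as a prefix wildcard, and counts only how many extra positions separate the second term from the slot right after the first. The names (`ToySlop`, `matchesWithinSlop`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy in-order slop check for a two-element phrase; not Lucene's scorer. */
public class ToySlop {

    /** Literal match, or prefix match when the pattern ends in '*'. */
    public static boolean termMatches(String pattern, String token) {
        return pattern.endsWith("*")
                ? token.startsWith(pattern.substring(0, pattern.length() - 1))
                : token.equals(pattern);
    }

    /** True if 'second' follows 'first' with at most 'slop' extra positions in between. */
    public static boolean matchesWithinSlop(String doc, String first, String second, int slop) {
        String[] tokens = doc.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (!termMatches(first, tokens[i])) {
                continue;
            }
            for (int j = i + 1; j < tokens.length; j++) {
                int moves = j - (i + 1); // distance from the expected adjacent slot
                if (moves > slop) {
                    break;
                }
                if (termMatches(second, tokens[j])) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Ids of the thread's six test documents that match "first second"~slop. */
    public static List<String> matchingIds(String first, String second, int slop) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "john smith");
        docs.put("2", "johathon smith");
        docs.put("3", "john percival smith goes on  a b c vacation");
        docs.put("4", "jackson waits tom");
        docs.put("5", "johathon smith john");
        docs.put("6", "johathon mary gomes smith");
        List<String> ids = new ArrayList<>();
        docs.forEach((id, text) -> {
            if (matchesWithinSlop(text, first, second, slop)) {
                ids.add(id);
            }
        });
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(matchingIds("jo*", "smith", 2)); // prints [1, 2, 3, 5, 6]
        System.out.println(matchingIds("jo*", "smith", 1)); // prints [1, 2, 3, 5]
    }
}
```

In this model doc 6 needs exactly two moves, so it matches at slop 2 and drops out at slop 1, which is consistent with the reading that the test expectation "1,2,3,5" is what is off rather than the position logic.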

ANDs are ignored and turned into ORs (see earlier comments) but maybe a query parse error should be thrown to emphasise this.

I think we could support AND also. I agree there are few cases where the user would use that. It would work as I explained before:

What happens if I type "(query AND parser) lucene". In my point of view it is: "(query AND parser) AND_NEXT_TO lucene". Which means for me: find any document that contains the term 'query' and the term 'parser' in the position x, and the term 'lucene' in the position x+1. Is this the expected behaviour?

asfimport commented 15 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

Hi everyone,

I am using your ComplexPhraseQueryParser. I integrated it into Solr. I am interested in it mainly because it supports the OR operator and wildcards inside proximity searches.

Specifically: "(john johathon) smith"\~10 and "j* smith". They both work perfectly, thank you for your work.

I downloaded its source code from http://svn.apache.org/viewvc?view=rev&revision=791579 and then edited the code a little, since I am using Lucene 2.4.1. I made these changes:

1) Replaced TermRangeQuery with RangeQuery.
2) Replaced getConstantScoreRewrite() with getUseOldRangeQuery();
3) Replaced setConstantScoreRewrite(false); with setUseOldRangeQuery(true);
4) On line 168 of ComplexPhraseQueryParser.java there are two semicolons ( ; ; )

I am not sure whether what I did is the right way to start using this query parser with the latest versions of Lucene/Solr. If it is not, can you suggest a better way, or tell me where to download the latest source code of the query parser?

I am having problems with multi-field searches.

The query "(john johathon) smith"\~10 works on the default field, e.g. text.

But when I run the same query on another field (other than the default field), title:"(john johathon) smith"\~10, it throws the exception below: Cannot have clause for field "text" nested in phrase for field "title"

When I run the query with the field name distributed to all terms, it works: title:"(title:john title:johathon) title:smith"\~10

Is there an easy way to set the field of all terms (without specifying it on each one)?
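As a stopgap for the manual rewriting described above, the field name can be distributed over the terms with a small string helper before the query is handed to the parser. This is only a sketch of that workaround, not part of ComplexPhraseQueryParser: the class `FieldPrefixer` and its methods are hypothetical, and it assumes simple terms with no escaped characters, nested quotes, or range syntax.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that prefixes every bare term in a phrase body with a field name. */
public class FieldPrefixer {

    // A token is any run of characters other than whitespace and parentheses.
    private static final Pattern TOKEN = Pattern.compile("[^\\s()]+");

    // Operator words and symbols that must be left untouched.
    private static final Set<String> OPERATORS = Set.of("OR", "AND", "NOT", "&&", "||", "!");

    /** Rewrites e.g. "(john johathon) smith" to "(title:john title:johathon) title:smith". */
    public static String qualify(String field, String phraseBody) {
        Matcher m = TOKEN.matcher(phraseBody);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String token = m.group();
            String replacement;
            if (token.contains(":") || OPERATORS.contains(token)) {
                replacement = token; // already qualified, or an operator
            } else if (token.length() > 1 && (token.charAt(0) == '-' || token.charAt(0) == '+')) {
                // keep the leading +/- in front of the field name
                replacement = token.charAt(0) + field + ":" + token.substring(1);
            } else {
                replacement = field + ":" + token;
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Builds the full fielded phrase query string with a slop suffix. */
    public static String fieldedPhrase(String field, String phraseBody, int slop) {
        return field + ":\"" + qualify(field, phraseBody) + "\"~" + slop;
    }

    public static void main(String[] args) {
        // prints title:"(title:john title:johathon) title:smith"~10
        System.out.println(fieldedPhrase("title", "(john johathon) smith", 10));
    }
}
```

The cleaner fix is for the parser itself to apply the phrase's field to the terms inside it; this helper only automates the manual rewrite.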

And about boosts of multi-field queries, is this query legal? (default operator = OR, default field = text)

title:"(title:john title:johathon) title:smith"\~10^1.5 OR "(john johathon) smith"\~10^3.0

In short, I want to use this query parser to query multiple fields with different boosts.

I am not sure if I am allowed to ask such a question here; if not, please accept my apologies.

Thank you for your consideration.

Ahmet Arslan

Wildcards, ORs etc inside Phrase queries [LUCENE-1486] #2560
