apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.63k stars 1.02k forks source link

Wildcards, ORs etc inside Phrase queries [LUCENE-1486] #2560

Closed asfimport closed 10 years ago

asfimport commented 15 years ago

An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.

The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include:

    checkMatches("\"j\*   smyth\~\"", "1,2"); //wildcards and fuzzies are OK in phrases
    checkMatches("\"(jo\* -john)  smith\"", "2"); // boolean logic works
    checkMatches("\"jo\*  smith\"\~2", "1,2,3"); // position logic works.

    checkBadQuery("\"jo\*  id:1 smith\""); //mixing fields in a phrase is bad
    checkBadQuery("\"jo\* \"smith\" \""); //phrases inside phrases is bad
    checkBadQuery("\"jo\* [sma TO smZ]\" \""); //range queries inside phrases not supported

Code plus Junit test to follow...


Migrated from LUCENE-1486 by Mark Harwood (@markharwood), 13 votes, resolved Mar 16 2014 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch (versions: 7), Lucene-1486 non default field.patch, TestComplexPhraseQuery.java Linked issues:

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Fix for phrases using QueryParser's non-default field e.g. author:"j* smith"

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

If we don't have a clear path for this very soon I think we should pull it from this release.

asfimport commented 15 years ago

Luis Alves (migrated from JIRA)

My understanding is that with "New flexible query parser" (#2641), the old QueryParser classes will be deprecated in 2.9 and removed in 3.0 (or moved to contrib in 3.0).

This change will also make ComplexPhraseQueryParser deprecated because it currently extends the old queryparser.

ComplexPhraseQueryParser was not part of any lucene release and was only checked in 2 months ago in trunk.

For the reasons above I think we should re-implement this functionality using the new flexible query parser.

3.0 and 2.9 releases will be very similar but 3.0 will have all deprecated APIs removed (at least this is my understanding).

In my view the path should be:

I'm not sure if I'll have the time to implement a compatible implementation of ComplexPhraseQueryParser before 2.9 release :(

I'm currently working on 1567 to finalize the patch, cleaning up javadocs and some small clean up to the APIs.

I'll try to work on ComplexPhraseQueryParser, once lucene-1567 is in the trunk.

So in my view, ComplexPhraseQueryParser depends on 1567, and will require some extra work after 1567 is in the trunk.

I think we have the following, options:

  1. We could wait until 1567 is in trunk and wait for a compatible implementation of ComplexPhraseQueryParser using 1567, before we release 2.9. (this would still remove the current ComplexPhraseQueryParser class, and provide this features with LuceneQueryParserHelper class, or with a new TextParser name complexphrase)
  2. We can release 2.9 with only 1567, but that will require ComplexPhraseQueryParser to be removed from trunk or at least deprecated in 2.9, and in 3.X re-implement it using the "New flexible query parser" APIs

I hope this helps :)

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay thanks. I think we should pull it for 2.9.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay, so I guess the question is - who objects to pulling this from 2.9? I don't think we should release a class that extends a deprecated class and I don't think we want to hold up 2.9 waiting for an adequate non deprecated replacement.

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

No objections to pulling from core given the impending deprecation of the QueryParser base class.

I know of at least 2 folks using it so moving it to contrib would help provide somewhere to maintain fixes while we wait for the new QueryParser to incorporate the complex phrase features.

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

+1 for moving it to conrib. Then the users Mark H. mentioned can consume it from a contrib jar until these features are in the new QP.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Alright, then - do you have time to handle that soon Mark H? If not I can probably make some time for it.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

patch that moves to contrib

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Reopening so we don't forget to do this one...

Come 3.0, how will this work, even in contrib? (Because the plan is to replace the old queryParser with the new one for 3.0).

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

The plan is to remove it and add a replacement built on the new QueryParser. The replacement may not be exactly the same, but it should be very similar.

My inclination was to leave it out of this release - its a single class and so easy to manage and plug it in separately if you want. I don't know that we should release a class that may or may not get a replacement (promises, promises ;) ) and extends a deprecated class. Contrib was once called sandbox though and consensus appeared to be to put it in contrib - so I went with that and added a warning that the class might change soon.

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

I'm not sure why the ComplexPhraseQuery itself is buried in the Parser. Can't the query stand on it's own? Seems like it could be a useful class outside of the specific content of a QueryParser, no?

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

It does not stand on it's own as it is merely a temporary object used as a peculiarity in the way the parsing works. The SpanQuery family would be the legitimate standalone equivalents of this class.

ComplexPhraseQuery objects are constructed during the the first pass of parsing to capture everything between quotes as an opaque string. The ComplexPhraseQueryParser then calls "parsePhraseElements(...)" on these objects to complete the process of parsing in a second pass where in this context any brackets etc take on a different meaning There is no merit in making this externally visible.

asfimport commented 15 years ago

Luis Alves (migrated from JIRA)

We hope to implement this on #2898, along with other features.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move to 3.1 as this is a new feature.

asfimport commented 14 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

Hi Mark,

Up to now, I was consuming ComplexPhraseQueryParser.java by means of copy paste into my source code, so I didn't notice. Today I find out that ComplexPhraseQuery.java in Lucene 2.9.1 Misc has missed the non default field.patch. It gives exception with author:"fred* smith" style queries. I am writing a solr plugin to contribute for this query parser and Solr 1.4.0 directly depends on lucene-misc-2.9.1.jar. Should I edit and include source code of ComplexPhraseQueryParser.java in my patch to solve this problem? Or is there a more convenient way to do it?

Thank you for your consideration.

asfimport commented 14 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Ugh. There's probably two separate actions required here then: 1) a bug needs raising on Lucene. 2) guidance needed from the Solr team about preferred course of action

asfimport commented 14 years ago

David Kaelbling (migrated from JIRA)

Could someone link the new Lucene bug mentioned above to this issue? I couldn't find it.

asfimport commented 14 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Double Ugh. Applying the patch for the "non-default field" bug doesn't work any more because the latest ComplexPhraseQueryParser source sitting in contrib now has a different package to the QueryParser base class . This means that this subclass doesn't have the required write access to the package-protected "field" variable. This is needed to temporarily set the context of the parser when processing the inner contents of the phrase.

Fixing this would require changing the package name of ComplexPhraseQueryParser or changing the visibility of "field" in the QueryParser base class to "protected". Anyone have any strong feelings about which of these is the most acceptable?

asfimport commented 14 years ago

Terje Eggestad (migrated from JIRA)

Hi

I'm about begin using the ComplexPhraseQueryParser with 3.0.2 as we need wildcard with phrases and proximity

Our customers have a habit of including '-' in phrases which seem to trigger a bug :

If you add the following tests to the TestComplexPhraseQueryParser class:

    checkMatches("\"joe john nosuchword\"", "");  
    checkMatches("\"joe-john-nosuchword\"", "");  
    checkMatches("\"john-nosuchword smith\"", "");  

AND add a rewrite() in checkMatches() just after parse : Query q = qp.parse(qString); IndexReader reader = searcher.getIndexReader(); // need for rewrite q = q.rewrite(reader);

The first two is OK, and is rewritten to:

spanNear([name:joe, name:john, name:nosuchword], 0, true) name:"joe john nosuchword"

The third bomb out on

java.lang.IllegalArgumentException: Unknown query type "org.apache.lucene.search.PhraseQuery" found in phrase query string "john-nosuchword smith" at org.apache.lucene.queryParser.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:281) at org.apache.lucene.queryParser.TestComplexPhraseQuery.checkMatches(TestComplexPhraseQuery.java:120) . . .

I made a fix that seem to fixit, but I feel on very shaky ground here. I've made so many debugging hack around that I can't make a propper patch, but I added this fix to ComplexPhraseQueryParser::rewrite() just before the place the exception is thrown:

   } else {
        if (qc instanceof TermQuery) {
            TermQuery tq = (TermQuery) qc;
            allSpanClauses[i] = new SpanTermQuery(tq.getTerm());

// START FIX "A-B C" phrases } else if (qc instanceof PhraseQuery) { PhraseQuery pq = (PhraseQuery) qc; Term[] subterms = pq.getTerms();

            SpanQuery[] clauses = new SpanQuery[subterms.length];
            for (int j = 0; j < subterms.length; j++) {
                clauses[j] = new SpanTermQuery(subterms[j]);
            }
            allSpanClauses[i] = new SpanNearQuery(clauses, 0, true);

// END FIX } else {

            throw new IllegalArgumentException("Unknown query type \""
                    + qc.getClass().getName()
                    + "\" found in phrase query string \""
                    + phrasedQueryStringContents + "\"");
        }
asfimport commented 12 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

changing the visibility of "field" in the QueryParser base class to "protected".

This seems reasonable?

asfimport commented 12 years ago

Tomas Eduardo Fernandez Lobbe (@tflobbe) (migrated from JIRA)

Hi I'm working in this change to allow field queries. I noted that queries like: name:"de*" name:de* fail due to the exception thrown in the "rewrite" method:

    public Query rewrite(IndexReader reader) throws IOException {
      // ArrayList spanClauses = new ArrayList();
      if (contents instanceof TermQuery) {
        return contents;
      }

      // Build a sequence of Span clauses arranged in a SpanNear - child
      // clauses can be complex
      // Booleans e.g. nots and ors etc
      int numNegatives = 0;
      if (!(contents instanceof BooleanQuery)) {
        throw new IllegalArgumentException("Unknown query type \""
            + contents.getClass().getName()
            + "\" found in phrase query string \"" + phrasedQueryStringContents
            + "\"");
      }
...

By changing it to something like:

if (!(contents instanceof BooleanQuery)) {
        return contents;
}

queries like the one above work, together with all the other queries available in the unit test. Is there something I'm missing with the previous change? I know the ComplexPhraseQueryParser is not intended to be used for queries like the ones I'm proposing, but why does it needs to fail in those cases?

asfimport commented 12 years ago

Tomas Eduardo Fernandez Lobbe (@tflobbe) (migrated from JIRA)

I attached a patch with the change of my previous comment plus the change that allows fielded queries.

asfimport commented 12 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

Mark's and Tomas' non default field patches are combined.

asfimport commented 12 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

Thanks for looking into this, Mark and Tomas. Do you think this issue is the right place to introduce boolean inOrder parameter? Currently always inOrder=true is passed to SpanNearQuery's ctor.

asfimport commented 12 years ago

Tomas Eduardo Fernandez Lobbe (@tflobbe) (migrated from JIRA)

Ahmet, I created a Jira for the "inOrder" in the ComplexPhraseQueryParser. See #4831.

asfimport commented 11 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

The JIRA cleaning man is here. I thought this was committed long ago, but I just noticed it's open and set for 4.1. Huh? Last activity on this pretty popular issue was from @markh back in 2008!

asfimport commented 11 years ago

Dmitry Kan (migrated from JIRA)

Can someone give me a hand on this parser (despite the jira is so old)?

We need to have the NOT logic work properly in the boolean sense, that is the following should work correctly:

a AND NOT b a AND NOT (b OR c) a AND NOT ((b OR c) AND (d OR e))

Can anybody guide me here? Is it at all possible to accomplish this with this original CPQP implementation? I would not be afraid of changing QueryParser.jj lexical specification, if the task requires it.

asfimport commented 11 years ago

Dmitry Kan (migrated from JIRA)

OK, after some study, here is what we did:

we treat the AND clauses as spanNearQuery objects. So, the

a AND b

becomes %a b%\slop, where %%\ operator is an unordered SpanNear query (change to QueryParser.jj was required for this).

When there is a case of NOT clause with nested clauses:

NOT( (a AND b) OR (c AND d) ) = NOT ( %a b%\~slop OR %c d%\~slop ) ,

we need to handle SpanNearQueries in the addComplexPhraseClause method. In order to handle this, we just added to the if statement:

[code] if (qc instanceof BooleanQuery) { [/code]

the following else if statement:

[code] else if (childQuery instanceof SpanNearQuery) { ors.add((SpanQuery)childQuery); } [/code]

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 10 years ago

Nikhil Chhaochharia (migrated from JIRA)

The patch posted by Ahmet Arslan on 8th Feb 2012 looks good to me. I have been using it in production for some time and did not find any issues.

I will request a committer to kindly look into this and help get this included into Solr 4.7. If any further work is required, then I will be happy to give it a shot.

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

Any interest in #6269?

asfimport commented 10 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

Hi Tim, of course. Do you think can we include TestComplexPhraseQuery.java in your work? To demonstrate it can replace ComplexPhraseQueryParser?

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

Happily. Will add tomorrow and update patch. Thank you!

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

Attached update with tests from TestComplexPhraseQuery to #6269.

asfimport commented 10 years ago

Nikhil Chhaochharia (migrated from JIRA)

6269 is very interesting, thanks for pointing me to it.

However, we should still try to get LUCENE-1486 closed - most of the work has already been done and it may be useful in certain cases where the full power of #6269 is not required.

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

Agreed. ComplexPhraseQueryParser has actually been fielded :)

asfimport commented 10 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

I agree with Nikhil Chhaochharia, This fix is all about accidentally forgotten thing to an already committed code. Also I am not sure why this issue is reopened. Any feeling towards a separate issue would be a better fit for this?

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

+1

asfimport commented 10 years ago

Nikhil Chhaochharia (migrated from JIRA)

It looks like there is a problem with stopwords also - a query like "A for B" where 'for' is a stopword is parsed as "A B" and does not match documents containing "A for B".

asfimport commented 10 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

OK, this seems like it's completely obsolete, any objections to closing? Should we raise Nikhil Chhaochharia's comment in a new JIRA to test at least?

asfimport commented 10 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

Should we raise Nikhil Chhaochharia's comment in a new JIRA to test at least?

I created #6593 for fielded query support.

asfimport commented 10 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

What about the stopwords bit? yet another JIRA?

asfimport commented 10 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

What about the stopwords bit? yet another JIRA?

There is no patch/solution for that in ComplexPhraseQueryParser. Tim says about the topic :

The root of this problem is that SpanNearQuery has no good way to handle stopwords in a way analagous to PhraseQuery.

I suggested Nikhil Chhaochharia to use a modified StopwordFilter ( I sent the filter to him offlist) that does not remove but instead reduces given stop words to an impossible token. "the" => "ImpossibleToken" "a" => "ImpossibleToken" "for" => "ImpossibleToken"

I think we don't need a jira for this functionality but we can document this as limitation and workaround for this.

asfimport commented 10 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

One thing is , @tflobbe has reported one problem in his comment . And he has provided a solution too. Its about "fred*" kind of queries. There is only one term inside quotes.

asfimport commented 10 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

One last thing that might come out of this jira is Terje Eggestad this comment and his fix. However I couldn't re-produce the problem he reported with new MockAnalyzer(random()); This problem could be analyzer specific.

asfimport commented 10 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

The re-opens were from 2009. This stuff has been in Lucene for some time, and the comment "Reopening so we don't forget to do this one" makes me think this should have been closed a long time ago.

NOTE: we're also doing more work with this in the 4.8 time frame, thus it's getting some attention now.

asfimport commented 10 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

Thanks Erick Erickson for closing this beast :)

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Close issue after release of 4.8.0