apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

ComplexPhraseQuery highlight problem after rewriting using ComplexPhraseQuery.rewrite(IndexReader) [LUCENE-4743] #5808

Open asfimport opened 11 years ago

asfimport commented 11 years ago

I would like to ask for assistance with ComplexPhraseQuery: after rewriting the query, the highlighted hits are not correct. I have also started using ComplexPhraseQueryParser to support complex proximity searches.


Migrated from LUCENE-4743 by Jason Nacional, updated Oct 09 2015

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Can you provide a simple testcase that shows the problem?

asfimport commented 11 years ago

Ryan Lauck (migrated from JIRA)

ComplexPhraseQuery rewrites complex proximity searches into SpanQuerys. FastVectorHighlighter currently just ignores SpanQuery; I'm not sure how Highlighter behaves. I use ComplexPhraseQuery in production, so I'd be happy to help trace this issue if you can provide some sample queries or test cases.

asfimport commented 11 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

The standard Highlighter can highlight span queries when in position-aware mode. It uses a memory index and decomposes the original query to find the matches.

asfimport commented 11 years ago

Jason Nacional (migrated from JIRA)

Thanks, all, for the quick response. Here is a sample query:

Let's say we have the following line: Make Sure Our Emails Reach Your Inbox

The query is: "(Make Sur*) Inbox"~10

After searching, the hits are correct, but "Make" is not being highlighted. Am I missing something here? Here is my code:

...
Query rewrite_result = phrase.rewrite(IndexReader.open(INDEX_DIR));
QueryScorer qs_phrases = new QueryScorer(rewrite_result);
qs_phrases.setExpandMultiTermQuery(true);
highlighter = new Highlighter(htmlFormatter, qs_phrases);
highlighter.setTextFragmenter(new NullFragmenter());
highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);
// get the temp text
if (text == null) {
    text = highlighter.getBestFragment(analyzer, "", pText);
} else {
    text_temp = highlighter.getBestFragment(analyzer, "", text);
    text = text_temp;
}
...

I'll start to create a test case for more info.

asfimport commented 11 years ago

Jason Nacional (migrated from JIRA)

Just an addition, I also used ComplexPhraseQueryParser as a default parser.

asfimport commented 11 years ago

Jason Nacional (migrated from JIRA)

I also have a question about rewriting ComplexPhraseQuery. Do I really need to open an IndexReader every time? In our system, searching and viewing the hit document happen on separate pages. So to highlight terms (since I use ComplexPhraseQuery and it needs to be "rewritten"), I currently open an IndexReader on the viewing page.

I hope you understand my concerns. And I apologize for so many questions.

Thanks.

asfimport commented 11 years ago

Ryan Lauck (migrated from JIRA)

Given your example queries above, yes: the IndexReader is used during rewrite to enumerate all the possible terms in a wildcard query. If your query only consisted of basic TermQuery and PhraseQuery, I think you could provide a static, empty IndexReader like PostingsHighlighter does. The docs recommend reusing a single IndexSearcher to avoid some of the overhead of opening new IndexReaders every time.
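To illustrate why a wildcard needs the reader, here is a rough sketch in plain Java. It is a toy model, not Lucene code: the class `PrefixExpansionSketch`, the helper `expandPrefix`, and the small term dictionary are all hypothetical stand-ins for what the index provides during rewrite.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model of wildcard rewriting; this is NOT Lucene code.
class PrefixExpansionSketch {

    // Hypothetical helper: returns every dictionary term that starts with the prefix.
    static List<String> expandPrefix(SortedSet<String> termDictionary, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet jumps to the first term >= prefix; in sorted order all terms
        // sharing the prefix are contiguous, so we can stop at the first miss.
        for (String term : termDictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed, made-up term dictionary standing in for the indexed terms.
        SortedSet<String> terms =
            new TreeSet<>(Arrays.asList("inbox", "make", "sure", "surface", "your"));
        System.out.println(expandPrefix(terms, "sur"));           // [sure, surface]
        System.out.println(expandPrefix(new TreeSet<>(), "sur")); // []
    }
}
```

With an empty dictionary the prefix expands to nothing, which is why an empty reader would only be safe for queries that contain no wildcards.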

asfimport commented 11 years ago

Jason Nacional (migrated from JIRA)

I tried to generate the translated query. Here it is:

spanNear([spanOr([content:make, spanOr([content:sur, content:sure, content:surely, content:surely.â, content:surer, content:surest, content:surety, content:surf, content:surface, content:surfaced, content:surfaces, content:surge, content:surged, content:surgeon, content:surgeonâ, content:surgery, content:surges, content:surgical, content:surging, content:surlier, content:surly, content:surmise, content:surmised, content:surmises, content:surmount, content:surmounted, content:surmounting, content:surname, content:surnames, content:surovsky, content:surpass, content:surpassed, content:surpassing, content:surplice, content:surplices, content:surplus, content:surprise, content:surprised, content:surprises, content:surprising, content:surprisingly, content:surrender, content:surrendered, content:surrendering, content:surrenders, content:surreptitiously, content:surround, content:surrounded, content:surrounding, content:surroundings, content:surrounds, content:suruchi, content:survey, content:surveyed, content:surveying, content:surveys, content:survival, content:survive, content:survived, content:surviving, content:sury])]), content:inbox], 10, true)

Could the problem be in the first spanOr?
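As a sanity check on the query structure itself, here is a minimal plain-Java sketch of an ordered "near" match over the sample line. It is a simplified model, not Lucene's span implementation: the class `OrderedNearSketch`, the helper `matchingPositions`, whitespace tokenization, and the assumption that slop counts the positions strictly between the two clauses are all mine.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model of an ordered "near" match; this is NOT Lucene's span implementation.
class OrderedNearSketch {

    // Returns the token positions that take part in at least one match of
    // near(anyOf(firstTerms), lastTerm), with the first clause before the last.
    static SortedSet<Integer> matchingPositions(List<String> tokens, Set<String> firstTerms,
                                                String lastTerm, int slop) {
        SortedSet<Integer> hits = new TreeSet<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!firstTerms.contains(tokens.get(i))) {
                continue;
            }
            for (int j = i + 1; j < tokens.size(); j++) {
                // Assumption: slop counts the positions strictly between the two clauses.
                if (tokens.get(j).equals(lastTerm) && j - i - 1 <= slop) {
                    hits.add(i);
                    hits.add(j);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("make sure our emails reach your inbox".split(" "));
        // A few of the terms from the expanded spanOr above.
        Set<String> first = new HashSet<>(Arrays.asList("make", "sur", "sure", "surely"));
        System.out.println(matchingPositions(tokens, first, "inbox", 10)); // [0, 1, 6]
    }
}
```

In this model both "make" (position 0) and "sure" (position 1) pair with "inbox" (position 6), so the structure of the query alone does not seem to exclude "make". That would point at how the highlighter extracts terms from the nested spans, though this sketch cannot confirm it.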

asfimport commented 11 years ago

Ahmet Arslan (@iorixxx) (migrated from JIRA)

Maybe the highlighter works without a rewrite after https://issues.apache.org/jira/browse/LUCENE-4728?

asfimport commented 11 years ago

Jason Nacional (migrated from JIRA)

Hi @iorixxx, What do you mean?

asfimport commented 11 years ago

Jason Nacional (migrated from JIRA)

I decided to run my script using SurroundQuery and to create a custom Interpreter that converts the queries into the surround query language. But how can I enable leading wildcard searches?

asfimport commented 9 years ago

Scott Stults (@sstults) (migrated from JIRA)

Looking at the query structure, this could be related to #3363 (problems highlighting nested span queries).