apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.63k stars 1.02k forks source link

FastVectorHighlighter Overlapping Proximity Queries Do Not Highlight [LUCENE-4734] #5799

Open asfimport opened 11 years ago

asfimport commented 11 years ago

If a proximity phrase query overlaps with any other query term it will not be highlighted.

Example Text: A B C D E F G

Example Queries:

"B E"\~10 D (D will be highlighted instead of "B C D E")

"B E"\~10 "C F"\~10 (nothing will be highlighted)

This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be "B". The second TermInfo will be "D" which will not be found in the submap for "B E"\~10 and will trigger a failed match.


Migrated from LUCENE-4734 by Ryan Lauck, updated May 09 2016 Attachments: lucene-4734.patch, LUCENE-4734.patch, LUCENE-4734-2.patch Linked issues:

asfimport commented 11 years ago

Ryan Lauck (migrated from JIRA)

Added a patch with two test cases that reproduce the issue

asfimport commented 11 years ago

Ryan Lauck (migrated from JIRA)

Tricky problem! I created a patch and modified my test cases (deleted the old test case patch).

I'd appreciate any feedback, my solution seems durable and passes all highlighter test cases but I took a slightly different approach to finding the longest matching phrase.

Also a bonus idea! The current addIfNoOverlap method assumes that we would never want overlapping highlights and throws them out. A better approach might be to allow the user to provide a delegate that can add/modify overlapping WeightedPhraseInfo, some possible implementations could be:

asfimport commented 11 years ago

Ryan Lauck (migrated from JIRA)

I hope I'm not stepping on any toes here, but I realized my patch is similar to some of the work done in #5190. My patch also solves the bug where repeated terms in a proximity query cause highlight matching to fail.

I also took a different approach to handling reverse order matching on slop queries so that this patch could be a complete alternative to #5190. I modified QueryPhraseMap.add to detect PhraseQuerys with slop and create a second mapping for the phrase terms in reverse order - this way no other code needs to change to handle proximity phrase terms appearing in reverse order.

I added two simple test cases for both reverse ordering and repeated terms.

asfimport commented 11 years ago

Ryan Lauck (migrated from JIRA)

Store the max possible slop on the QueryPhraseMap rather than the entire FieldQuery. This limits unnecessary matching when a PhraseQuery with a large slop is combined with other PhraseQuerys.

Also, I added a fragment of slop recalculation code from WeightedSpanTermExtractor that handles PhraseQuerys with position gaps. The most common way this is encountered is by searching a phrase that contains stop words while using an analyzer that filters them.

Also cleaned up the test cases a little, and added a few comments.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Ryan, I iterated over your patch in order to be able to handle a few more queries, specifically phrase queries that contain gaps or have several terms at the same position.

It is very hard to handle all possibilities without making the highlighting complexity explode. I'm looking forward to #3952 so that highlighting can be more efficient and doesn't need to duplicate the query interpretation logic anymore.

asfimport commented 11 years ago

Ryan Lauck (migrated from JIRA)

Thanks Adrien!

I agree about #3952. I came to the same conclusion before finding that someone had already done most of the work that the ideal scenario is to (optionally) pull postings or term vectors in addition to payloads while scoring and expose them for highlighting. I'm looking forward to that patch too!

An idea I began working on but haven't polished enough to submit a patch for:

Users of the API could access raw highlight metadata (offsets and positions) and could additionally process to merge/filter/ignore overlapping highlights - one flaw I've had to work around in existing highlighters is that when highlights overlap they either merge them or toss all but the first encountered. We perform the highlighting manually in our system and hope to one day allow end users to toggle which terms are highlighted without having to make round-trips to the server to modify the search criteria and rerun the highlighter. With raw offset data this is trivial and merging/discarding overlaps can be handled in client-side code. There are additional advantages too such as being able to highlight find-in-page or search-within-search results and only having to transfer new offset metadata rather than the entire text over the wire (we have some very big 100MB+ documents).

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Hey Ryan, I think the use-case you are describing will be possible. However this will require some care because offsets computed by Lucene's analysis API are offsets for UTF16-encoded content (Java's internal encoding). So if your client code' programming language has a different internal encoding, you will need to perform conversions (this is not a fundamental problem, just something to be aware of in order not to get bad surprises).

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1504862 from @jpountz in branch 'dev/trunk' https://svn.apache.org/r1504862

LUCENE-4734: Add FastVectorHighlighter support for proximity queries and phrase queries with gaps or overlapping terms.

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1504863 from @jpountz in branch 'dev/branches/branch_4x' https://svn.apache.org/r1504863

LUCENE-4734: Add FastVectorHighlighter support for proximity queries and phrase queries with gaps or overlapping terms.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

The approach I used can be memory-intensive when there are many positions that have several terms, here is a fix.

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1505731 from @jpountz in branch 'dev/trunk' https://svn.apache.org/r1505731

LUCENE-4734: Better memory efficiency.

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1505732 from @jpountz in branch 'dev/branches/branch_4x' https://svn.apache.org/r1505732

LUCENE-4734: Better memory efficiency.

asfimport commented 11 years ago

Ryan Lauck (migrated from JIRA)

I'm curious what you think about the comment I had in my original patch before calling addIfNoOverlap. It seems wasteful to traverse the phrase candidate list from the beginning every iteration to search for overlaps, and also prevents gathering the raw highlights as mentioned in my previous comment. What do you think about waiting until all phrase candidates are gathered to optionally filter overlaps?

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I agree this seems wasteful. Maybe we could open an issue about it?

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 11 years ago

Jack Krupansky (migrated from JIRA)

Looking at some highlighter code, I see this constructor in org.apache.lucene.search.vectorhighlight.FieldPhraseList.java of branch_4x:

/**
 * a constructor.
 * 
 * `@param` fieldTermStack FieldTermStack object
 * `@param` fieldQuery FieldQuery object
 * `@param` phraseLimit maximum size of phraseList
 */
public FieldPhraseList( FieldTermStack fieldTermStack, FieldQuery fieldQuery, int phraseLimit ){
  final String field = fieldTermStack.getFieldName();

  QueryPhraseMap qpm = fieldQuery.getRootMap(field);
  if (qpm != null) {
    LinkedList<TermInfo> phraseCandidate = new LinkedList<TermInfo>();
    extractPhrases(fieldTermStack.termList, qpm, phraseCandidate, 0);
    assert phraseCandidate.size() == 0;
  }
}

Clearly phraseLimit is no longer used. Is it being deprecated, or is this simply work in progress that will use it again eventually?

This parameter is passed over several layers of code, ultimately it is set up in Solr using the hl.phraseLimit parameter.

Seems like a "dead parameter" that should be cleaned up now or deprecated for future cleanup, but I can't say that I have been able to follow all of the work that has transpired in the highlighters.

The change occurred in Revision 1505732 (related to this Jira.) Before then, this parameter was used.

Comments? Or should this be a separate Jira issue?

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I've encountered several issues recently because of this patch I committed and I am considering reverting it. The issue is that it makes the runtime of FastVectorHighlighter very bad and this just doesn't work when the text to highlight has lots of occurrences of the query terms.

I spent time trying to improve it but I couldn't find a way to make it work nicely. I wouldn't feel good to have a release of Lucene out with this patch in it so I will revert the patch tomorrow if there is no objection.

asfimport commented 11 years ago

Ryan Lauck (migrated from JIRA)

I did some performance testing of my original patch so I'd be happy to take a look at yours as well and see if I can come up with anything to help. Another idea I explored but never submitted was to build a finite state automata out of the query terms (each term is a state) and use that to highlight. Benchmarks were unstable but it seemed like a workable solution for complex queries.

The real question is: does it make more sense to invest time in #3952 rather than further complicating FVH? FVH works great for simple phrase and single term queries but it has so many corner cases...

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

The real question is: does it make more sense to invest time in #3952 rather than further complicating FVH? FVH works great for simple phrase and single term queries but it has so many corner cases...

This is my feeling too.

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

The real question is: does it make more sense to invest time in #3952 rather than further complicating FVH? FVH works great for simple phrase and single term queries but it has so many corner cases..

+1 lets do it

+1 to revert the change!

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1520536 from @jpountz in branch 'dev/trunk' https://svn.apache.org/r1520536

Revert LUCENE-4734.

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1520544 from @jpountz in branch 'dev/branches/branch_4x' https://svn.apache.org/r1520544

Revert LUCENE-4734.

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.