Open asfimport opened 11 years ago
Ryan Lauck (migrated from JIRA)
Added a patch with two test cases that reproduce the issue
Ryan Lauck (migrated from JIRA)
Tricky problem! I created a patch and modified my test cases (deleted the old test case patch).
I'd appreciate any feedback, my solution seems durable and passes all highlighter test cases but I took a slightly different approach to finding the longest matching phrase.
Also a bonus idea! The current addIfNoOverlap method assumes that we would never want overlapping highlights and throws them out. A better approach might be to allow the user to provide a delegate that can add/modify overlapping WeightedPhraseInfo, some possible implementations could be:
Ryan Lauck (migrated from JIRA)
I hope I'm not stepping on any toes here, but I realized my patch is similar to some of the work done in #5190. My patch also solves the bug where repeated terms in a proximity query cause highlight matching to fail.
I also took a different approach to handling reverse order matching on slop queries so that this patch could be a complete alternative to #5190. I modified QueryPhraseMap.add to detect PhraseQuerys with slop and create a second mapping for the phrase terms in reverse order - this way no other code needs to change to handle proximity phrase terms appearing in reverse order.
I added two simple test cases for both reverse ordering and repeated terms.
Ryan Lauck (migrated from JIRA)
Store the max possible slop on the QueryPhraseMap rather than the entire FieldQuery. This limits unnecessary matching when a PhraseQuery with a large slop is combined with other PhraseQuerys.
Also, I added a fragment of slop recalculation code from WeightedSpanTermExtractor that handles PhraseQuerys with position gaps. The most common way this is encountered is by searching a phrase that contains stop words while using an analyzer that filters them.
Also cleaned up the test cases a little, and added a few comments.
Adrien Grand (@jpountz) (migrated from JIRA)
Ryan, I iterated over your patch in order to be able to handle a few more queries, specifically phrase queries that contain gaps or have several terms at the same position.
It is very hard to handle all possibilities without making the highlighting complexity explode. I'm looking forward to #3952 so that highlighting can be more efficient and doesn't need to duplicate the query interpretation logic anymore.
Ryan Lauck (migrated from JIRA)
Thanks Adrien!
I agree about #3952. I came to the same conclusion before finding that someone had already done most of the work that the ideal scenario is to (optionally) pull postings or term vectors in addition to payloads while scoring and expose them for highlighting. I'm looking forward to that patch too!
An idea I began working on but haven't polished enough to submit a patch for:
Users of the API could access raw highlight metadata (offsets and positions) and could additionally process to merge/filter/ignore overlapping highlights - one flaw I've had to work around in existing highlighters is that when highlights overlap they either merge them or toss all but the first encountered. We perform the highlighting manually in our system and hope to one day allow end users to toggle which terms are highlighted without having to make round-trips to the server to modify the search criteria and rerun the highlighter. With raw offset data this is trivial and merging/discarding overlaps can be handled in client-side code. There are additional advantages too such as being able to highlight find-in-page or search-within-search results and only having to transfer new offset metadata rather than the entire text over the wire (we have some very big 100MB+ documents).
Adrien Grand (@jpountz) (migrated from JIRA)
Hey Ryan, I think the use-case you are describing will be possible. However this will require some care because offsets computed by Lucene's analysis API are offsets for UTF16-encoded content (Java's internal encoding). So if your client code' programming language has a different internal encoding, you will need to perform conversions (this is not a fundamental problem, just something to be aware of in order not to get bad surprises).
ASF subversion and git services (migrated from JIRA)
Commit 1504862 from @jpountz in branch 'dev/trunk' https://svn.apache.org/r1504862
LUCENE-4734: Add FastVectorHighlighter support for proximity queries and phrase queries with gaps or overlapping terms.
ASF subversion and git services (migrated from JIRA)
Commit 1504863 from @jpountz in branch 'dev/branches/branch_4x' https://svn.apache.org/r1504863
LUCENE-4734: Add FastVectorHighlighter support for proximity queries and phrase queries with gaps or overlapping terms.
Adrien Grand (@jpountz) (migrated from JIRA)
The approach I used can be memory-intensive when there are many positions that have several terms, here is a fix.
ASF subversion and git services (migrated from JIRA)
Commit 1505731 from @jpountz in branch 'dev/trunk' https://svn.apache.org/r1505731
LUCENE-4734: Better memory efficiency.
ASF subversion and git services (migrated from JIRA)
Commit 1505732 from @jpountz in branch 'dev/branches/branch_4x' https://svn.apache.org/r1505732
LUCENE-4734: Better memory efficiency.
Ryan Lauck (migrated from JIRA)
I'm curious what you think about the comment I had in my original patch before calling addIfNoOverlap. It seems wasteful to traverse the phrase candidate list from the beginning every iteration to search for overlaps, and also prevents gathering the raw highlights as mentioned in my previous comment. What do you think about waiting until all phrase candidates are gathered to optionally filter overlaps?
Adrien Grand (@jpountz) (migrated from JIRA)
I agree this seems wasteful. Maybe we could open an issue about it?
Steven Rowe (@sarowe) (migrated from JIRA)
Bulk move 4.4 issues to 4.5 and 5.0
Jack Krupansky (migrated from JIRA)
Looking at some highlighter code, I see this constructor in org.apache.lucene.search.vectorhighlight.FieldPhraseList.java of branch_4x:
/**
* a constructor.
*
* `@param` fieldTermStack FieldTermStack object
* `@param` fieldQuery FieldQuery object
* `@param` phraseLimit maximum size of phraseList
*/
public FieldPhraseList( FieldTermStack fieldTermStack, FieldQuery fieldQuery, int phraseLimit ){
final String field = fieldTermStack.getFieldName();
QueryPhraseMap qpm = fieldQuery.getRootMap(field);
if (qpm != null) {
LinkedList<TermInfo> phraseCandidate = new LinkedList<TermInfo>();
extractPhrases(fieldTermStack.termList, qpm, phraseCandidate, 0);
assert phraseCandidate.size() == 0;
}
}
Clearly phraseLimit is no longer used. Is it being deprecated, or is this simply work in progress that will use it again eventually?
This parameter is passed over several layers of code, ultimately it is set up in Solr using the hl.phraseLimit parameter.
Seems like a "dead parameter" that should be cleaned up now or deprecated for future cleanup, but I can't say that I have been able to follow all of the work that has transpired in the highlighters.
The change occurred in Revision 1505732 (related to this Jira.) Before then, this parameter was used.
Comments? Or should this be a separate Jira issue?
Adrien Grand (@jpountz) (migrated from JIRA)
I've encountered several issues recently because of this patch I committed and I am considering reverting it. The issue is that it makes the runtime of FastVectorHighlighter very bad and this just doesn't work when the text to highlight has lots of occurrences of the query terms.
I spent time trying to improve it but I couldn't find a way to make it work nicely. I wouldn't feel good to have a release of Lucene out with this patch in it so I will revert the patch tomorrow if there is no objection.
Ryan Lauck (migrated from JIRA)
I did some performance testing of my original patch so I'd be happy to take a look at yours as well and see if I can come up with anything to help. Another idea I explored but never submitted was to build a finite state automata out of the query terms (each term is a state) and use that to highlight. Benchmarks were unstable but it seemed like a workable solution for complex queries.
The real question is: does it make more sense to invest time in #3952 rather than further complicating FVH? FVH works great for simple phrase and single term queries but it has so many corner cases...
Adrien Grand (@jpountz) (migrated from JIRA)
The real question is: does it make more sense to invest time in #3952 rather than further complicating FVH? FVH works great for simple phrase and single term queries but it has so many corner cases...
This is my feeling too.
Simon Willnauer (@s1monw) (migrated from JIRA)
The real question is: does it make more sense to invest time in #3952 rather than further complicating FVH? FVH works great for simple phrase and single term queries but it has so many corner cases..
+1 lets do it
+1 to revert the change!
ASF subversion and git services (migrated from JIRA)
Commit 1520536 from @jpountz in branch 'dev/trunk' https://svn.apache.org/r1520536
Revert LUCENE-4734.
ASF subversion and git services (migrated from JIRA)
Commit 1520544 from @jpountz in branch 'dev/branches/branch_4x' https://svn.apache.org/r1520544
Revert LUCENE-4734.
Uwe Schindler (@uschindler) (migrated from JIRA)
Move issue to Lucene 4.9.
If a proximity phrase query overlaps with any other query term it will not be highlighted.
Example Text: A B C D E F G
Example Queries:
"B E"\~10 D (D will be highlighted instead of "B C D E")
"B E"\~10 "C F"\~10 (nothing will be highlighted)
This can be traced to the FieldPhraseList constructor's inner while loop. From the first example query, the first TermInfo popped off the stack will be "B". The second TermInfo will be "D" which will not be found in the submap for "B E"\~10 and will trigger a failed match.
Migrated from LUCENE-4734 by Ryan Lauck, updated May 09 2016 Attachments: lucene-4734.patch, LUCENE-4734.patch, LUCENE-4734-2.patch Linked issues:
5190
5190