apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Highlighter based on Matches API [LUCENE-8349] #9396

Open asfimport opened 6 years ago

asfimport commented 6 years ago

I started trying to integrate the Matches API into the UnifiedHighlighter, but there's a fairly heavy impedance mismatch between the way the two of them work (e.g. Matches doesn't give you freqs, it's entirely lazy, and the UH tries to do things by field rather than by doc). So instead, I thought I'd try to write a new highlighter based around Matches, and see what it looks like.
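To make the "entirely lazy, doc rather than field" point concrete, here is a minimal self-contained sketch. It does not use Lucene at all: `Match`, `lazyMatches`, `highlight` and `highlightDoc` are hypothetical names invented for illustration, and plain case-sensitive substring search stands in for real query matching. It only shows the shape of the approach: matches are discovered one at a time as the iterator is advanced (so no frequency is known up front), and the highlighter walks one document's fields in turn.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

public class MatchHighlightSketch {

    /** Hypothetical stand-in for a single match: character offsets into the field text. */
    public static final class Match {
        public final int start;
        public final int end;

        public Match(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Lazily yields non-overlapping occurrences of {@code term} in {@code text}.
     * Each occurrence is located only when the iterator is advanced, so the
     * total number of matches is never known in advance.
     */
    public static Iterator<Match> lazyMatches(final String text, final String term) {
        return new Iterator<Match>() {
            private int nextStart = term.isEmpty() ? -1 : text.indexOf(term);

            @Override
            public boolean hasNext() {
                return nextStart >= 0;
            }

            @Override
            public Match next() {
                if (nextStart < 0) {
                    throw new NoSuchElementException();
                }
                Match m = new Match(nextStart, nextStart + term.length());
                nextStart = text.indexOf(term, m.end); // look up the following match on demand
                return m;
            }
        };
    }

    /** Wraps every match in bold tags, consuming the iterator in a single pass. */
    public static String highlight(String text, Iterator<Match> matches) {
        StringBuilder sb = new StringBuilder();
        int cursor = 0;
        while (matches.hasNext()) {
            Match m = matches.next();
            sb.append(text, cursor, m.start)
              .append("<b>").append(text, m.start, m.end).append("</b>");
            cursor = m.end;
        }
        sb.append(text, cursor, text.length());
        return sb.toString();
    }

    /** Doc-then-field order: take one document and highlight each of its fields in turn. */
    public static Map<String, String> highlightDoc(Map<String, String> doc, String term) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : doc.entrySet()) {
            String text = field.getValue();
            out.put(field.getKey(), highlight(text, lazyMatches(text, term)));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "the quick brown fox");
        doc.put("body", "a fox and a fox");
        System.out.println(highlightDoc(doc, "fox"));
    }
}
```

A field-then-doc highlighter would invert the outer loop, processing one field across a batch of documents, which is where the two designs pull in different directions.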


Migrated from LUCENE-8349 by Alan Woodward (@romseygeek), updated Jun 21 2018

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Here's a pull request with my first sketch: https://github.com/apache/lucene-solr/pull/397

It's very minimal, needs lots of javadocs and testing, and doesn't score passages yet, but it should give an idea of what I'm trying to do.

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

cc @jimczi @dsmiley

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

@rmuir do you have a comment on highlighting by field-then-doc vs doc-then-field? I believe you chose this arrangement in the PostingsHighlighter (the ancestor of the UH) and AFAICT this is optimized for offsets in postings. I'm not sure how much it matters. And I'm surprised the Matches API would have any impact on the distinction (as Alan implies it would), but I haven't looked closely at this patch yet to see.

I'll look at your PR, Alan. This is lighting a fire under my butt to continue #9333 – battle of the highlighters ;-)

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

This highlighter is impressive for not a lot of code! Great work @romseygeek! Some observations:

BTW some complexity in the UH that I don't see here is related to query tree visiting, such as for MultiTermQueries and also for getting all the terms (granted the latter is easy and not much code). This information is put to good use by building a MemoryIndex collecting only the pertinent terms and not bothering with the rest.
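As a rough illustration of the query-tree visiting described above, here is a self-contained sketch. It deliberately avoids Lucene's classes: `Node`, `TermNode`, `PrefixNode` and `BoolNode` are hypothetical stand-ins for a term query, a multi-term query and a boolean query, and `pertinentTokens` is an invented helper that mimics the idea of indexing only the tokens the query could match. The real UH logic is more involved; this just shows the walk-then-filter shape.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class QueryWalkSketch {

    /** Hypothetical query tree node; not a Lucene type. */
    public interface Node {}

    /** Stand-in for a single-term query. */
    public static final class TermNode implements Node {
        public final String term;
        public TermNode(String term) { this.term = term; }
    }

    /** Stand-in for a multi-term query (here, a prefix match). */
    public static final class PrefixNode implements Node {
        public final String prefix;
        public PrefixNode(String prefix) { this.prefix = prefix; }
    }

    /** Stand-in for a boolean query holding sub-clauses. */
    public static final class BoolNode implements Node {
        public final List<Node> clauses;
        public BoolNode(Node... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    /** Recursively walks the tree, collecting exact terms and multi-term prefixes. */
    public static void collect(Node node, Set<String> terms, List<String> prefixes) {
        if (node instanceof TermNode) {
            terms.add(((TermNode) node).term);
        } else if (node instanceof PrefixNode) {
            prefixes.add(((PrefixNode) node).prefix);
        } else if (node instanceof BoolNode) {
            for (Node clause : ((BoolNode) node).clauses) {
                collect(clause, terms, prefixes);
            }
        }
    }

    /** Keeps only the tokens the query could match, dropping everything else. */
    public static List<String> pertinentTokens(List<String> tokens, Node query) {
        Set<String> terms = new LinkedHashSet<>();
        List<String> prefixes = new ArrayList<>();
        collect(query, terms, prefixes);

        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            boolean keep = terms.contains(token);
            for (String prefix : prefixes) {
                keep |= token.startsWith(prefix);
            }
            if (keep) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Node query = new BoolNode(new TermNode("fox"), new PrefixNode("qui"));
        List<String> tokens = Arrays.asList("the", "quick", "brown", "fox", "quietly");
        System.out.println(pertinentTokens(tokens, query));
    }
}
```

The filtered token list is what a small in-memory index would then be built from, so the cost of re-analysis scales with the query's terms rather than with the whole field.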

If this highlighter moves forward, I figure at some point you're going to have to address visiting/walking queries (e.g. to look for MTQs) and/or perhaps rewriting them. Consider these related issues: #9232 LUCENE-8160 #4114