Open valencik opened 6 months ago
Collecting some rough thoughts here for a first attempt.
for each doc in docs
    for each fragment in doc
        score query against fragment
        update max scoring fragment for doc
    format max scoring fragment
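A rough sketch of that loop, in Python purely for illustration. The `tokenize` and `score` helpers are hypothetical stand-ins (lowercased whitespace tokens, and a simple count of matching terms), not a proposal for the real scoring. The final formatting step is left out here.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, strip punctuation.
    tokens = (t.strip(PUNCTUATION).lower() for t in text.split())
    return [t for t in tokens if t]

def score(query_terms, fragment):
    # Assumed scoring: how many fragment tokens are query terms.
    counts = Counter(tokenize(fragment))
    return sum(counts[t] for t in query_terms)

def best_fragments(query, docs):
    """docs maps a doc id to its list of fragments.

    Returns a dict of doc id -> highest scoring fragment,
    or None when no fragment matches the query at all.
    """
    query_terms = set(tokenize(query))
    best = {}
    for doc_id, fragments in docs.items():
        best_fragment, best_score = None, 0
        for fragment in fragments:
            s = score(query_terms, fragment)
            if s > best_score:
                best_fragment, best_score = fragment, s
        best[doc_id] = best_fragment
    return best
```

Ties go to the earliest fragment in the doc, which seems like a reasonable default for a results page.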
What the heck is a fragment?
Good question.
Ideally it's a small enough snippet of document content that you can comfortably render it on your search engine results page.
This could be "sentences", maybe it's "paragraphs", or perhaps "sections".
Clearly this would need to be configurable, as it depends a lot on your document structure.
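One way to make this configurable is to treat a fragmenter as a plain function from text to a list of fragments. The sketch below assumes that shape; `by_sentence`, `by_paragraph`, and `fragments` are made-up names, and the sentence splitting is deliberately naive.

```python
import re

def by_paragraph(text):
    # Paragraphs are assumed to be separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def by_sentence(text):
    # Naive sentence split: break on whitespace that follows ., !, or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def fragments(text, fragmenter=by_sentence):
    # The fragmenter is the configuration point: pass any text -> list function.
    return fragmenter(text)
```

A "sections" fragmenter would depend on the document format (headings, markup), so it is not attempted here.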
Hopefully we can reuse a lot of existing pieces here.
For example, if we can get fragments for each doc, then we can index the fragments as if they were documents, query that new fragment index, and take the top result.
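To make the "fragments as documents" idea concrete, here is a toy inverted index keyed by (doc id, fragment number). It is only a sketch under simple assumptions: whitespace tokenization, term frequency as the score, and hypothetical names (`build_fragment_index`, `top_fragment`). A real version would reuse the existing indexing and scoring pieces instead.

```python
from collections import defaultdict

def build_fragment_index(docs):
    """docs maps a doc id to its list of fragment strings.

    Returns term -> {(doc_id, fragment_number): term frequency}.
    """
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, frags in docs.items():
        for n, frag in enumerate(frags):
            for term in frag.lower().split():
                index[term][(doc_id, n)] += 1
    return index

def top_fragment(index, query, doc_id):
    # Sum term frequencies per fragment of the given doc, then take the max.
    scores = defaultdict(int)
    for term in query.lower().split():
        for (d, n), tf in index.get(term, {}).items():
            if d == doc_id:
                scores[n] += tf
    if not scores:
        return None
    return max(scores, key=scores.get)
```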
Can we prepare some of this ahead of time? If we record the fragment boundaries at indexing time, perhaps we wouldn't need to create a new fragment index during the highlighting stage.
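A sketch of what recording boundaries could look like, assuming fragments are sentences and boundaries are stored as (start, end) character offsets next to the original text. The function names and the regex are illustrative assumptions only.

```python
import re

# Assumption: a fragment is a run of text ending at ., !, or ? (or end of input).
SENTENCE = re.compile(r"[^.!?\s][^.!?]*[.!?]?")

def sentence_boundaries(text):
    # Done once at indexing time: one (start, end) offset pair per fragment.
    return [(m.start(), m.end()) for m in SENTENCE.finditer(text)]

def best_fragment(text, boundaries, query_terms):
    # Done at highlighting time: slice stored offsets instead of re-splitting.
    best, best_score = None, 0
    for start, end in boundaries:
        fragment = text[start:end]
        tokens = re.findall(r"\w+", fragment.lower())
        s = sum(1 for t in tokens if t in query_terms)
        if s > best_score:
            best, best_score = fragment, s
    return best
```

Storing offsets instead of copies of the fragments keeps the extra index data small, at the cost of needing the original text available when highlighting.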
It's important to show users their query in the context of the resulting documents. Consider the below example where the terms cats, effect, and effects are bolded in the search results display:

The design space for a highlighter is reasonably large. Lucene has several implementations. I'm hoping we can get something basic without too much trouble.
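As a very basic starting point, bolding terms like the ones mentioned above could be done with a whole-word, case-insensitive regex. The `highlight` function and the `<b>` markers are assumptions for illustration. It takes the terms to bold as given and does no stemming or analysis of its own.

```python
import re

def highlight(fragment, terms, pre="<b>", post="</b>"):
    # Wrap each whole-word occurrence of any term in pre/post markers.
    if not terms:
        return fragment
    # sorted() only makes the pattern deterministic; \b handles whole words.
    alternation = "|".join(re.escape(t) for t in sorted(terms))
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: pre + m.group(0) + post, fragment)
```

The original casing of the matched text is preserved, so a query term `cats` would still bold `Cats` at the start of a sentence.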