cozydev-pink / protosearch

prototype search library in pure scala
https://cozydev-pink.github.io/protosearch/
Apache License 2.0

Add a highlighter #205

Open valencik opened 6 months ago

valencik commented 6 months ago

It's important to show users their query in the context of the resulting documents. Consider the example below, where the terms cats, effect, and effects are bolded in the search results:

[Screenshot, 2024-03-30: DuckDuckGo search results for "cats-effect"]

The design space for a highlighter is reasonably large. Lucene has several implementations. I'm hoping we can get something basic without too much trouble.

valencik commented 6 months ago

Collecting some rough thoughts here for a first attempt.

for each doc in docs
  for each fragment in doc
    score query against fragment
    update max scoring fragment for doc
  format max scoring fragment
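
The loop above could be sketched roughly as follows in plain Scala (2.13+). Nothing here is protosearch API: `HighlightSketch`, `tokenize`, `score`, `bestFragment`, `format`, and `highlight` are hypothetical names, the scorer is a naive term count standing in for real query scoring, and the `**` markers are just one possible output format. The fragmenter is passed in as a function because how to split a doc is still an open question.

```scala
import scala.util.matching.Regex

object HighlightSketch {
  // Assumption: a naive tokenizer (lowercase, split on anything that is not a-z or 0-9).
  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toList

  // Assumption: score = number of fragment tokens that are query terms.
  def score(queryTerms: Set[String], fragment: String): Int =
    tokenize(fragment).count(queryTerms.contains)

  // Keep the max scoring fragment; None if no fragment matches at all.
  def bestFragment(queryTerms: Set[String], fragments: List[String]): Option[String] =
    fragments
      .map(f => (f, score(queryTerms, f)))
      .filter(_._2 > 0)
      .maxByOption(_._2)
      .map(_._1)

  // Wrap every word that matches a query term in ** markers.
  def format(queryTerms: Set[String], fragment: String): String =
    "[A-Za-z0-9]+".r.replaceAllIn(
      fragment,
      (m: Regex.Match) =>
        if (queryTerms.contains(m.matched.toLowerCase))
          Regex.quoteReplacement(s"**${m.matched}**")
        else Regex.quoteReplacement(m.matched)
    )

  // For each doc: split into fragments, pick the best one, format it.
  def highlight(
      queryTerms: Set[String],
      docs: List[String],
      fragmenter: String => List[String]
  ): List[Option[String]] =
    docs.map(doc => bestFragment(queryTerms, fragmenter(doc)).map(format(queryTerms, _)))
}
```

This only does exact term matching, so a query for `effect` would not bold `effects` the way the screenshot does. Getting that would presumably mean running fragments through the same analysis as the index.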

What the heck is a fragment? Good question. Ideally it's a small enough snippet of document content that you can comfortably render it on your search engine results page. This could be "sentences", maybe it's "paragraphs", or perhaps "sections". Clearly this would need to be configurable, as it depends a lot on your document structure.
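
One way to keep this configurable is to treat a fragmenter as a plain function from document text to snippets. The sketch below is only an illustration under that assumption: `Fragmenters`, `Fragmenter`, `paragraphs`, `sentences`, and `wordWindows` are hypothetical names, and the regex-based splitting is deliberately naive (it will mis-split abbreviations like "e.g. this").

```scala
object Fragmenters {
  // A fragmenter turns document text into candidate snippets.
  type Fragmenter = String => List[String]

  // Paragraphs: split on blank lines.
  val paragraphs: Fragmenter =
    _.split("\\n\\s*\\n").map(_.trim).filter(_.nonEmpty).toList

  // Sentences: naive split on whitespace that follows '.', '!' or '?'.
  val sentences: Fragmenter =
    _.split("(?<=[.!?])\\s+").map(_.trim).filter(_.nonEmpty).toList

  // Fallback for unstructured text: fixed windows of `size` words (size must be > 0).
  def wordWindows(size: Int): Fragmenter =
    _.split("\\s+").filter(_.nonEmpty).grouped(size).map(_.mkString(" ")).toList
}
```

A "sections" fragmenter would likely need to know about the document format (Markdown headings, HTML tags, and so on), which is another reason to leave the choice to the caller.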

Hopefully we can reuse a lot of existing pieces here. For example, if we can get fragments for each doc then we can index the fragments as if they were documents, query that new fragment index, and take the top result. Can we prepare some of this ahead of time? If we record the fragment boundaries at indexing time, perhaps we wouldn't need to create a new fragment index during the highlighting stage.
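
To make the "record boundaries at indexing time" idea a bit more concrete, here is a minimal sketch that stores character offsets per document and slices fragments back out later. It assumes paragraph fragments and that the original text is still available at highlighting time; `FragmentBoundaries`, `Span`, `paragraphSpans`, and `fragments` are hypothetical names, not existing protosearch code.

```scala
object FragmentBoundaries {
  // Half-open character range [start, end) into the original document text.
  final case class Span(start: Int, end: Int)

  // Would run once at indexing time: paragraph spans, separated by blank lines.
  def paragraphSpans(text: String): Vector[Span] = {
    val separator = "\\n\\s*\\n".r
    val cuts = separator.findAllMatchIn(text).map(m => (m.start, m.end)).toVector
    val starts = 0 +: cuts.map(_._2)            // a fragment starts after each separator
    val ends = cuts.map(_._1) :+ text.length    // and ends where the next separator begins
    starts.zip(ends).collect { case (s, e) if s < e => Span(s, e) }
  }

  // Would run at highlighting time: recover fragments without re-splitting the text.
  def fragments(text: String, spans: Vector[Span]): Vector[String] =
    spans.map(s => text.substring(s.start, s.end))
}
```

Storing spans is cheap compared to a second index, but it only avoids the splitting work. Scoring the query against each fragment would still happen at highlighting time unless positions from the main index can be mapped onto these spans.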