Returning the match indexes along with results?

mfkp commented 2 years ago

Hello, and first of all, thanks for publishing this. It looks like a very interesting gem.

I'm wondering if it would be possible to return the matched text (or index of matched characters) along with the results? This would be useful in cases where a fuzzy match returns results, and I would want to highlight the matching text (or just show a snippet of text around the matching text for context) in the results.

In the readme it says:

"You may have noticed that search method returns only documents ids. This is by design. The documents themselves are not stored in the index."

So maybe this is not possible, but I figured I would ask anyway:

From your example:

brother = {
  imdb_id: "tt0118767",
  type: "/crime/Russia",
  title: "Brother",
  description: "An ex-soldier with a personal honor code enters the family crime business in St. Petersburg, Russia.",
  duration: 99,
  rating: 7.9,
  release_date: Date.parse("December 12, 1997")
}

Right now, this is what the search returns:

index.search('bersonal coder', fuzzy_distance: 1)
=> ["tt0118767"]

It would be great to return something like:

index.search('bersonal coder', fuzzy_distance: 1)
=> 
[{
  "tt0118767": {
    match_ranges: [[21, 28], [36, 39]]
  }
}]

That way I could display the results in my search listing like:

An ex-soldier with a **personal** honor **code** enters the family crime business in St. Petersburg, Russia.
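As a minimal sketch of that display step, assuming the ranges are inclusive character offsets and do not overlap (as in the example above), a helper along these lines could wrap each match in markers. The name `highlight` and its `open`/`close` parameters are made up for illustration and are not part of the gem; HTML escaping is left out.

```ruby
# Hypothetical helper, not part of Tantiny: wraps each matched range in markers.
# Assumes inclusive [first, last] character offsets and non-overlapping ranges.
def highlight(text, ranges, open: "<mark>", close: "</mark>")
  # Insert from the last range to the first so earlier offsets stay valid.
  ranges.sort.reverse.reduce(text.dup) do |out, (first, last)|
    out.insert(last + 1, close).insert(first, open)
  end
end

description = "An ex-soldier with a personal honor code enters the family " \
              "crime business in St. Petersburg, Russia."

puts highlight(description, [[21, 28], [36, 39]])
```

Working backwards through the ranges avoids having to shift later offsets after each insertion.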

I haven't dug into the source yet to see if this is possible, but I figure you'd know the limitations better and might be able to provide some input on whether this is feasible.

mfkp commented 2 years ago

I'm guessing somewhere around here, we would need to have the ability to add STORED as an option on text fields:

https://github.com/baygeldin/tantiny/blob/0320b4145d45799bf8340af178d989954d6818f2/src/index.rs#L70-L71

https://github.com/quickwit-oss/tantivy/blob/main/src/schema/text_options.rs#L27-L32

mfkp commented 2 years ago

Here's an example I found of using highlighted snippets:

https://github.com/quickwit-oss/tantivy/blob/eca6628b3cb6dbfdc75a441889367aa1fd58c2e1/examples/snippet.rs

baygeldin commented 2 years ago

I haven't thought about this, but it's a valid use case. It's definitely feasible, but it would significantly change the API, so the correct approach requires more thought.

First of all, as you correctly said, for this to work the text fields should be stored. I don't want it to be the default behavior because storing fields in the index takes space and reading stored fields is not free (as the Tantivy documentation puts it, "Reading the stored fields of a document is relatively slow. (100 microsecs)"). So this should be opt-in, maybe via a stored option for text fields.

As far as performance goes, I would also make calculating ranges optional (e.g. index.search(query, with_match_ranges: true)).
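To give a rough idea of what calculating ranges involves for a single text field, here is a pure-Ruby sketch. It is not how Tantivy does it: the helper names `levenshtein` and `match_ranges` are made up for illustration, and it assumes simple `\w+` tokenization, case-insensitive comparison, and inclusive character offsets.

```ruby
# Classic dynamic-programming Levenshtein distance between two strings.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: returns inclusive [first, last] offsets of every word
# in `text` that is within `fuzzy_distance` edits of some query term.
def match_ranges(text, query, fuzzy_distance: 0)
  terms = query.downcase.scan(/\w+/)
  ranges = []
  text.scan(/\w+/) do |word|
    start = Regexp.last_match.begin(0) # character offset of this word
    if terms.any? { |term| levenshtein(term, word.downcase) <= fuzzy_distance }
      ranges << [start, start + word.length - 1]
    end
  end
  ranges
end

description = "An ex-soldier with a personal honor code enters the family " \
              "crime business in St. Petersburg, Russia."

p match_ranges(description, "bersonal coder", fuzzy_distance: 1)
# prints [[21, 28], [36, 39]]
```

Every word of the field is compared against every query term, which is why this work is better skipped unless the caller asks for it.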

Also, note that in your example there is only one text field, but there might be more, so we need to return match_ranges for every stored text field:

index.search('bersonal coder', fuzzy_distance: 1)
=> 
[
  {
    id: "tt0118767",
    match_ranges: {
      text_field_1: [[21, 28]],
      text_field_2: [[36, 39]],
    }
  }
]

And speaking of the search method: since we need to return additional metadata with every document, it would be best to create a new entity (e.g. Tantiny::SearchResult) that would contain this metadata along with the document ids, instead of returning an arbitrary hash.
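A minimal sketch of what such an entity might look like, assuming it only needs to carry the document id and the per-field ranges. Nothing here exists in the gem today; the `ranges_for` helper in particular is a hypothetical convenience.

```ruby
module Tantiny
  # Hypothetical value object, not part of the current gem.
  SearchResult = Struct.new(:id, :match_ranges, keyword_init: true) do
    # Ranges for one stored text field, or an empty array if there are none.
    def ranges_for(field)
      (match_ranges || {}).fetch(field.to_sym, [])
    end
  end
end

result = Tantiny::SearchResult.new(
  id: "tt0118767",
  match_ranges: { description: [[21, 28], [36, 39]] }
)

p result.id                       # prints "tt0118767"
p result.ranges_for(:description) # prints [[21, 28], [36, 39]]
p result.ranges_for(:title)       # prints []
```

A dedicated object leaves room to add more metadata later without changing the shape of what search returns.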

That being said, modifying the source in your fork for your specific use case shouldn't be very difficult. You would need to add the STORED index option and modify the search method in both Rust and Ruby.
