GregoryConrad / mimir

⚡ Supercharged Flutter/Dart Database
https://pub.dev/packages/mimir
MIT License

Possible to use showMatchesPosition / get match positions with results? #326

Open tgrushka opened 4 months ago

tgrushka commented 4 months ago

Hi, thanks for the library! According to Meilisearch docs, one should be able to add this attribute: showMatchesPosition, to get a list of the actual terms that matched (since they are fuzzy, they may be different from the query terms): https://www.meilisearch.com/docs/reference/api/search#show-matches-position

Would this be possible/difficult to implement? I looked at the raw (index as MimirIndexImpl).instance.milli.searchDocuments() results and didn't see anything different, just my raw documents returned, so Mimir clearly isn't discarding anything, just maybe not passing in all the parameters... and then you'd need an object wrapper to deserialize results.
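For illustration, here is a minimal sketch of what a deserialized match position could look like and how it maps back to the matched text. It assumes the shape described in the linked Meilisearch docs, where each match is a `start` and `length` pair measured in bytes. `MatchPosition` and `matched_terms` are hypothetical names for this sketch, not part of Mimir or milli.

```rust
/// One match location, mirroring the `start`/`length` pair described in the
/// Meilisearch docs for `_matchesPosition` (assumed to be byte offsets).
/// Hypothetical type: neither Mimir nor milli exposes it under this name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MatchPosition {
    pub start: usize,
    pub length: usize,
}

/// Returns the substrings of `text` covered by `positions`.
/// Ranges that fall out of bounds or split a UTF-8 character are skipped
/// instead of panicking.
pub fn matched_terms<'a>(text: &'a str, positions: &[MatchPosition]) -> Vec<&'a str> {
    positions
        .iter()
        .filter_map(|p| {
            // Guard against overflow before slicing.
            let end = p.start.checked_add(p.length)?;
            // `str::get` returns `None` for invalid byte ranges.
            text.get(p.start..end)
        })
        .collect()
}
```

With this, a fuzzy query such as `quik` that matched the field value `The quick brown fox` at `start: 4, length: 5` would yield `quick`, which is the term that actually matched rather than the query term.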

I would be interested in helping implement this.

Also, looking at #10 and #231, I'm not sure of your status as to wanting to rewrite for the redb backend or not. I'm really getting into Rust big time and might be interested in helping, but it might be a huge can of worms as I have no idea what that would entail. How much of an expert do you need to be to be helpful with that, or are you still considering that transition? I am a huge fan of open-source, free software, not "freemium" software that might go to some kind of Business Source License like Terraform did. (I'm thinking of the Meilisearch product here, not your package, as that's part of what piques my interest in switching to a backend that would no longer be dependent on the commercial version.)

GregoryConrad commented 4 months ago

Hey! 👋

> one should be able to add this attribute: showMatchesPosition, to get a list of the actual terms that matched (since they are fuzzy, they may be different from the query terms)

If this was introduced in Meilisearch v1.2 or earlier, then yes, it should be possible to add. The version of milli (the underlying engine behind Meilisearch) bundled with Mimir is the one from Meilisearch v1.2.

> I would be interested in helping implement this.

Go for it! Would be happy to review a PR. (Assuming the field was added in v1.2 or earlier; otherwise, things get a bit trickier due to backwards compatibility concerns.)

> but it might be a huge can of worms

...yup it's a decently big can of worms. For context:

While LMDB is very performant and battle-tested, it is also a headache. It has numerous tradeoffs in the name of performance, the biggest of them being memory-mapped files:

Another big limitation is LMDB's somewhat complicated concurrency model, which (without a lot of effort) prevents the ability to use macOS' app sandbox (#101).

Thus, the idea here is simple: get rid of LMDB in favor of a library that supports all of the above while still being reasonably performant. After looking into a lot of options, redb (written entirely in Rust) seems like the safest option. I added WASM/WASI support to it myself, so I know that is covered 😉. And it doesn't support access across multiple processes, instead just opting for regular Mutexes. So problem solved! It also has been used in production for other applications (AFAIK), so should be safe enough to use.

So, the question arises: how do we switch to redb? We have two options:

  1. Refactor milli to use redb instead of heed. First of all, the Meilisearch team would likely not suggest such a change since redb hasn't been battle-tested as much as LMDB and has slightly worse performance. Second of all, this would be a huge amount of effort since many parts of milli have ties to the heed codebase.
  2. Add an opt-in backend to heed to use redb instead of LMDB. IIRC the Meilisearch team said they would consider this change under an opt-in feature toggle, since it wouldn't impact any of the work they're doing. It would just let other applications integrating with heed have a safe backend that can compile to more platforms with ease.

> How much of an expert do you need to be to be helpful with that

Eh. I think with some basic Rust knowledge you could pick it up as you go. The idea here is that you just need to replicate the heed API by adapting all the calls to the underlying redb implementation instead of going over FFI to LMDB. That could be done via a PR over at heed. Applicable issue: https://github.com/meilisearch/heed/issues/162

If they end up not accepting a PR for that, we could always just maintain a fork of heed/milli for use in Mimir.
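To make the "replicate the API, swap the backend" idea concrete, here is a rough sketch of the general pattern. Everything in it is hypothetical: `KvBackend`, `MemoryBackend`, and `Database` are illustrative names, and none of this is heed's or redb's real API. The in-memory map simply stands in for whichever engine sits underneath.

```rust
use std::collections::BTreeMap;

/// Hypothetical storage interface. This is NOT heed's or redb's real API;
/// it only illustrates hiding the engine behind one set of calls.
pub trait KvBackend {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn put(&mut self, key: &[u8], value: &[u8]);
    fn delete(&mut self, key: &[u8]) -> bool;
}

/// In-memory stand-in for a real engine (an LMDB- or redb-backed
/// implementation would take its place).
#[derive(Default)]
pub struct MemoryBackend {
    map: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl KvBackend for MemoryBackend {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }

    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.map.insert(key.to_vec(), value.to_vec());
    }

    fn delete(&mut self, key: &[u8]) -> bool {
        // Reports whether the key existed before removal.
        self.map.remove(key).is_some()
    }
}

/// Facade that callers use. It depends only on the trait, so swapping
/// the backend does not change any calling code.
pub struct Database<B: KvBackend> {
    backend: B,
}

impl<B: KvBackend> Database<B> {
    pub fn new(backend: B) -> Self {
        Self { backend }
    }

    pub fn put_str(&mut self, key: &str, value: &str) {
        self.backend.put(key.as_bytes(), value.as_bytes());
    }

    pub fn get_str(&self, key: &str) -> Option<String> {
        self.backend
            .get(key.as_bytes())
            .and_then(|bytes| String::from_utf8(bytes).ok())
    }

    pub fn delete(&mut self, key: &str) -> bool {
        self.backend.delete(key.as_bytes())
    }
}
```

The real work would presumably be much larger than this (transactions, typed codecs, iteration, error handling), but the shape is the same: code written against the facade, such as milli, would not need to know which engine is in use.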

tgrushka commented 4 months ago

Looks like they have these structs:

pub struct MatchingWords {
    word_interner: DedupInterner<String>,
    phrase_interner: DedupInterner<Phrase>,
    phrases: Vec<LocatedMatchingPhrase>,
    words: Vec<LocatedMatchingWords>,
}

struct ScoreWithRatioResult {
    matching_words: MatchingWords,
    candidates: RoaringBitmap,
    document_scores: Vec<(u32, ScoreWithRatio)>,
    degraded: bool,
    used_negative_operator: bool,
}

In version 1.2.0:

milli/src/search/new/matches/matching_words.rs

and milli/src/search/hybrid.rs

Are you using this "hybrid" search thing/feature? I really have never heard of this search software before, so have no idea how it works.

GregoryConrad commented 4 months ago

If I’m not mistaken, the hybrid search incorporates semantic search, which was added as experimental in 1.2. But to answer your question: no, mimir doesn’t use semantic search.

But ya, if matching words is a thing in 1.2, feel free to open a PR to expose the API on the Dart side.