andylokandy / simsearch-rs

A simple and lightweight fuzzy search engine that works in memory, searching for similar strings (a pun here).
MIT License
167 stars 25 forks

Scores seem wrong when some words are duplicated #15

Closed NathanGrimaud closed 5 months ago

NathanGrimaud commented 7 months ago

Hi! Thanks for this cool lib!

While working with some data I noticed that tokens are deduplicated before the scores are calculated.

In my case it returns weird results where a text with typos would return a higher score than the exact same text:

use simsearch::{SearchOptions, SimSearch};

let text = "foo foo";
let mut engine = SimSearch::new_with(SearchOptions::new().stop_whitespace(true).levenshtein(false));
engine.insert(0, text);
engine.insert(1, "foo foos");
engine.insert(2, "bar baz");
let results = engine.search(text);
// Expected: the exact match (id 0) ranks ahead of the typo variant (id 1).
assert!(results == &[0, 1]);

In this snippet, "foo foos" is given a higher score than "foo foo". What would be the tradeoff of removing the dedup line? Maybe it could be added as a SearchOptions setting.

It seems that inputs are also deduplicated within the insert_tokens function.
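
To make the effect easier to reason about, here is a toy model of how deduplicating tokens can flip the ranking. It is only a sketch and not simsearch's actual code: `toy_similarity`, `dedup_tokens` and `toy_score` are hypothetical names, the 0.9 prefix score is an arbitrary stand-in for a real string metric, and summing one score per distinct token pair is an assumption about how the ranking might be built.

```rust
use std::collections::BTreeSet;

/// Toy similarity (hypothetical): 1.0 for an exact match, 0.9 when one
/// token is a prefix of the other, 0.0 otherwise.
fn toy_similarity(a: &str, b: &str) -> f64 {
    if a == b {
        1.0
    } else if a.starts_with(b) || b.starts_with(a) {
        0.9
    } else {
        0.0
    }
}

/// Split on whitespace and collapse repeated tokens into a set.
fn dedup_tokens(text: &str) -> BTreeSet<&str> {
    text.split_whitespace().collect()
}

/// Assumed scoring rule: sum the similarity of every
/// (query token, entry token) pair after deduplication.
fn toy_score(query: &str, entry: &str) -> f64 {
    let query_tokens = dedup_tokens(query);
    let entry_tokens = dedup_tokens(entry);
    let mut total = 0.0;
    for q in &query_tokens {
        for e in &entry_tokens {
            total += toy_similarity(q, e);
        }
    }
    total
}
```

Under these assumptions the query "foo foo" collapses to the single token "foo". The entry "foo foo" also collapses to one token and scores 1.0, while "foo foos" keeps two distinct tokens and scores 1.0 + 0.9, so the variant with the typo ends up ranked first.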

andylokandy commented 7 months ago

good catch!