Hi! Thanks for this cool lib!

While working with some data I noticed that tokens are deduplicated before the scores are calculated. In my case this returns weird results, where a text with a typo ends up with a higher score than the exact same text as the query:
```rust
use simsearch::{SearchOptions, SimSearch};

let text = "foo foo";
let mut engine = SimSearch::new_with(SearchOptions::new().stop_whitespace(true).levenshtein(false));
engine.insert(0, text);        // exact same text as the query
engine.insert(1, "foo foos");  // same text, but with a typo
engine.insert(2, "bar baz");
let results = engine.search(text);
// Expected: the exact match ranks first (this currently fails, see below).
assert!(results == &[0, 1]);
```
Within this snippet, `foo foos` is given a higher score than `foo foo`.
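To illustrate what I mean, here is a standalone sketch of the dedup effect (not the library's actual code): once repeated tokens are collapsed, `foo foo` and `foo` become indistinguishable, while `foo foos` still contributes two tokens.

```rust
use std::collections::HashSet;

/// Sketch of the dedup effect, not the library's implementation:
/// keeps only the first occurrence of each token.
fn dedup_tokens<'a>(tokens: &[&'a str]) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &tok in tokens {
        if seen.insert(tok) {
            out.push(tok);
        }
    }
    out
}

fn main() {
    // "foo foo" collapses to a single token, while "foo foos" keeps two,
    // so the repetition in the exact match no longer counts towards its score.
    assert_eq!(dedup_tokens(&["foo", "foo"]), vec!["foo"]);
    assert_eq!(dedup_tokens(&["foo", "foos"]), vec!["foo", "foos"]);
}
```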
What would be the tradeoff of removing the dedup line? Maybe it could be exposed as a `SearchOptions` flag instead. It also seems that inputs are deduplicated within the `insert_tokens` function, so the same flag would presumably have to apply there too.
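Just to make the proposal concrete, here is a rough sketch of the kind of builder flag I have in mind (the `dedup_tokens` name and field are hypothetical, not part of the current `SearchOptions` API):

```rust
/// Hypothetical sketch only, not the current simsearch API.
pub struct SearchOptions {
    // ...existing options (case_sensitive, stop_whitespace, levenshtein, ...)
    /// Proposed: when false, keep repeated tokens for scoring,
    /// both at insert time and at search time.
    dedup_tokens: bool,
}

impl SearchOptions {
    /// Builder-style setter, matching the existing option methods.
    pub fn dedup_tokens(mut self, dedup: bool) -> Self {
        self.dedup_tokens = dedup;
        self
    }
}
```

With something like that in place, the snippet above could opt out via `.dedup_tokens(false)` and keep the exact match ranked first, while the default would preserve today's behaviour.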