dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication, and entity resolution.
https://docs.dedupe.io
MIT License

use sqlite's fts5 for tf/idf index predicates #888

Open fgregg opened 3 years ago

fgregg commented 3 years ago

https://sqlite.org/fts5.html

fgregg commented 3 years ago

if i want to roll my own scoring: https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/

fgregg commented 3 years ago

got a spike going here: https://github.com/dedupeio/dedupe/tree/sqlite_index_predicate

this uses fts5, which comes with bm25 as the default scorer. unfortunately, bm25 is not a normalized score (it has no fixed range), so we can't have threshold-defined canopies.
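
To make the problem concrete, here is a minimal sketch (not the code on the spike branch) that builds an fts5 index with Python's `sqlite3` module and returns the raw `bm25()` value for each match. The table name `idx`, the column `body`, and the helper `bm25_scores` are made up for illustration, and it assumes the local SQLite build has fts5 compiled in.

```python
import sqlite3


def bm25_scores(docs, query):
    """Return (rowid, bm25 score) pairs for rows matching `query`.

    Hypothetical helper for illustration only; assumes the SQLite
    library behind Python's sqlite3 module was built with fts5.
    """
    con = sqlite3.connect(":memory:")
    # One-column full-text index; "idx" and "body" are illustrative names.
    con.execute("CREATE VIRTUAL TABLE idx USING fts5(body)")
    con.executemany("INSERT INTO idx(body) VALUES (?)", [(d,) for d in docs])
    rows = con.execute(
        "SELECT rowid, bm25(idx) FROM idx WHERE idx MATCH ? ORDER BY rowid",
        (query,),
    ).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    records = ["123 main street", "main road", "elm avenue"]
    for rowid, score in bm25_scores(records, "main"):
        # fts5 reports better matches as more negative numbers, and the
        # magnitude depends on corpus statistics, not on a fixed scale.
        print(rowid, score)
```

The scores only order the matches within one query. Their magnitude shifts with the corpus size and term frequencies, so a fixed cutoff such as "similarity above 0.8" has nothing stable to compare against.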

so, we'll need to use a custom scorer. fts4 exposes "matchinfo" which makes it pretty easy to do that (a few examples from peewee).
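
A rough sketch of what a matchinfo-based scorer could look like from Python. This is not dedupe's tf/idf and not one of the peewee examples: the function `overlap_score`, the table `idx`, and the scoring formula (idf-weighted share of query terms found in the row) are assumptions chosen only because the result always lands in [0, 1]. It assumes the SQLite build includes fts4.

```python
import math
import sqlite3
import struct


def overlap_score(raw):
    """Hypothetical normalized scorer over matchinfo(idx, 'pcnx').

    Returns the idf-weighted fraction of query phrases that occur in the
    current row, which is always between 0.0 and 1.0.
    """
    # matchinfo returns a blob of native-endian unsigned 32-bit integers.
    ints = struct.unpack("%dI" % (len(raw) // 4), raw)
    n_phrases, n_cols, n_rows = ints[0], ints[1], ints[2]
    matched = total = 0.0
    for p in range(n_phrases):
        hits_here = 0
        docs_with_hits = 0
        for c in range(n_cols):
            # 'x' yields three integers per (phrase, column) pair:
            # hits in this row, hits in all rows, rows with a hit.
            base = 3 + 3 * (c + p * n_cols)
            hits_here += ints[base]
            docs_with_hits = max(docs_with_hits, ints[base + 2])
        idf = math.log(1 + n_rows / docs_with_hits) if docs_with_hits else 0.0
        total += idf
        if hits_here:
            matched += idf
    return matched / total if total else 0.0


def scored_matches(docs, query):
    """Index `docs` in an in-memory fts4 table and score rows matching `query`."""
    con = sqlite3.connect(":memory:")
    con.create_function("overlap_score", 1, overlap_score)
    con.execute("CREATE VIRTUAL TABLE idx USING fts4(body)")
    con.executemany("INSERT INTO idx(body) VALUES (?)", [(d,) for d in docs])
    rows = con.execute(
        "SELECT rowid, overlap_score(matchinfo(idx, 'pcnx')) "
        "FROM idx WHERE idx MATCH ? ORDER BY rowid",
        (query,),
    ).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    records = ["123 main street", "main road", "elm avenue"]
    for rowid, score in scored_matches(records, "main OR street"):
        print(rowid, round(score, 3))
```

Because the score is bounded, a canopy could keep only rows above a fixed threshold. A real tf/idf cosine would also need per-row term counts and document lengths, which matchinfo can supply through other format characters.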

It's also possible to write custom scorers for fts5, but i couldn't find any third-party examples. Here's the bm25 "auxiliary function", which could serve as a prototype.

fgregg commented 3 years ago

fts5 matchinfo implementation: https://github.com/sqlite/sqlite/blob/master/ext/fts5/fts5_test_mi.c