NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0

Add TF-IDF comparer based on sklearn #31

Open NickCrews opened 6 months ago

NickCrews commented 6 months ago

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

Sample usage (proposed API, not yet implemented):


import mismo
from ibis import _
from ibis.expr.types import Table

t1: Table
t2: Table

t1 = t1.mutate(terms=mismo.clean.ngrams(_.street1))
t2 = t2.mutate(terms=mismo.clean.ngrams(_.street1))

# Open questions: should this accept *args, or use a slightly different API?
# Should it accept a term-count table directly?
# This is a transform step, and I want the API to be immutable (no in-place
# mutation), so the transformer has to be fit with its weights at creation.
tfidf = mismo.text.TfidfTransformer(t1.terms, t2.terms, use_idf=True)
t1 = t1.mutate(weighted_terms=tfidf(t1.terms))
t2 = t2.mutate(weighted_terms=tfidf(t2.terms))
blocked = mismo.block.block_one(t1, t2, ....)
# Not sure whether the weighted terms should be a map<string, float>,
# an array<struct<term: string, weight: float>>,
# or a Table.
similarity = mismo.sparse_cosine(blocked.weighted_terms_l, blocked.weighted_terms_r)
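
To make the intended semantics concrete, here is a rough pure-Python sketch of the two pieces above, assuming the map<string, float> representation for weighted terms. Everything in it is hypothetical and for illustration only: `ngrams` is a stand-in for `mismo.clean.ngrams`, `TfidfWeights` is a stand-in for the proposed `mismo.text.TfidfTransformer`, and `sparse_cosine` is a stand-in for `mismo.sparse_cosine`. The IDF formula is the smoothed one that sklearn's `TfidfTransformer` uses by default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. A real implementation would presumably be expressed in Ibis/SQL rather than Python loops.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Mapping


def ngrams(s: str, n: int = 3) -> List[str]:
    """Hypothetical stand-in for mismo.clean.ngrams: character n-grams."""
    return [s[i : i + n] for i in range(len(s) - n + 1)]


class TfidfWeights:
    """Hypothetical stand-in for the proposed transformer.

    The IDF weights are computed once in __init__ and never mutated,
    so calling the object is a pure transform.
    """

    def __init__(self, *term_columns: Iterable[Iterable[str]], use_idf: bool = True):
        # Treat every row of every input column as one "document".
        docs = [set(terms) for column in term_columns for terms in column]
        n_docs = len(docs)
        doc_freq = Counter(term for doc in docs for term in doc)
        if use_idf:
            # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
            self._idf = {
                t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()
            }
            # Weight for terms never seen during fitting (df = 0).
            self._default = math.log(1 + n_docs) + 1
        else:
            self._idf = {}
            self._default = 1.0

    def __call__(self, terms: Iterable[str]) -> Dict[str, float]:
        counts = Counter(terms)
        weights = {t: c * self._idf.get(t, self._default) for t, c in counts.items()}
        # L2-normalize so that cosine similarity reduces to a dot product.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {t: w / norm for t, w in weights.items()} if norm else {}


def sparse_cosine(left: Mapping[str, float], right: Mapping[str, float]) -> float:
    """Cosine similarity of two sparse term -> weight maps."""
    if len(left) > len(right):
        left, right = right, left  # iterate over the smaller map
    dot = sum(w * right.get(t, 0.0) for t, w in left.items())
    norm_l = math.sqrt(sum(w * w for w in left.values()))
    norm_r = math.sqrt(sum(w * w for w in right.values()))
    if norm_l == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_l * norm_r)


# Made-up example data, mirroring the t1/t2 flow above.
t1_terms = [ngrams("123 main st"), ngrams("456 oak ave")]
t2_terms = [ngrams("123 main street"), ngrams("789 pine rd")]
tfidf = TfidfWeights(t1_terms, t2_terms, use_idf=True)
weighted_l = [tfidf(terms) for terms in t1_terms]
weighted_r = [tfidf(terms) for terms in t2_terms]
similar = sparse_cosine(weighted_l[0], weighted_r[0])
dissimilar = sparse_cosine(weighted_l[0], weighted_r[1])
```

With the map representation, the cosine only needs to visit the keys of the smaller map, which is the main argument for it over a dense vector. The array<struct<term, weight>> form would carry the same information but needs a join or unnest to line up terms.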