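# Assumes mismo's usual ibis setup: `_` is ibis's deferred expression
# and `Table` is an ibis table.
import mismo
from ibis import _
from ibis.expr.types import Table

t1: Table
t2: Table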
t1 = t1.mutate(terms=mismo.clean.ngrams(_.street1))
t2 = t2.mutate(terms=mismo.clean.ngrams(_.street1))
# Open question: should this accept *args, or does that call for a slightly different API?
# Open question: could it accept a term-count table directly?
# This is a transform step, and I want the API to be immutable (no
# in-place mutation), so the transformer has to be fitted, i.e. given
# its weights, at construction time.
tfidf = mismo.text.TfidfTransformer(t1.terms, t2.terms, use_idf=True)
t1 = t1.mutate(weighted_terms=tfidf(t1.terms))
t2 = t2.mutate(weighted_terms=tfidf(t2.terms))
blocked = mismo.block.block_one(t1, t2, ....)
# Not sure whether weighted_terms should be a map<string, float>,
# an array<struct<term: string, weight: float>>,
# or a Table.
similarity = mismo.sparse_cosine(blocked.weighted_terms_l, blocked.weighted_terms_r)
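To make the two design points above concrete (weights fixed at construction, and a sparse cosine over per-record term weights), here is a minimal pure-Python sketch. It is not mismo's API: `TfidfWeights` and this standalone `sparse_cosine` are hypothetical names, the map<string, float> representation is just one of the three options listed above, and the smoothed IDF formula is borrowed from scikit-learn's default rather than taken from this proposal.

```python
from __future__ import annotations

import math
from collections import Counter
from typing import Iterable, Mapping, Sequence


class TfidfWeights:
    """Hypothetical stand-in for the proposed transformer.

    All fitting happens in __init__, so calling the instance never
    mutates it.
    """

    def __init__(self, *corpora: Iterable[Sequence[str]], use_idf: bool = True) -> None:
        n_docs = 0
        doc_freq: Counter[str] = Counter()
        for corpus in corpora:
            for terms in corpus:
                n_docs += 1
                doc_freq.update(set(terms))  # count each term once per record
        self._n_docs = n_docs
        self._use_idf = use_idf
        # Assumption: smoothed IDF, ln((1 + n) / (1 + df)) + 1.
        self._idf = {
            term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()
        }

    def __call__(self, terms: Sequence[str]) -> dict[str, float]:
        counts = Counter(terms)
        if not self._use_idf:
            return {term: float(count) for term, count in counts.items()}
        # Terms never seen during fitting are treated as df == 0.
        unseen = math.log(1 + self._n_docs) + 1.0
        return {term: count * self._idf.get(term, unseen) for term, count in counts.items()}


def sparse_cosine(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Cosine similarity of two sparse term -> weight mappings."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller mapping
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with an empty vector as 0
    return dot / (norm_a * norm_b)


# Toy data: each record is already a list of terms.
left = [["main", "st"], ["oak", "ave"]]
right = [["main", "street"], ["oak", "ave"]]

tfidf = TfidfWeights(left, right)
same = sparse_cosine(tfidf(left[1]), tfidf(right[1]))
partial = sparse_cosine(tfidf(left[0]), tfidf(right[0]))
```

Fitting on both term columns at once keeps the IDF weights comparable across the two tables, which is what passing `t1.terms` and `t2.terms` together to the constructor is meant to achieve.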
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer
sample usage: