J535D165 / recordlinkage

A powerful and modular toolkit for record linkage and duplicate detection in Python
http://recordlinkage.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

update of the introduction #185

Closed karpanGit closed 1 year ago

karpanGit commented 1 year ago

I feel the logistic regression and the ECM example code both need to be updated to reflect the latest API.

I managed to get the logistic regression example working with the following code:

import numpy as np
import pandas as pd
import recordlinkage

df_a = pd.DataFrame({'name': ['Panos', 'George', 'Maria', 'Panos'], 'age': [10, 20, 30, 40]}, index=['a1', 'a2', 'a3', 'a4'])
df_b = pd.DataFrame({'name': ['Panoz', 'Georgi', 'Maria', 'Panos'], 'age': [11, 22, 33, 40]}, index=['b1', 'b2', 'b3', 'b4'])

indexer = recordlinkage.Index()
# indexer.block('name')
indexer.full()
# uniqueness of indexes is ensured
candidate_links = indexer.index(df_a, df_b)

compare = recordlinkage.Compare()
compare.string('name', 'name', method='jarowinkler', threshold=0.85)
compare.numeric('age', 'age')
compare_vectors = compare.compute(candidate_links, df_a, df_b)

# fit a logistic regression classifier
true_linkage = pd.Series(
    np.where((compare_vectors[0] >= 1.) & (compare_vectors[1] <= 0.5), 'same', 'different'),
    index=compare_vectors.index)
logrg = recordlinkage.LogisticRegressionClassifier()
logrg.fit(compare_vectors, true_linkage[true_linkage == 'same'].index)
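To make the string comparison step above less opaque, here is a minimal pure-Python sketch of a textbook Jaro-Winkler similarity followed by a 0.85 cut-off. It is not the library's implementation, and results may differ slightly between implementations. The function names `jaro`, `jaro_winkler` and `thresholded` are hypothetical helpers for illustration, and the assumption that scores at or above the threshold become 1 (and everything else 0) is mine.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1] (illustrative, not the library code)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1 - sim)


def thresholded(s1: str, s2: str, threshold: float = 0.85) -> float:
    """Assumed threshold behaviour: 1.0 at or above the cut-off, else 0.0."""
    return 1.0 if jaro_winkler(s1, s2) >= threshold else 0.0


print(round(jaro_winkler('Panos', 'Panoz'), 2))    # 0.92
print(round(jaro_winkler('George', 'Georgi'), 2))  # 0.93
print(thresholded('Panos', 'Maria'))               # 0.0
```

Under these assumptions, 'Panos'/'Panoz' and 'George'/'Georgi' both clear the 0.85 threshold, so the name column of the comparison vectors would hold 1 for those pairs.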

For the ECM example, the class BernoulliEMCClassifier does not seem to exist. Did you mean the ECMClassifier class?