J535D165 / recordlinkage

A powerful and modular toolkit for record linkage and duplicate detection in Python
http://recordlinkage.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

K-Fold Cross validation in Record Linkage #120

Open mayerantoine opened 4 years ago

mayerantoine commented 4 years ago

The library documentation does not provide much guidance on train/test splitting and cross-validation. Below is an implementation using the KFold object from scikit-learn.

How is the blocking strategy accounted for in the K-Fold split? How should the f-score be calculated for X_test? See below; please test and advise.

import recordlinkage
from recordlinkage.datasets import load_febrl2
import pandas as pd
from recordlinkage.index import Block
import warnings
import numpy
warnings.filterwarnings("ignore")

dfA, df_true_links = load_febrl2(return_links=True)
# give the true-link MultiIndex named levels so it lines up with the candidate pairs
df_true_links = df_true_links.to_frame(index=False)
df_true_links.columns = ['rec_id_1', 'rec_id_2']
df_true_links.set_index(['rec_id_1', 'rec_id_2'], inplace=True)

print("dataset size:", len(dfA))
# Indexation step
indexer = recordlinkage.Index()
indexer.add(Block(['given_name']))
indexer.add(Block(['surname']))
indexer.add(Block(['date_of_birth']))
candidate_links = indexer.index(dfA)

# Comparison step
compare_cl = recordlinkage.Compare()

compare_cl.string('given_name', 'given_name', method='jarowinkler', threshold=0.85,label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('suburb', 'suburb', label='suburb')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

features = compare_cl.compute(candidate_links, dfA)
print("comparison vector size:", len(features))

# Classification step
# 10-fold cross validation
from sklearn.model_selection import KFold
kf = KFold(n_splits=10)
fscore = []
for train_index, test_index in kf.split(features):
    X_train = features.iloc[train_index]
    X_test = features.iloc[test_index]
    # ground truth per fold: candidate pairs that are known true links
    Y_train = X_train.index.intersection(df_true_links.index)
    Y_test = X_test.index.intersection(df_true_links.index)
    # train the classifier
    nb = recordlinkage.NaiveBayesClassifier()
    nb.fit(X_train, Y_train)
    # predict matches for the held-out fold
    result_nb = nb.predict(X_test)
    # f-score against the gold standard -- is this the right way to score X_test?
    fscore.append(recordlinkage.fscore(Y_test, result_nb))
print("training data fold size:", len(X_train))
print("test data fold size:", len(X_test))
print("10-fold scores:", fscore)
print("average f-score:", numpy.mean(fscore))

r = nb.predict(features)
score = recordlinkage.fscore(df_true_links.index, r)
print("overall score:", score)

Why is the score on the complete dataset different from the K-Fold average f-score? In the K-Fold loop, should we make predictions on the complete dataset?
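The two numbers answer different questions: the per-fold scores are computed on pairs the classifier never saw during training, while the final score comes from a model trained on one fold's training split and evaluated partly on its own training pairs. A common alternative is to aggregate out-of-fold predictions and compute a single score. A minimal sketch of that idea, using scikit-learn's GaussianNB on synthetic data as a stand-in for the recordlinkage classifier and FEBRL features (both substitutions are assumptions here, not the library's prescribed workflow):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n = 1000
y = (rng.random(n) < 0.1).astype(int)              # ~10% "matches"
X = y[:, None] + rng.normal(0, 0.6, size=(n, 3))   # noisy comparison vectors

kf = KFold(n_splits=10, shuffle=True, random_state=0)
oof_pred = np.empty(n, dtype=int)
for train_idx, test_idx in kf.split(X):
    # each sample is predicted by the one model that did not train on it
    clf = GaussianNB().fit(X[train_idx], y[train_idx])
    oof_pred[test_idx] = clf.predict(X[test_idx])

# the test folds partition the data, so this is one score over all samples
oof_f1 = f1_score(y, oof_pred)
print("out-of-fold f-score:", round(oof_f1, 3))
```

In recordlinkage terms this would mean taking the union of the per-fold predicted MultiIndexes and comparing that union against df_true_links once. Averaging per-fold f-scores and scoring a single model on the full features frame are simply different estimates, so they will not agree in general.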

Dragut commented 4 years ago

I reckon the split loses a lot of the information needed to get correct estimates of the log-likelihood weights. The same applies to parallelism.
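One way to see this information loss: candidate-pair sets are heavily imbalanced, and plain KFold lets the match rate drift from fold to fold, which shifts the class priors the Naive Bayes weights are estimated from. A hedged sketch with synthetic labels standing in for the match/non-match status of candidate pairs (an assumption, not the thread's actual data), comparing KFold with StratifiedKFold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(1)
y = (rng.random(2000) < 0.05).astype(int)   # ~5% matches among candidate pairs
X = np.zeros((len(y), 1))                   # features are irrelevant for the split

def fold_rates(splitter):
    # fraction of positives ("matches") landing in each test fold
    return [y[test].mean() for _, test in splitter.split(X, y)]

plain = fold_rates(KFold(n_splits=10, shuffle=True, random_state=1))
strat = fold_rates(StratifiedKFold(n_splits=10, shuffle=True, random_state=1))
print("match-rate spread with KFold:          ", round(max(plain) - min(plain), 4))
print("match-rate spread with StratifiedKFold:", round(max(strat) - min(strat), 4))
```

If prior drift across folds is the concern, feeding StratifiedKFold the binary match labels derived from df_true_links keeps each fold's match rate essentially constant, since it places an almost equal number of true links in every fold.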