The library documentation does not provide much guidance on train/test splitting and cross-validation. Below is an implementation using the KFold object from scikit-learn.
How is the blocking strategy accounted for in the K-fold split? And how should the f-score be calculated for X_test?
Please review the code below and advise.
import recordlinkage
from recordlinkage.datasets import load_febrl2
import pandas as pd
from recordlinkage.index import Block
import warnings
import numpy
warnings.filterwarnings("ignore")
dfA, df_true_links = load_febrl2(return_links=True)
# Gold-standard true links as a two-level MultiIndex of record-id pairs
df_true_links = df_true_links.to_frame(index=False)
df_true_links.columns = ['rec_id_1', 'rec_id_2']
df_true_links.set_index(['rec_id_1', 'rec_id_2'], inplace=True)
print("dataset size:", len(dfA))
# Indexation step
indexer = recordlinkage.Index()
indexer.add(Block(['given_name']))
indexer.add(Block(['surname']))
indexer.add(Block(['date_of_birth']))
candidate_links = indexer.index(dfA)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('given_name', 'given_name', method='jarowinkler', threshold=0.85,label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('suburb', 'suburb', label='suburb')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
features = compare_cl.compute(candidate_links, dfA)
print("comparison vector size:", len(features))
# Classification step
# 10-fold cross validation
from sklearn.model_selection import KFold
kf = KFold(n_splits=10)
fscore = []
for train_index, test_index in kf.split(features):
    X_train = features.iloc[train_index]
    X_test = features.iloc[test_index]
    # Gold-standard links that fall inside each fold, via MultiIndex intersection
    # (.intersection() instead of &, which is deprecated as a set operation)
    Y_train = X_train.index.intersection(df_true_links.index)
    Y_test = X_test.index.intersection(df_true_links.index)
    # Train the classifier
    nb = recordlinkage.NaiveBayesClassifier()
    nb.fit(X_train, Y_train)
    # Predict matches for the test fold
    result_nb = nb.predict(X_test)
    # f-score against the gold-standard true links - is this the right way to score X_test?
    fscore.append(recordlinkage.fscore(Y_test, result_nb))
    print("training data fold size:", len(X_train))
    print("test data fold size:", len(X_test))
print("10-fold score :",fscore)
print("average f-score :",numpy.mean(fscore))
# Predict on the full candidate set with the last trained classifier
r = nb.predict(features)
score = recordlinkage.fscore(df_true_links.index, r)
print("overall score:", score)
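For reference, the Y_train/Y_test construction above is just a set intersection of two pandas MultiIndexes. A minimal pandas-only sketch (with made-up record-pair labels, not real FEBRL ids) of what it produces:

```python
# Sketch: intersecting a fold's candidate-pair index with the gold-standard
# index yields the true links that fall inside that fold. Labels are invented
# for illustration only.
import pandas as pd

candidates = pd.MultiIndex.from_tuples(
    [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")],
    names=["rec_id_1", "rec_id_2"],
)
true_links = pd.MultiIndex.from_tuples(
    [("a2", "b2"), ("a4", "b4")], names=["rec_id_1", "rec_id_2"]
)

test_fold = candidates[:2]          # e.g. the first KFold test split
y_test = test_fold.intersection(true_links)
print(list(y_test))                 # [('a2', 'b2')]
```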
Why is the score on the complete dataset different from the average KFold f-score? In the KFold loop, should we make predictions on the complete dataset?
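Part of this difference is expected even with a fixed classifier: the mean of per-fold f-scores is not, in general, equal to the f-score of the pooled predictions, because F1 is a ratio of counts and does not average linearly across folds. A small scikit-learn-only sketch with made-up labels:

```python
# Toy demonstration (invented labels): averaging per-fold F1 scores does not
# equal the F1 of the pooled predictions.
import numpy as np
from sklearn.metrics import f1_score

fold_true = [np.array([1, 0]), np.array([1, 1, 1, 1, 0, 0])]
fold_pred = [np.array([1, 0]), np.array([0, 0, 0, 0, 1, 1])]

# One fold scores 1.0, the other 0.0
mean_of_folds = np.mean([f1_score(t, p) for t, p in zip(fold_true, fold_pred)])
pooled = f1_score(np.concatenate(fold_true), np.concatenate(fold_pred))

print(mean_of_folds)  # 0.5
print(pooled)         # 0.25
```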