J535D165 / recordlinkage

A powerful and modular toolkit for record linkage and duplicate detection in Python
http://recordlinkage.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

ECM Classification Binary Label Issue; Deadline Sensitive #75

Closed LoganPletcher closed 5 years ago

LoganPletcher commented 6 years ago

Hello there, I'm running into an issue when calling an ECMClassifier object's fit function on a pandas.DataFrame of comparison vectors. I've looked around without success, as none of the solutions I found relate to the recordlinkage library.

To help explain the problem, I'll describe the dataframes I am working with (I can't show the actual data due to work privacy). Dataframe1 contains seven columns in the following order: Block_Key, cust_name, physical_address1, physical_address2, city, state, zip. Dataframe2 has eight columns: Block_Key, business_name, tradestyle, sec_trdstl, physical_street, physical_city, physical_state, physical_zip.

They are indexed by blocking on Block_Key, which returns a list of MultiIndex objects.

def block_key_pairing(self, block_keys):
    df1Data = pandas.read_excel(self.Dataframe1)
    df2Data = pandas.read_excel(self.Dataframe2)
    candidate_df = pandas.DataFrame()
    pair_list = []
    for block_key in block_keys:
        bkIndexer = BlockKeyIndexer(block_key)
        pair = bkIndexer.index(df1Data, df2Data)
        for index in pair:
            candidate_df.loc[index] = block_key
        pair_list.append(pair)
    candidate_df.to_excel('C:\\Users\\Documents\\Pairs & Block Keys.xlsx')
    return pair_list

BlockKeyIndexer is a custom indexing class. I wrote it because the regular Indexer was returning pairs whose block keys differ by one or two characters. The function that uses the candidate pairs is shown below; it is also where I have been experimenting with classification over the past couple of days.

def comparator(self, candidate_links):
    comp = recordlinkage.Compare()
    vector_df = pandas.DataFrame()

    comp.string("cust_name", "business_name", method='levenshtein',label='Business Name')
    comp.string("cust_name", "tradestyle", method='levenshtein',label='Trade Style')
    comp.string("cust_name", "sec_trdstl", method='levenshtein',label='Secondary TS')
    comp.string("physical_address_1", "physical_street", method='levenshtein',label='Primary Address')
    comp.string("physical_address_2", "physical_street", method='levenshtein',label='Secondary Address')
    comp.string("city", "physical_city", method='levenshtein',label='City')
    comp.string("state", "physical_state", method='levenshtein',label='State')
    comp.numeric("zip", "physical_zip",label='Zip')

    df1Data = pandas.read_excel(self.Dataframe1)
    df2Data = pandas.read_excel(self.Dataframe2)

    for candidate_pair in candidate_links:
        # Use the dataframes loaded above (cdmData/dnbData were never defined here)
        vector_df = vector_df.append(comp.compute(candidate_pair, df1Data, df2Data))

    vector_df=vector_df.astype(np.int_)
    ecm=recordlinkage.ECMClassifier(init='jaro',binarize=0.8)
    result_ecm=ecm.fit(vector_df)
    print(len(result_ecm))

    return vector_df

This builds a dataframe of comparison vectors called vector_df, and calling the fit function on it yields this error:

ValueError: Only binary labels are allowed for 'jaro'method. Column 2 has 1 different labels.

This perplexes me, because the message seems to imply that Columns 0 and 1 are acceptable. I figured the problem was that the values inside vector_df weren't binary, so I tried modifying the comparator function like so:

def comparator(self, candidate_links):
    comp = recordlinkage.Compare()

    vector_df = pandas.DataFrame()

    comp.string("cust_name", "business_name", method='levenshtein',label='Business Name')
    comp.string("cust_name", "tradestyle", method='levenshtein',label='Trade Style')
    #comp.string("cust_name", "sec_trdstl", method='levenshtein',label='Secondary TS')
    comp.string("physical_address_1", "physical_street", method='levenshtein',label='Primary Address')
    comp.string("physical_address_2", "physical_street", method='levenshtein',label='Secondary Address')
    comp.string("city", "physical_city", method='levenshtein',label='City')
    comp.string("state", "physical_state", method='levenshtein',label='State')
    comp.numeric("zip", "physical_zip",label='Zip')

    df1Data = pandas.read_excel(self.Dataframe1)
    df2Data = pandas.read_excel(self.Dataframe2)

    for candidate_pair in candidate_links:
        vector_df = vector_df.append(comp.compute(candidate_pair,df1Data,df2Data))

    vector_df=vector_df.astype(np.int_)
    ecm=recordlinkage.ECMClassifier(init='jaro',binarize=0.8)
    result_ecm=ecm.fit(vector_df)
    print(len(result_ecm))

    return vector_df

However, this yields a new error when running:

ValueError: could not broadcast input array from shape (12) into shape (13)

I've tried debugging and looked inside the library to follow the process, but I still can't figure out what I need to do to vector_df for my code to run correctly.

I am new to the recordlinkage library, having only worked with it for three weeks now.
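
One thing that may be worth ruling out for the first error (an assumption, not something confirmed in this thread): `astype(np.int_)` truncates float similarity scores toward zero, so any column without an exact 1.0 score becomes all zeros and ends up with a single label. A minimal sketch with made-up scores standing in for the output of `Compare.compute()`; the column names and the 0.8 cut-off are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical similarity scores, standing in for the output of Compare.compute().
scores = pd.DataFrame({
    "Business Name": [0.95, 0.40, 0.82],
    "City": [1.00, 0.75, 0.60],
    "State": [0.50, 0.99, 0.30],
})

# Casting to int truncates toward zero: only exact 1.0 scores survive as 1.
truncated = scores.astype(np.int_)

# Thresholding first keeps the "similar enough" signal as 0/1 (0.8 is an assumed cut-off).
thresholded = (scores >= 0.8).astype(np.int_)

# Columns that hold a single distinct value after each conversion.
constant_after_cast = [c for c in truncated.columns if truncated[c].nunique() == 1]
constant_after_threshold = [c for c in thresholded.columns if thresholded[c].nunique() == 1]

print(constant_after_cast)
print(constant_after_threshold)
```

If the real vector_df shows the same pattern, dropping the integer cast or thresholding before it might be a reasonable thing to try.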

mayerantoine commented 5 years ago

Hi all, I have the same issue when using the febrl4 dataset. It can be reproduced using the code below.

import recordlinkage as rl
from recordlinkage import Block
from recordlinkage.datasets import load_febrl4,load_krebsregister,load_febrl2,load_febrl3,load_febrl1,binary_vectors
import numpy as np
from jellyfish import soundex, metaphone

df_a,df_b,df_true_links = load_febrl4(return_links= True)
df_true_links = df_true_links.to_frame(index=False)
df_true_links.columns=['rec_id_1','rec_id_2']
df_true_links.set_index(['rec_id_1','rec_id_2'],inplace=True)

##### PRE-PROCESSING 

#soundex of the firstname
df_a['sndx_given_name'] = df_a['given_name'].apply(lambda x : soundex(str(x)))
df_b['sndx_given_name'] = df_b['given_name'].apply(lambda x : soundex(str(x)))

# metaphone of the surname
df_a['mtph_surname'] = df_a['surname'].apply(lambda x : metaphone(str(x)))
df_b['mtph_surname'] = df_b['surname'].apply(lambda x : metaphone(str(x)))

# split date_of_birth
df_a['YearB'] = df_a['date_of_birth'].str[:4].astype(str)
df_a['MonthB'] = df_a['date_of_birth'].str[5:7].astype(str)
df_a['DayB'] = df_a['date_of_birth'].str[6:].astype(str)

df_b['YearB']= df_a['date_of_birth'].str[:4].astype(str)
df_b['MonthB']= df_a['date_of_birth'].str[5:7].astype(str)
df_b['DayB']= df_a['date_of_birth'].str[6:].astype(str)

##### BLOCKING
indexer = rl.Index()

# soundex firstname, metaphone surname, exact date of birth
indexer.add(Block(['sndx_given_name','mtph_surname','date_of_birth']))
#indexer.block(['sndx_given_name','mtph_surname','date_of_birth'])

# soundex firstname , day of birth
indexer.add(Block(['sndx_given_name','DayB']))

#soundex firstname , month of birth
indexer.add(Block(['sndx_given_name','MonthB']))

# exact date of birth
indexer.add(Block(['date_of_birth']))

# metaphone surname, year of birth 
indexer.add(Block(['mtph_surname','YearB']))

candidate_record_pairs = indexer.index(df_a,df_b)

#### COMPARISON

compare_cl = rl.Compare()
compare_cl.string('given_name', 'given_name', method='jarowinkler', threshold = 0.85, label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler',threshold = 0.85, label='surname')
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('soc_sec_id', 'soc_sec_id', label='soc_sec_id')
compare_cl.string('address_1', 'address_1', method ='levenshtein' ,threshold = 0.85, label='address_1')
compare_cl.string('address_2', 'address_2', method ='levenshtein' ,threshold = 0.85, label='address_2')
compare_cl.string('suburb', 'suburb', method ='levenshtein' ,threshold = 0.85, label='suburb')
compare_cl.exact('postcode', 'postcode', label='postcode')
compare_cl.exact('state', 'state', label='state')

features = compare_cl.compute(candidate_record_pairs, df_a,df_b)

#### CLASSIFICATION UNSUPERVISED with ECM

ecm = rl.ECMClassifier()
result_ecm = ecm.fit(features)

#### EVALUATION
c_m = rl.confusion_matrix(df_true_links, result_ecm, len(features))
print(c_m)

Running this raises the following error in the nb_sklearn module:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\recordlinkage\algorithms\nb_sklearn.py in _init_parameters_jaro(self, X_bin)
    514             # TODO: check with bin.y_type_
    515 
--> 516             feature_prob[0, :] = np.tile([.9, .1], int(n_features / 2))
    517             feature_prob[1, :] = np.tile([.1, .9], int(n_features / 2))
    518 

**ValueError: could not broadcast input array from shape (16) into shape (17)**

The classifier also does not seem to work for the febrl1 and febrl3 datasets. However, it does work with the febrl2 and krebsregister datasets and with generated binary vectors. Could others please test this and respond?
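
The assignment flagged in the traceback can be examined in isolation with plain NumPy. This is a sketch, not the library's code: `jaro_style_init` is a hypothetical helper that copies only the one line shown in the traceback. Because `np.tile([.9, .1], int(n_features / 2))` always has an even length, it cannot fill a row with an odd number of columns. How the width becomes odd is not established in this thread; one guess is that a comparison column with a single distinct value contributes only one column after binarization.

```python
import numpy as np

def jaro_style_init(n_features):
    """Hypothetical helper: repeat the assignment from the traceback for a given width."""
    feature_prob = np.zeros((2, n_features))
    # np.tile repeats the two-element pattern int(n_features / 2) times,
    # so the right-hand side always has an even number of elements.
    feature_prob[0, :] = np.tile([.9, .1], int(n_features / 2))
    return feature_prob

pattern = np.tile([.9, .1], int(17 / 2))
print(pattern.shape)  # (16,)

even_ok = jaro_style_init(16)  # even width: the assignment succeeds

try:
    jaro_style_init(17)  # odd width: 16 values cannot fill 17 columns
    failed = False
except ValueError as exc:
    failed = True
    print(exc)
```

The same arithmetic would account for the (12) into (13) and (6) into (7) messages reported elsewhere in this thread.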

meccaLeccaHi commented 5 years ago

I'm also encountering this error whenever I try to run the ECMClassifier with my data. ValueError: Only binary labels are allowed for 'jaro'method. Column 0 has 1 different labels. I've confirmed with .nunique() that there is more than one unique value in each of the columns. When I remove the offending variables from the linkage, I end up with this error instead: ValueError: could not broadcast input array from shape (6) into shape (7). Any advice would be greatly appreciated.
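
A raw `.nunique()` check may not tell the whole story if the scores are thresholded before the classifier sees them. The sketch below uses made-up data and an assumed cut-off of 0.8 (mirroring the `binarize=0.8` argument in the first post; whether the library applies it exactly this way is an assumption). It compares the distinct-value counts before and after thresholding:

```python
import pandas as pd

# Made-up comparison scores; both columns have several distinct raw values.
features = pd.DataFrame({
    "name": [0.20, 0.50, 0.70],
    "city": [0.30, 0.90, 0.85],
})

threshold = 0.8  # hypothetical cut-off, mirroring binarize=0.8 from the first post

# Distinct values per column before thresholding.
raw_counts = features.nunique()

# Distinct values per column after mapping scores to 0/1.
binary_counts = (features > threshold).astype(int).nunique()

print(raw_counts.to_dict())
print(binary_counts.to_dict())
```

Here "name" has three distinct raw values but only one label once thresholded, which is the kind of column the "Only binary labels" message complains about.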