J535D165 / recordlinkage

A powerful and modular toolkit for record linkage and duplicate detection in Python
http://recordlinkage.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Recordlinkage, ValueError: index of DataFrame is not unique #157

Open lsun907 opened 3 years ago

lsun907 commented 3 years ago

Hi, I am linking two datasets. Both contain unique IDs as identifiers. After reading the two datasets into pandas DataFrames, I set those IDs as their indexes so that, after classification, I can figure out which records from each dataset matched. But after setting those IDs as indexes, I get an error in the blocking step:

ValueError: index of DataFrame is not unique

I am sure that neither ID column has duplicates. Here is the relevant code. Can you please help me figure out what the problem is?


import pandas as pd
import recordlinkage

firm_name = pd.read_csv(r"C:\Users\XXX\Dropbox\YYY\firmname.csv", index_col='ID_EMPLOYER', encoding='latin-1')
ccm_name = pd.read_csv(r"C:\Users\XXX\Dropbox\YYY\comphist.csv", index_col='ID_HCONM', encoding='latin-1')

indexer = recordlinkage.Index()
indexer.block(left_on='EMPLOYER_STATE', right_on='HSTATE')
candidates = indexer.index(firm_name, ccm_name)


Then I got this error message: ValueError: index of DataFrame is not unique

Can anyone help please?

lsun907 commented 3 years ago

By the way, the ID in each dataset is a sequence of numbers from 1 to N (the total number of observations in the dataset)
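
One way to double-check the uniqueness claim is to ask pandas directly which index labels repeat. The snippet below is only a sketch on a made-up toy frame: the helper name report_duplicate_index and the sample values are hypothetical, and only the column and index names are borrowed from the code above.

```python
import pandas as pd


def report_duplicate_index(df: pd.DataFrame) -> pd.Index:
    """Return the index labels that occur more than once in df."""
    # keep=False marks every occurrence of a repeated label, not just the later ones
    dup_mask = df.index.duplicated(keep=False)
    return df.index[dup_mask].unique()


# Toy stand-in for firm_name (hypothetical values); the label 2 appears twice on purpose
toy = pd.DataFrame(
    {"EMPLOYER_STATE": ["NY", "CA", "NY"]},
    index=pd.Index([1, 2, 2], name="ID_EMPLOYER"),
)

print(toy.index.is_unique)                # False
print(list(report_duplicate_index(toy)))  # [2]
```

Running the same two checks on firm_name and ccm_name would show whether the indexes really are unique after read_csv, and which IDs repeat if they are not.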

titipata commented 3 years ago

@lsun907 I had a similar issue and solved it by resetting the index of my DataFrame, either with df.index = np.arange(len(df)) or with data_df.reset_index(col_level=1, drop=True, inplace=True). Someone might have a better solution to this.
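
Both variants above discard the original index, and the question was about tracing matches back to the original IDs. A possible variation is to call reset_index() without drop=True, which keeps the old index as an ordinary column. This is a sketch on toy data, not something confirmed in this thread: the sample values and the left_positions list (standing in for the left side of the candidate pairs) are hypothetical.

```python
import pandas as pd

# Toy stand-in for firm_name (hypothetical values), indexed by the original ID
firm = pd.DataFrame(
    {"EMPLOYER_STATE": ["NY", "CA", "NY"]},
    index=pd.Index([1, 2, 2], name="ID_EMPLOYER"),
)

# reset_index() without drop=True moves ID_EMPLOYER into a regular column
# and replaces the index with a fresh 0..N-1 RangeIndex, which is always unique
firm_reset = firm.reset_index()
print(firm_reset.index.is_unique)  # True

# Hypothetical row positions, as they might appear on the left side of candidate pairs
left_positions = [0, 2]

# Look up the original IDs for those positions
original_ids = firm_reset.loc[left_positions, "ID_EMPLOYER"].tolist()
print(original_ids)  # [1, 2]
```

The trade-off is one extra lookup after classification, in exchange for an index that is guaranteed to be unique during blocking.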

ethan-huffington commented 2 years ago

df.index = np.arange(len(df)) worked for me. Thanks!