Open lsun907 opened 3 years ago
By the way, the ID in each dataset is a sequence of numbers from 1 to N (the total number of observations in the dataset)
@Isun907 I had a similar issue and solved it by reindexing my dataframe with
df.index = np.arange(len(df))
or
data_df.reset_index(col_level=1, drop=True, inplace=True)
Someone might have a better solution to this.
df.index = np.arange(len(df))
worked for me. Thanks!
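One caveat with both fixes above: they throw away the original IDs, which the question wanted to keep so matched records can be traced back. A minimal sketch of one way around that, assuming pandas only and a made-up toy frame (the data and the name df_reset are hypothetical, not from the thread), is to reset the index with drop=False so the old ID stays as a regular column:

```python
import pandas as pd

# Hypothetical toy frame standing in for firm_name read with index_col='ID_EMPLOYER'.
# The ID 20 is repeated on purpose to mimic a non-unique index.
df = pd.DataFrame(
    {"EMPLOYER_STATE": ["CA", "NY", "TX"]},
    index=pd.Index([10, 20, 20], name="ID_EMPLOYER"),
)

# drop=False moves the old index into a regular column instead of discarding it,
# and the frame gets a fresh 0..N-1 index that is unique by construction.
df_reset = df.reset_index(drop=False)

# A candidate pair that refers to row label 2 can be mapped back to the original ID.
original_id = df_reset.loc[2, "ID_EMPLOYER"]
print(original_id)
```

After linking, the pair labels refer to the new 0..N-1 index, and the ID_EMPLOYER column gives the original identifier for each label.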
Hi, I am linking two datasets. Both contain unique IDs as identifiers. After reading the two datasets into pandas DataFrames, I set those IDs as their indexes so that, after classification, I can figure out which records from each dataset matched. But after setting the IDs as indexes, I get an error in the blocking step.
ValueError: index of DataFrame is not unique
I am sure neither ID column has duplicates. Here is some of the code. Can you please help me find the problem?
import pandas as pd
import recordlinkage

firm_name = pd.read_csv(r"C:\Users\XXX\Dropbox\YYY\firmname.csv", index_col='ID_EMPLOYER', encoding='latin-1')
ccm_name = pd.read_csv(r"C:\Users\XXX\Dropbox\YYY\comphist.csv", index_col='ID_HCONM', encoding='latin-1')

indexer = recordlinkage.Index()
indexer.block(left_on='EMPLOYER_STATE', right_on='HSTATE')
candidates = indexer.index(firm_name, ccm_name)
Then I got this error message: ValueError: index of DataFrame is not unique
Can anyone help please?
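Before changing the index, it may be worth checking whether pandas itself considers it unique. The sketch below uses only pandas and a made-up toy frame in place of firmname.csv; the helper duplicated_index_values is a hypothetical name introduced for illustration. It lists which ID values repeat:

```python
import pandas as pd

# Hypothetical toy data standing in for firmname.csv; the ID 2 is repeated on purpose.
firm_name = pd.DataFrame(
    {"EMPLOYER_STATE": ["CA", "NY", "NY", "TX"]},
    index=pd.Index([1, 2, 2, 3], name="ID_EMPLOYER"),
)

def duplicated_index_values(df):
    """Return the index values that occur more than once in df."""
    # duplicated(keep="first") flags every repeat after the first occurrence
    return df.index[df.index.duplicated(keep="first")].unique().tolist()

print(firm_name.index.is_unique)           # False for this toy frame
print(duplicated_index_values(firm_name))  # [2] for this toy frame
```

Running the same two checks on both real frames (firm_name and ccm_name) would show whether either index has repeated values, including missing IDs that were read in as NaN more than once.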