bergen288 closed this issue 10 months ago
I have figured out the root cause of my issue. The following rl_comparer definitions were inside my function, so they were registered again on the same comparer each time the function was called in my loop:
rl_comparer.exact('street_number', 'street_number', label="street_number")
rl_comparer.string('street_name', 'street_name', threshold=0.8, label="street_name")
rl_comparer.string('name_lower', 'name_lower', threshold=0.5, label="name_lower")
Moving them out of the function, so that they run only once, resolves the issue.
import recordlinkage as rl  # assumed import; "rl" is the alias used throughout this thread

# Configure the indexer and the comparer once, outside the function.
rl_indexer = rl.Index()
rl_comparer = rl.Compare()
rl_indexer.block('postal_code')
rl_comparer.exact('street_number', 'street_number', label="street_number")
rl_comparer.string('street_name', 'street_name', threshold=0.8, label="street_name")
rl_comparer.string('name_lower', 'name_lower', threshold=0.5, label="name_lower")

def rl_matching(df_master, df_std, rl_indexer=rl_indexer, rl_comparer=rl_comparer):
    # Candidate pairs share the same postal_code.
    candidate_links = rl_indexer.index(df_master, df_std)
    # One column per comparison registered above.
    feature_vectors = rl_comparer.compute(candidate_links, df_master, df_std)
    N = 3
    feature_vectors = feature_vectors.astype('int')
    print(feature_vectors.head(10))
    # ......................
    # additional codes
    # ......................
    return matched_address_total
Resolved.
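The mechanism can be illustrated without the library. The sketch below uses a hypothetical `ToyCompare` class (a simplified stand-in, not the recordlinkage API) that appends every registered comparison to an internal list. Registering inside the function adds 3 more columns on each call, which mirrors the 3, 6, 9 pattern reported in the original post; registering once keeps the count at 3.

```python
class ToyCompare:
    """Hypothetical stand-in for a stateful comparer; not the recordlinkage API."""

    def __init__(self):
        self.features = []  # one entry per registered comparison

    def exact(self, left, right, label):
        self.features.append(label)

    def compute(self):
        # One output column per registered comparison.
        return list(self.features)


LABELS = ["street_number", "street_name", "name_lower"]


def buggy_matching(comparer):
    # Registers the comparisons on every call, so they pile up.
    for label in LABELS:
        comparer.exact(label, label, label=label)
    return comparer.compute()


def fixed_matching(comparer):
    # Assumes the comparisons were registered once, outside the function.
    return comparer.compute()


shared = ToyCompare()
buggy_counts = [len(buggy_matching(shared)) for _ in range(3)]

configured = ToyCompare()
for label in LABELS:
    configured.exact(label, label, label=label)
fixed_counts = [len(fixed_matching(configured)) for _ in range(3)]

print(buggy_counts)  # [3, 6, 9]
print(fixed_counts)  # [3, 3, 3]
```

An alternative with the same effect would be to create a new comparer inside the function on every call, at the cost of rebuilding it each time.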
My use case involves two dataframes of US addresses. The standard and master dataframes have the same columns: 'postal_code', 'street_number', 'street_name', and 'name_lower'. Within the same 'postal_code', I want to match addresses that have exactly the same 'street_number', a very similar 'street_name' (threshold=0.8), and a similar lowercase name (threshold=0.5). Below is my rl_matching function.
Because the master dataframe is very large, processing all the data at once fails with a memory allocation error. Below is my code to loop over zip codes:
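The loop itself is not reproduced in this thread. As a rough sketch of what a per-zip-code loop could look like, the hypothetical helper `match_by_zip` below splits both dataframes by 'postal_code', skips zip codes that are empty or missing on either side, and calls a matching function on each pair of chunks. The helper name, the `match_fn` parameter, and the toy data are assumptions for illustration; in the real script `match_fn` would presumably be `rl_matching`.

```python
import pandas as pd


def match_by_zip(df_master, df_std, match_fn, zip_col="postal_code"):
    """Run match_fn on one zip code at a time and concatenate the results."""
    results = []
    std_groups = {zip_code: chunk for zip_code, chunk in df_std.groupby(zip_col)}
    for zip_code, master_chunk in df_master.groupby(zip_col):
        std_chunk = std_groups.get(zip_code)
        # Skip zip codes that are missing or empty in either dataframe.
        if std_chunk is None or master_chunk.empty or std_chunk.empty:
            continue
        results.append(match_fn(master_chunk, std_chunk))
    if not results:
        return pd.DataFrame()
    return pd.concat(results)


def toy_match(master_chunk, std_chunk):
    # Placeholder matcher for the demo: exact join on street_number.
    return master_chunk.merge(std_chunk, on=["postal_code", "street_number"])


df_master = pd.DataFrame({
    "postal_code": ["01001", "01001", "01002", "01005"],
    "street_number": ["1", "2", "3", "9"],
})
df_std = pd.DataFrame({
    "postal_code": ["01001", "01002", "01003"],
    "street_number": ["1", "3", "4"],
})

matched = match_by_zip(df_master, df_std, toy_match)
print(matched)
```

Processing one zip code at a time keeps each candidate set small, which is what avoids the memory error, but it also means any state kept between calls (such as a shared comparer) carries over from one zip code to the next.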
Below is the print output. '01001' is actually the first zip code processed, as there are 3 zip codes with either len(df_master) = 0 or len(df_std) = 0. As you can see, zip code 01001 has 3 matching columns, zip code 01002 has 6 (2 sets of street_number, street_name, name_lower), zip code 01003 has 9 (3 sets), and the count keeps growing. As a workaround, I can use "feature_vectors = feature_vectors.iloc[:, :3]" to keep only the first 3 columns, but I would like to report this issue for your consideration.
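For reference, a small pandas-only sketch of the workaround on a made-up feature matrix with two repeated sets of columns. It also shows a check for duplicated column labels, which would flag the problem instead of silently trimming it. The data values are invented for illustration.

```python
import pandas as pd

labels = ["street_number", "street_name", "name_lower"]

# Toy feature matrix as it might look on the second zip code: 2 sets of columns.
feature_vectors = pd.DataFrame(
    [[1, 1, 0, 1, 1, 0], [1, 0, 1, 1, 0, 1]],
    columns=labels * 2,
)

# Detect repeated column labels.
has_duplicates = bool(feature_vectors.columns.duplicated().any())

# Workaround from the post: keep only the first 3 columns.
trimmed = feature_vectors.iloc[:, :3]

# Alternative: drop repeated labels, keeping the first occurrence of each.
deduped = feature_vectors.loc[:, ~feature_vectors.columns.duplicated()]

print(has_duplicates)
print(list(trimmed.columns))
print(list(deduped.columns))
```

Either form only hides the symptom: the repeated comparisons are still computed before the extra columns are discarded.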