J535D165 / recordlinkage

A powerful and modular toolkit for record linkage and duplicate detection in Python
http://recordlinkage.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
955 stars 153 forks source link

Duplicated matching columns with rl_comparer.compute while looping over zip code #200

Closed bergen288 closed 10 months ago

bergen288 commented 11 months ago

My use case is that there are 2 dataframes with US addresses. Both standard and master dataframes have same columns with 'postal_code', 'street_number', 'street_name', and 'name_lower'. In the same 'postal_code', I want to match address with the exact same 'street_number', very similar "street_name" (threshold=0.8), and similar name in lowcase (threshold=0.5). Below is my rl_matching function.

import recordlinkage as rl
rl_indexer = rl.Index()
rl_comparer = rl.Compare()
def rl_matching(df_master, df_std):
    rl_indexer.block('postal_code')
    candidate_links = rl_indexer.index(df_master, df_std)
    N = 3
    rl_comparer.exact('street_number', 'street_number', label="street_number")
    rl_comparer.string('street_name', 'street_name', threshold=0.8, label="street_name")
    rl_comparer.string('name_lower', 'name_lower', threshold=0.5, label="name_lower")
    feature_vectors = rl_comparer.compute(candidate_links, df_master, df_std)
    # feature_vectors = feature_vectors.iloc[:, :N]
    feature_vectors = feature_vectors.astype('int')
    print(feature_vectors.head(10))
    ......................
    additional codes
    ......................
    return matched_address_total

Because master dataframe is very large, there is memory allocation error to process all data together. Below is my code to loop over zip code:

for i, zip5 in enumerate(sorted(unique_zips_total[previous_position:end_position])):
    out_file5 = f"linked_{zip5}.parquet"
    if not exists(out_file5):
        df_std = df_std_total[df_std_total['postal_code'] == zip5]
        df_master = df_master_total[df_master_total['postal_code'] == zip5]
        print(f'The {i}th processing zip code is:', zip5)
        if len(df_master) > 0 and len(df_std) > 0:
            matched_data = rl_matching(df_master, df_std)
            print('len of matches is:', len(matched_data))
            if len(matched_data) > 0:
                matched_data.to_parquet(out_file5)
            del matched_data
        else:
            print('len of matches is 0.')
        del df_master, df_std

Below is the print output. '01001' actually is the 1st processing zip code as there are 3 zip codes with either len(df_master) = 0 or len(df_std) = 0. As you can see, zip code 01001 has 3 matching columns, zip code 01002 has 6 matching columns (2 sets of street_number street_name name_lower), zip code 01003 has 9 matching columns (3 sets of street_number street_name name_lower), and it keeps going. I can use "feature_vectors = feature_vectors.iloc[:, :3]" to keep 1st 3 columns only as a workaround. But I would like to report this issue for your consideration.

The 3th processing zip code is: 01001
            street_number  street_name  name_lower
4266 14975              0            0           0
     24830              1            1           0
4267 14975              0            0           0
     24830              1            1           0
4268 14975              0            0           0
     24830              1            1           0
4298 14975              1            1           1
     24830              0            0           0
4299 14975              1            1           1
     24830              0            0           0
The 4th processing zip code is: 01002
            street_number  street_name  name_lower  street_number  street_name  name_lower
4540 2722               0            0           0              0            0           0
     2727               1            1           0              1            1           0
     18386              0            0           0              0            0           0
     28789              0            0           0              0            0           0
     28805              0            0           0              0            0           0
     31714              0            0           0              0            0           0
     47413              0            0           0              0            0           0
     47791              0            0           0              0            0           0
     50939              0            0           0              0            0           0
4541 2722               0            0           0              0            0           0
The 5th processing zip code is: 01003
            street_number  street_name  name_lower  street_number  street_name  name_lower  street_number  street_name  name_lower
5271 57155              1            1           0              1            1           0              1            1           0
5272 57155              1            1           0              1            1           0              1            1           0
5273 57155              1            1           0              1            1           0              1            1           0
bergen288 commented 10 months ago

I have figured out the root cause of my issue. Basically, it's caused by the multiple definitions of following rl_comparer code in my function when looping over addresses.

    rl_comparer.exact('street_number', 'street_number', label="street_number")
    rl_comparer.string('street_name', 'street_name', threshold=0.8, label="street_name")
    rl_comparer.string('name_lower', 'name_lower', threshold=0.5, label="name_lower")

They have to be moved out of my function to resolve the issue.

rl_indexer = rl.Index()
rl_comparer = rl.Compare()
rl_indexer.block('postal_code')
rl_comparer.exact('street_number', 'street_number', label="street_number")
rl_comparer.string('street_name', 'street_name', threshold=0.8, label="street_name")
rl_comparer.string('name_lower', 'name_lower', threshold=0.5, label="name_lower")

def rl_matching(df_master, df_std, rl_indexer=rl_indexer, rl_comparer=rl_comparer):
    candidate_links = rl_indexer.index(df_master, df_std)
    feature_vectors = rl_comparer.compute(candidate_links, df_master, df_std)
    N = 3
    feature_vectors = feature_vectors.astype('int')
    print(feature_vectors.head(10))
    ......................
    additional codes
    ......................
    return matched_address_total
bergen288 commented 10 months ago

Resolved.