RobinL / fuzzymatcher

Record linking package that fuzzy matches two Python pandas dataframes using sqlite3 fts4
MIT License

Question on string length / parsing before fuzzy matching. #46

Closed: soliverc closed this issue 5 years ago

soliverc commented 5 years ago

Under the Performance section of the example notebook, I ran the matcher using:

on = ["first_name", "surname", "dob", "city"]

lt = fuzzymatcher.link_table(df_left, df_right, on, on)

Measuring the performance with link_table_percentage_correct(link_table) gave a result in the region of 70%.

The next section of the notebook shows how performance improves when you create initials for the names and combine first_name and surname. With link_table_percentage_correct(link_table), accuracy rises to: 'Percent matches correct: 82.0%'

I then combined all strings into one, using:


df_left['merged'] = df_left.first_name + ' ' + df_left.surname + ' ' + df_left.dob + ' ' + df_left.city + ' ' + df_left.email

df_right['merged'] = df_right.first_name + ' ' + df_right.surname + ' ' + df_right.dob + ' ' + df_right.city + ' ' + df_right.email

And then ran the script again, getting the following result:

'Percent matches correct: 97.9%'

It seems the best result comes from combining all the strings together.

So the question is, should I be doing this with every dataset? I am currently working with addresses. I was going to parse every bit of the address, but now I may just keep it as one string.

However, unlike the dataset in the example I have no easy way to tell which match is correct.

RobinL commented 5 years ago

In this case, the higher link rate seems to be because you've included the email address.

In general I would expect a slight performance improvement from keeping individual fields rather than concatenating everything, because the match scores are computed using token frequency within each column.

This means a 'concat all' approach wouldn't work very well if, e.g., you had two boolean columns, one with very unbalanced categories and the other with very balanced ones.
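
To make the boolean-column case concrete, here is a minimal sketch in plain Python. It does not call fuzzymatcher or reproduce its scoring. The column names flag_rare and flag_even, the data, and the token_share helper are all hypothetical. It only shows how pooling tokens from several columns changes how rare a token appears to be.

```python
from collections import Counter

# Hypothetical data: two boolean columns stored as strings.
# flag_rare is very unbalanced (1 "true" in 100 rows),
# flag_even is balanced (50 "true" in 100 rows).
rows = [
    {
        "flag_rare": "true" if i == 0 else "false",
        "flag_even": "true" if i % 2 == 0 else "false",
    }
    for i in range(100)
]


def token_share(values, token):
    """Return the fraction of all whitespace-separated tokens equal to `token`."""
    counts = Counter(tok for value in values for tok in value.split())
    return counts[token] / sum(counts.values())


# Frequencies computed within each column separately.
rare_share = token_share([r["flag_rare"] for r in rows], "true")
even_share = token_share([r["flag_even"] for r in rows], "true")

# Frequency computed after concatenating both columns into one string.
merged = [r["flag_rare"] + " " + r["flag_even"] for r in rows]
pooled_share = token_share(merged, "true")

print(f"'true' within flag_rare: {rare_share:.3f}")
print(f"'true' within flag_even: {even_share:.3f}")
print(f"'true' after concat all: {pooled_share:.3f}")
```

Within flag_rare, "true" is a rare token, so a match on it is strong evidence. After concatenation the same token is pooled with the balanced column and looks far more common, so that evidence is diluted.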