Question on string length / parsing before fuzzy matching.

Under the Performance section of the example notebook, I ran the matcher using:

on = ["first_name", "surname", "dob", "city"]

lt = fuzzymatcher.link_table(df_left, df_right, on, on)

I then measured the performance with link_table_percentage_correct(link_table), and it was in the region of 70%.
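For anyone reading without the notebook open, here is a rough sketch of how an accuracy figure like this could be computed when the true links are known. It is not the notebook's implementation: the function name percentage_correct and the column names id_left, id_right and match_rank are assumptions made up for this illustration.

```python
import pandas as pd


def percentage_correct(link_table, left_id="id_left", right_id="id_right",
                       rank_col="match_rank"):
    """Share of top-ranked links whose left and right ids agree.

    All column names here are hypothetical; adjust them to your own link table.
    """
    # Keep only the best candidate for each left-hand record.
    best = link_table[link_table[rank_col] == 1]
    if best.empty:
        return float("nan")
    # A link counts as correct when both sides carry the same known id.
    return (best[left_id] == best[right_id]).mean()


# Toy link table: three top-ranked links, two of which are correct.
toy = pd.DataFrame({
    "id_left": [1, 2, 3, 3],
    "id_right": [1, 5, 3, 4],
    "match_rank": [1, 1, 1, 2],
})
score = percentage_correct(toy)
print(f"Percent matches correct: {score:.1%}")
```

This only works when both tables share a known identifier, which is the case for the example data.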

The next section of the notebook shows how performance improves when initials are created for the names and first_name and surname are combined. Running link_table_percentage_correct(link_table) again shows that accuracy has increased: 'Percent matches correct: 82.0%'

I then combined all strings into one, using:


df_left['merged'] = df_left.first_name + ' ' + df_left.surname + ' ' + df_left.dob + ' ' + df_left.city + ' ' + df_left.email

df_right['merged'] = df_right.first_name + ' ' + df_right.surname + ' ' + df_right.dob + ' ' + df_right.city + ' ' + df_right.email

I then ran the script again and got the following result:

'Percent matches correct: 97.9%'

It seems the best result is from combining all the strings together.
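One caveat with plain `+` concatenation in pandas is that a missing value in any column makes the whole merged string missing. A minimal sketch of a more tolerant merge step is below; the helper name add_merged_column and the toy data are made up for this illustration and are not part of fuzzymatcher.

```python
import pandas as pd


def add_merged_column(df, columns, target="merged"):
    """Return a copy of df with `columns` joined into one space-separated string.

    Hypothetical helper, not part of fuzzymatcher. Missing values are
    skipped rather than turning the whole merged string into NaN.
    """
    out = df.copy()
    # Treat missing values as empty strings and force everything to text.
    parts = out[columns].fillna("").astype(str)
    # Join the non-empty pieces of each row with single spaces.
    out[target] = parts.apply(
        lambda row: " ".join(v.strip() for v in row if v.strip()), axis=1
    )
    return out


# Toy data with one missing surname.
df_left = pd.DataFrame({
    "first_name": ["Ann", "Bob"],
    "surname": ["Lee", None],
    "city": ["Leeds", "York"],
})
df_left = add_merged_column(df_left, ["first_name", "surname", "city"])
print(df_left["merged"].tolist())

# The merged column could then be passed to the matcher in the same way as
# the column list earlier in this post, for example:
# lt = fuzzymatcher.link_table(df_left, df_right, ["merged"], ["merged"])
```

The same helper can be applied to df_right with the same column list so that both sides are built identically.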

So the question is: should I be doing this with every dataset? I am currently working with addresses. I was going to parse each component of the address, but now I may just keep it as one string.

However, unlike the dataset in the example, I have no easy way to tell which match is correct.

RobinL / fuzzymatcher

Question on string length / parsing before fuzzy matching. #46