Closed by soliverc 5 years ago
In this case, the higher link rate seems to be because you've included the email address.
In general I would expect a slight performance improvement from including individual fields rather than concatenating everything, because the match scores are computed using token frequency within each column.
This means a 'concat all' approach wouldn't work very well if, e.g., you had two boolean columns, one with very unbalanced categories and a second with very balanced ones: once concatenated, the tokens from both columns share a single frequency count, so a match on the rare value in the unbalanced column no longer stands out.
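To make the boolean-column point concrete, here is a rough sketch in plain Python. It is not the library's actual scoring code: the column names `is_vip` and `is_even`, the data, and the `token_weights` helper are all made up for illustration, and the weight is assumed to be a simple negative log of relative token frequency.

```python
from collections import Counter
import math

# Hypothetical data: 100 records with two boolean-like columns.
is_vip = ["yes"] * 1 + ["no"] * 99     # very unbalanced: "yes" is rare
is_even = ["yes"] * 50 + ["no"] * 50   # balanced: "yes" is common


def token_weights(tokens):
    """Weight each distinct token by -log(relative frequency).

    Rare tokens get a high weight, common tokens a low one. This is an
    assumed stand-in for a frequency-based match score.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: -math.log(n / total) for tok, n in counts.items()}


# Frequencies computed within each column separately.
per_column = {
    "is_vip": token_weights(is_vip),
    "is_even": token_weights(is_even),
}

# Frequencies computed after pooling both columns, as 'concat all' would do.
pooled = token_weights(is_vip + is_even)

print("is_vip  'yes' weight:", round(per_column["is_vip"]["yes"], 3))
print("is_even 'yes' weight:", round(per_column["is_even"]["yes"], 3))
print("pooled  'yes' weight:", round(pooled["yes"], 3))
```

Within its own column, a "yes" in `is_vip` is a strong signal, while a "yes" in `is_even` says very little. After pooling, both get the same middling weight, so the informative match is diluted and the uninformative one is inflated.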
Under the Performance section of the example notebook, I ran the matcher using:
We can then measure the performance with
link_table_percentage_correct(link_table)
and it was in the region of 70%. A following section then shows how performance is increased by creating initials for the names and combining first_name and surname. Using link_table_percentage_correct(link_table) we can see that accuracy has increased: 'Percent matches correct: 82.0%'
I then combined all strings into one, using:
And then ran the script again, getting the following result:
'Percent matches correct: 97.9%'
It seems the best result is from combining all the strings together.
So the question is, should I be doing this with every dataset? I am currently working with addresses. I was going to parse every bit of the address, but now I may just keep it as one string.
However, unlike the dataset in the example I have no easy way to tell which match is correct.