What would be the best approach to match a string to an existing dataframe's column of strings?

ParticularMiner / red_string_grouper

Record Equivalence Discoverer based on String Grouper

MIT License

What would be the best approach to match a string to an existing dataframe's column of strings? #3

Closed iibarant closed 2 years ago

iibarant commented 2 years ago

Hi @ParticularMiner,

I hope you are doing well. Imagine I have an address column in a dataframe and I want to find the best matches in it for an address that I enter in a query. One approach would be to append an extra row to the dataframe containing the address from the query and run the standard match_strings on that column. Another approach would be to create a one-row dataframe holding the query address and match the two with match_strings, passing the extra dataframe's column as the second Series.

What would you suggest?

I will be experimenting with both approaches.

Thank you!

ParticularMiner commented 2 years ago

Hi @iibarant

I would think the latter approach is better and much faster because it finds only those matches between the single string in the new Series and the strings in the other larger Series.

The issue with the former approach is that it also computes matches among all the existing strings in the Series, which is not what you are looking for and is therefore inefficient.
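To make the difference concrete, here is a minimal sketch of the one-against-many idea in plain Python rather than with string_grouper itself. It assumes a simplified similarity (cosine over raw character 3-gram counts, without TF-IDF weighting), and the function names and sample addresses are made up for illustration. The query is scored against each candidate exactly once, and no candidate is ever compared with another candidate.

```python
from collections import Counter
from math import sqrt


def trigrams(text):
    """Count the character 3-grams of the lower-cased text."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    """Cosine similarity between two 3-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def match_one_to_many(query, candidates, min_similarity=0.8):
    """Score the single query against every candidate (one pass, no candidate pairs)."""
    q = trigrams(query)
    scored = [(cand, cosine(q, trigrams(cand))) for cand in candidates]
    hits = [(cand, score) for cand, score in scored if score >= min_similarity]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data standing in for the address column.
addresses = [
    "123 Main Street, Sacramento, CA 95814",
    "456 Oak Avenue, Fresno, CA 93721",
    "123 Main St, Sacramento, CA 95814",
]
matches = match_one_to_many(
    "123 Main Street Sacramento CA 95814", addresses, min_similarity=0.5
)
for address, score in matches:
    print(f"{score:.2f}  {address}")
```

With n existing addresses this does n comparisons, whereas appending the query to the column and matching the column against itself would consider on the order of n² pairs.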

By the way, regarding the California data you mentioned some time back, you could consider using the regex option of red_string_grouper (or string_grouper) to remove those common abbreviations from the strings in the Address field before performing the comparison. Then the matching would proceed much faster and there would be no matching based on those common abbreviations.
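As a rough illustration of that clean-up step, the sketch below builds a regular expression that strips punctuation, whitespace and a few street-type words, and applies it with Python's re module. The abbreviation list and the helper name are hypothetical, and applying the pattern to lower-cased text is an assumption made here for simplicity; a pattern along these lines is the kind of thing that could be supplied through the regex option.

```python
import re

# Hypothetical list of street-type words to ignore; extend it for real data.
STREET_WORDS = ["street", "st", "avenue", "ave", "boulevard", "blvd", "road", "rd"]

# Remove punctuation, whitespace, or any listed word when it stands alone.
pattern = r"[,-./]|\s|\b(?:" + "|".join(STREET_WORDS) + r")\b"


def clean(address):
    """Lower-case the address and delete everything the pattern matches."""
    return re.sub(pattern, "", address.lower())


a = clean("123 Main St., Sacramento, CA 95814")
b = clean("123 Main Street Sacramento CA 95814")
print(a)
print(b)
```

After cleaning, both spellings reduce to the same string, so the street-type word no longer contributes to (or detracts from) the similarity.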

iibarant commented 2 years ago

Hi @ParticularMiner,

That works perfectly. Thank you. One question: to test it, a tester entered a completely different address in which only the state, city and zip code matched, with similarity set to 0, expecting many records to show up. The procedure returned an empty data frame. Why?

ParticularMiner commented 2 years ago

Hi @iibarant

Sorry I did not understand your last question.

Did you mean: someone entered a new string in the address field. This string had state, city and zip code information which was expected to match many records. But nothing showed up?

If so, how large was the entire DataFrame and was the threshold similarity value (min_similarity) low enough?

iibarant commented 2 years ago

Yes, you got it right. The dataframe is much smaller, around 7 thousand records, and min_similarity was set to 0.

ParticularMiner commented 2 years ago

@iibarant And you used the original match_strings() function from string_grouper? Were any other options used?

ParticularMiner commented 2 years ago

@iibarant

So the call was something like this:

match_strings(df['address'], df_with_one_new_string['address'], min_similarity=0) ?

iibarant commented 2 years ago

I found a global variable on my side that was overriding the setting and keeping the similarity threshold high. Looks good now.
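For anyone hitting the same symptom, here is a small sketch of how this kind of bug can arise. All names are hypothetical and it does not use string_grouper: it only shows a wrapper that silently reads a module-level threshold instead of the value the caller passed, next to a version that honours the argument.

```python
MIN_SIMILARITY = 0.8  # hypothetical module-level default


def filter_matches_buggy(scores, min_similarity=0.0):
    # Bug: the argument is ignored and the global threshold is used instead.
    return [s for s in scores if s >= MIN_SIMILARITY]


def filter_matches(scores, min_similarity=MIN_SIMILARITY):
    # The global only supplies the default; an explicit argument wins.
    return [s for s in scores if s >= min_similarity]


# Hypothetical similarity scores for weakly matching addresses.
scores = [0.05, 0.3, 0.45]

print(filter_matches_buggy(scores, min_similarity=0))  # nothing passes the hidden 0.8
print(filter_matches(scores, min_similarity=0))        # every score passes
```

Passing the threshold explicitly all the way down to the matching call makes it easier to confirm which value is actually in effect.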