Bergvca / string_grouper

Super Fast String Matching in Python
MIT License
364 stars 76 forks source link

match_strings(): Any way to return additional columns? #8

Closed dbcandle closed 3 years ago

dbcandle commented 4 years ago

I can't tell if this is a design limitation, or a matter of my being a "Spamonista", rather than a "Pythonista" so please bear with me... ...I'm calling match_strings with 2 dataframes like this:

matches = match_strings(dfREFERENCE[dfREFERENCE_KeyColumn].squeeze(), dfSUBJECT[dfSUBJECT_KeyColumn].squeeze())

As you can see, since my dataframes contain more than 1 column, I specify the columns of interest (dfREFERENCE_KeyColumn & dfSUBJECT_KeyColumn).

Question: Is it fundamentally impossible to have matches return the other columns in the master & duplicates arguments to match_strings? Once I match on a text column (I'm doing "fuzzy matching" or "record-linking" between datasets with a lot of columns), I want to retain all the other columns.

Again, I can't tell if this is a design limitation, or a matter of my being a "Spamonista", rather than a "Pythonista" so please bear with me... while I'm a seasoned and aging IT guy, I mostly "push paperwork" these days, and am new to Python.

dbcandle commented 4 years ago

OK, maybe "RTFM" applies here. After reading more on the "Code" tab, I see an example that uses the StringGrouper class. I think that is what I was looking for... ...fingers crossed!

zhihongchensg commented 4 years ago

@dbcandle are you able to share. am also looking to return additional columns.

thanks plenty

dbcandle commented 4 years ago

I am still struggling because I am new to Python, but see the examples listed on the github page. Notice that the last example, which uses the underlying StringGrouper class, returns the column “Company CIK Key” in addition to the matched columns. This seems to occur on the line:

companies['deduplicated_name'] = string_grouper.get_groups()

In addition to the github page at https://github.com/Bergvca/string_grouper be sure to see the other pages at https://bergvca.github.io/2020/01/02/string-grouper.html and https://bergvca.github.io/2017/10/14/super-fast-string-matching.html

Also look at the main source-code file: string_grouper.py - lots of cmments in the code there!

I think the trick is shown on the github page as mentioned above, but it may be necessary to tap into the indices matrix (see other pages above).

Basically, I have two dataframes I want to link on a string column, with both dataframes having several columns (5-10 columns each). I want the result to include all 10-20 of the columns in the 2 matched dataframes, not just the 2 string columns matched.

If you learn any more, PLEASE let me know! :-)

Phil

Get Outlook for iOShttps://aka.ms/o0ukef


From: zhihongchensg notifications@github.com Sent: Saturday, May 16, 2020 4:48:56 AM To: Bergvca/string_grouper string_grouper@noreply.github.com Cc: dbcandle paperucci@outlook.com; Mention mention@noreply.github.com Subject: Re: [Bergvca/string_grouper] match_strings(): Any way to return additional columns? (#8)

@dbcandlehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdbcandle&data=02%7C01%7C%7Cdf40c87ae6504b56dce408d7f975f86e%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252157378584201&sdata=zEG%2FDukuubI3yulKdhxUtPpTPrERMF6666LpcvZDBbM%3D&reserved=0 are you able to share. am also looking to return additional columns.

thanks plenty

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FBergvca%2Fstring_grouper%2Fissues%2F8%23issuecomment-629611436&data=02%7C01%7C%7Cdf40c87ae6504b56dce408d7f975f86e%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252157378594194&sdata=UsroooIS0pHsWQJoDtoq9kBHTZ0HoruCaI1kf%2Bb%2B0Ps%3D&reserved=0, or unsubscribehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHNC6F2TOFVE5MEXS4HESYLRRZHPRANCNFSM4M77OWQA&data=02%7C01%7C%7Cdf40c87ae6504b56dce408d7f975f86e%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252157378594194&sdata=sdj%2BRQ5xoDRaxG5m2qNKrqJQ4xhQcbuMNprvmwsq6a4%3D&reserved=0.

dbcandle commented 4 years ago

In looking more, I just do not think the string_grouper package can return columns other than the 2 string columns being matched.

I think it is necessary to iterate through individual rows of one of the two dataframes, making a call to get_groups or get_match for each row.

Still experimenting though...

Get Outlook for iOShttps://aka.ms/o0ukef


From: zhihongchensg notifications@github.com Sent: Saturday, May 16, 2020 4:48:56 AM To: Bergvca/string_grouper string_grouper@noreply.github.com Cc: dbcandle paperucci@outlook.com; Mention mention@noreply.github.com Subject: Re: [Bergvca/string_grouper] match_strings(): Any way to return additional columns? (#8)

@dbcandlehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdbcandle&data=02%7C01%7C%7Cdf40c87ae6504b56dce408d7f975f86e%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252157378584201&sdata=zEG%2FDukuubI3yulKdhxUtPpTPrERMF6666LpcvZDBbM%3D&reserved=0 are you able to share. am also looking to return additional columns.

thanks plenty

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FBergvca%2Fstring_grouper%2Fissues%2F8%23issuecomment-629611436&data=02%7C01%7C%7Cdf40c87ae6504b56dce408d7f975f86e%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252157378594194&sdata=UsroooIS0pHsWQJoDtoq9kBHTZ0HoruCaI1kf%2Bb%2B0Ps%3D&reserved=0, or unsubscribehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHNC6F2TOFVE5MEXS4HESYLRRZHPRANCNFSM4M77OWQA&data=02%7C01%7C%7Cdf40c87ae6504b56dce408d7f975f86e%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252157378594194&sdata=sdj%2BRQ5xoDRaxG5m2qNKrqJQ4xhQcbuMNprvmwsq6a4%3D&reserved=0.

zhihongchensg commented 4 years ago

am not sure, i suspect to get the additional column (or just the original index), requires tingling with sparse_dot_topn. whatever additional data required has to be "tagged along". And sparse_dot_topn only stores the top x number of results so cant just retrieve the index. Maybe the easier solution is simply to do another round of matching using the result returned and append the additional columns required - in pd.

dbcandle commented 4 years ago

I think the trick is to use “._matches_list” from the StringGrouper class. I need to do some testing, but it looks like it returns the indexes and similarity after .fit() is run on the StringGrouper class.

Phil

From: zhihongchensg notifications@github.com Sent: Sunday, May 17, 2020 1:26 AM To: Bergvca/string_grouper string_grouper@noreply.github.com Cc: dbcandle paperucci@outlook.com; Mention mention@noreply.github.com Subject: Re: [Bergvca/string_grouper] match_strings(): Any way to return additional columns? (#8)

am not sure, i suspect to get the additional column (or just the original index), requires tingling with sparse_dot_topn. whatever additional data required has to be "tagged along". And sparse_dot_topn only stores the top x number of results so cant just retrieve the index. Maybe the easier solution is simply to do another round of matching using the result returned and append the additional columns required - in pd.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FBergvca%2Fstring_grouper%2Fissues%2F8%23issuecomment-629745560&data=02%7C01%7C%7C4b4323d228834cebe13408d7fa22bc96%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252899399773407&sdata=W38Ulw49JIz15tF2GP1GPyzLEEekU%2FiugaSC%2BySLVVU%3D&reserved=0, or unsubscribehttps://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHNC6F6WPE4FHSKYW7UOYIDRR5YNHANCNFSM4M77OWQA&data=02%7C01%7C%7C4b4323d228834cebe13408d7fa22bc96%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252899399783408&sdata=mhxBn9mZkKKSTaA%2FmGZ0U%2FI4UbVk3WcRkWaX44vEzYY%3D&reserved=0.

dbcandle commented 4 years ago

I got it to work: including other columns, not just the similarity and two text columns being compared.

Source code enclosed.

I’m new to Python. In fact, other than VBA, I’ve not programmed much in recent years, but I was a programmer for many years starting in about 1980. I switched to “paperwork” for the most part, but am still in “IT”.

I’ll be most interested in finding a way to make the main loop more efficient. Not sure if I will “fiddle” with Python, or load the matrix back to SQL Server, and process there.

==> If you find anything grossly inefficient in the main Python loop starting on line 198, “I’m all ears”! 😊 <==

Phil

From: zhihongchensg notifications@github.com Sent: Sunday, May 17, 2020 1:26 AM To: Bergvca/string_grouper string_grouper@noreply.github.com Cc: dbcandle paperucci@outlook.com; Mention mention@noreply.github.com Subject: Re: [Bergvca/string_grouper] match_strings(): Any way to return additional columns? (#8)

am not sure, i suspect to get the additional column (or just the original index), requires tingling with sparse_dot_topn. whatever additional data required has to be "tagged along". And sparse_dot_topn only stores the top x number of results so cant just retrieve the index. Maybe the easier solution is simply to do another round of matching using the result returned and append the additional columns required - in pd.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FBergvca%2Fstring_grouper%2Fissues%2F8%23issuecomment-629745560&data=02%7C01%7C%7C4b4323d228834cebe13408d7fa22bc96%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252899399773407&sdata=W38Ulw49JIz15tF2GP1GPyzLEEekU%2FiugaSC%2BySLVVU%3D&reserved=0, or unsubscribehttps://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHNC6F6WPE4FHSKYW7UOYIDRR5YNHANCNFSM4M77OWQA&data=02%7C01%7C%7C4b4323d228834cebe13408d7fa22bc96%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252899399783408&sdata=mhxBn9mZkKKSTaA%2FmGZ0U%2FI4UbVk3WcRkWaX44vEzYY%3D&reserved=0.

Bergvca commented 4 years ago

Hi @dbcandle,

I don't see the code attached. Would love to see what you are trying to do and if it can be improved. Perhaps it's functionality that can be added to the library?

dbcandle commented 4 years ago

I placed the 2 files I sent in the prior email on github: https://github.com/dbcandle/Test

Can you see that?

A few notes:

Anyway, it is an honor that you would even contact me. It looks like originally, another user of string_grouper contacted me, and CCed you, but I’m not sure. Anyway…

May the source be with you! 😊

Phil

From: Chris van den Berg notifications@github.com Sent: Saturday, May 23, 2020 5:23 AM To: Bergvca/string_grouper string_grouper@noreply.github.com Cc: dbcandle paperucci@outlook.com; Mention mention@noreply.github.com Subject: Re: [Bergvca/string_grouper] match_strings(): Any way to return additional columns? (#8)

Hi @dbcandlehttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdbcandle&data=02%7C01%7C%7Cd2728a0417144cef755208d7fefad250%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637258225523756769&sdata=Fz2V1er5PTd4CXRG14jGzInsGfpjfF54LoC%2BrPlpdSQ%3D&reserved=0,

I don't see the code attached. Would love to see what you are trying to do and if it can be improved. Perhaps it's functionality that can be added to the library?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FBergvca%2Fstring_grouper%2Fissues%2F8%23issuecomment-633013280&data=02%7C01%7C%7Cd2728a0417144cef755208d7fefad250%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637258225523766764&sdata=ME3C2qzv9ETtpe78NNRJmW8KCz5Inc59nJU9TgUtMJs%3D&reserved=0, or unsubscribehttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHNC6F3CQEUCAKG2GV6JZKLRS6IVPANCNFSM4M77OWQA&data=02%7C01%7C%7Cd2728a0417144cef755208d7fefad250%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637258225523776732&sdata=Qz6VGZujV1NtXvzACHuY%2BvK%2FLMFSDQgr4eWW6ihHnf4%3D&reserved=0.

Bergvca commented 4 years ago

Hi @dbcandle,

It seems to me you want to join the reference table to the subject table using the output of match_strings, is that correct? So dfREFERENCE --> matches --> dfSUBJECT? If that is the case you can just use the merge or join function from pandas.

Something like:

matches = match_strings(dfREFERENCE[dfREFERENCE_KeyColumn], dfSUBJECT[dfSUBJECT_KeyColumn])
df = dfSUBJECT.merge(matches, left_on=dfREFERENCE_KeyColumn, right_on='left_side') # defaults to left join - not sure if that is what you want
df = df.merge(matches, left_on='right_side', right_on='dfSUBJECT_KeyColumn')

I did not test this, but it should give you the general idea.

dbcandle commented 4 years ago

Awesome!

Get Outlook for iOShttps://aka.ms/o0ukef


From: Chris van den Berg notifications@github.com Sent: Monday, May 25, 2020 3:12:10 PM To: Bergvca/string_grouper string_grouper@noreply.github.com Cc: dbcandle paperucci@outlook.com; Mention mention@noreply.github.com Subject: Re: [Bergvca/string_grouper] match_strings(): Any way to return additional columns? (#8)

Hi @dbcandlehttps://eur04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdbcandle&data=02%7C01%7C%7C7fdf1ea097cd4013a28008d800df86b2%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637260307317049342&sdata=S%2F4b8UaRNQmNz0FBop6CKRkx4W2x%2BZ3UeBAe%2BjiS2V8%3D&reserved=0,

It seems to me you want to join the reference table to the subject table using the output of match_strings, is that correct? So dfREFERENCE --> matches --> dfSUBJECT? If that is the case you can just use the merge or join function from pandas.

Something like:

matches = match_strings(dfREFERENCE[dfREFERENCE_KeyColumn], dfSUBJECT[dfSUBJECT_KeyColumn]) df = dfSUBJECT.merge(matches, left_on=dfREFERENCE_KeyColumn, right_on='left_side') # defaults to left join - not sure if that is what you want df = df.merge(matches, left_on='right_side', right_on='dfSUBJECT_KeyColumn')

I did not test this, but it should give you the general idea.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://eur04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FBergvca%2Fstring_grouper%2Fissues%2F8%23issuecomment-633688805&data=02%7C01%7C%7C7fdf1ea097cd4013a28008d800df86b2%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637260307317049342&sdata=y%2BY%2FQDkFC9MLM9zXM92AGnIefs%2BCwKIZpkyYkfv1F0E%3D&reserved=0, or unsubscribehttps://eur04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHNC6F5GAOMUKOLE6GUFK4DRTK7IVANCNFSM4M77OWQA&data=02%7C01%7C%7C7fdf1ea097cd4013a28008d800df86b2%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637260307317059338&sdata=P5GD%2BGvTShMaIvLc4T%2FHuoWmCnvIjdjbcqhXEXmVO1I%3D&reserved=0.

Bergvca commented 4 years ago

@dbcandle - out of curiosity, did that suggestion help?

dbcandle commented 4 years ago

PERFECTLY!

I’ve been happily using string_grouper ever since!!!

Phil

From: Chris van den Berg notifications@github.com Sent: Wednesday, July 15, 2020 2:48 PM To: Bergvca/string_grouper string_grouper@noreply.github.com Cc: dbcandle paperucci@outlook.com; Mention mention@noreply.github.com Subject: Re: [Bergvca/string_grouper] match_strings(): Any way to return additional columns? (#8)

@dbcandlehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdbcandle&data=02%7C01%7C%7C6d05f98868634805885908d828ef8cf7%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637304356603611139&sdata=2SJTNL%2BrLSgSDH88R16YjDBRuPAinymfQNqhVSuRNXo%3D&reserved=0 - out of curiosity, did that suggestion help?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FBergvca%2Fstring_grouper%2Fissues%2F8%23issuecomment-658939277&data=02%7C01%7C%7C6d05f98868634805885908d828ef8cf7%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637304356603611139&sdata=S33%2FhqEL9YNpgOOvl3Wmiijp0hDNT51nP%2F%2BHSH2jDbM%3D&reserved=0, or unsubscribehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHNC6F5FVPFKWLGYOCY2HXTR3X2UVANCNFSM4M77OWQA&data=02%7C01%7C%7C6d05f98868634805885908d828ef8cf7%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637304356603621133&sdata=Ec5pp0gAlOwllCKAKr9gPvwE5KsusKACH2d2eh4wCeI%3D&reserved=0.

Bergvca commented 3 years ago

This is now supported out of the box with the optional "id" variables.

dbcandle commented 3 years ago

A W E S O M E ! ! !

Sent from my iPad

On Feb 18, 2021, at 3:45 PM, Chris van den Berg notifications@github.com wrote:

This is now supported out of the box with the optional "id" variables.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.