match_strings(): Any way to return additional columns?

dbcandle commented 4 years ago

I can't tell if this is a design limitation, or a matter of my being a "Spamonista", rather than a "Pythonista" so please bear with me... ...I'm calling match_strings with 2 dataframes like this:

matches = match_strings(dfREFERENCE[dfREFERENCE_KeyColumn].squeeze(), dfSUBJECT[dfSUBJECT_KeyColumn].squeeze())

As you can see, since my dataframes contain more than 1 column, I specify the columns of interest (dfREFERENCE_KeyColumn & dfSUBJECT_KeyColumn).

Question: Is it fundamentally impossible to have matches return the other columns in the master & duplicates arguments to match_strings? Once I match on a text column (I'm doing "fuzzy matching" or "record-linking" between datasets with a lot of columns), I want to retain all the other columns.

Again, I can't tell if this is a design limitation, or a matter of my being a "Spamonista", rather than a "Pythonista" so please bear with me... while I'm a seasoned and aging IT guy, I mostly "push paperwork" these days, and am new to Python.

dbcandle commented 4 years ago

OK, maybe "RTFM" applies here. After reading more on the "Code" tab, I see an example that uses the StringGrouper class. I think that is what I was looking for... ...fingers crossed!

zhihongchensg commented 4 years ago

@dbcandle are you able to share. am also looking to return additional columns.

thanks plenty

dbcandle commented 4 years ago

I am still struggling because I am new to Python, but see the examples listed on the github page. Notice that the last example, which uses the underlying StringGrouper class, returns the column “Company CIK Key” in addition to the matched columns. This seems to occur on the line:

companies['deduplicated_name'] = string_grouper.get_groups()

In addition to the github page at https://github.com/Bergvca/string_grouper be sure to see the other pages at https://bergvca.github.io/2020/01/02/string-grouper.html and https://bergvca.github.io/2017/10/14/super-fast-string-matching.html

Also look at the main source-code file: string_grouper.py - lots of cmments in the code there!

I think the trick is shown on the github page as mentioned above, but it may be necessary to tap into the indices matrix (see other pages above).

Basically, I have two dataframes I want to link on a string column, with both dataframes having several columns (5-10 columns each). I want the result to include all 10-20 of the columns in the 2 matched dataframes, not just the 2 string columns matched.

If you learn any more, PLEASE let me know! :-)

Phil

Get Outlook for iOShttps://aka.ms/o0ukef

From: zhihongchensg notifications@github.com Sent: Saturday, May 16, 2020 4:48:56 AM To: Bergvca/string_grouper string_grouper@noreply.github.com Cc: dbcandle paperucci@outlook.com; Mention mention@noreply.github.com Subject: Re: [Bergvca/string_grouper] match_strings(): Any way to return additional columns? (#8)

@dbcandlehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdbcandle&data=02%7C01%7C%7Cdf40c87ae6504b56dce408d7f975f86e%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252157378584201&sdata=zEG%2FDukuubI3yulKdhxUtPpTPrERMF6666LpcvZDBbM%3D&reserved=0 are you able to share. am also looking to return additional columns.

thanks plenty

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FBergvca%2Fstring_grouper%2Fissues%2F8%23issuecomment-629611436&data=02%7C01%7C%7Cdf40c87ae6504b56dce408d7f975f86e%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252157378594194&sdata=UsroooIS0pHsWQJoDtoq9kBHTZ0HoruCaI1kf%2Bb%2B0Ps%3D&reserved=0, or unsubscribehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHNC6F2TOFVE5MEXS4HESYLRRZHPRANCNFSM4M77OWQA&data=02%7C01%7C%7Cdf40c87ae6504b56dce408d7f975f86e%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252157378594194&sdata=sdj%2BRQ5xoDRaxG5m2qNKrqJQ4xhQcbuMNprvmwsq6a4%3D&reserved=0.

dbcandle commented 4 years ago

In looking more, I just do not think the string_grouper package can return columns other than the 2 string columns being matched.

I think it is necessary to iterate through individual rows of one of the two dataframes, making a call to get_groups or get_match for each row.

Still experimenting though...

Get Outlook for iOShttps://aka.ms/o0ukef

From: zhihongchensg notifications@github.com Sent: Saturday, May 16, 2020 4:48:56 AM To: Bergvca/string_grouper string_grouper@noreply.github.com Cc: dbcandle paperucci@outlook.com; Mention mention@noreply.github.com Subject: Re: [Bergvca/string_grouper] match_strings(): Any way to return additional columns? (#8)

@dbcandlehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdbcandle&data=02%7C01%7C%7Cdf40c87ae6504b56dce408d7f975f86e%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252157378584201&sdata=zEG%2FDukuubI3yulKdhxUtPpTPrERMF6666LpcvZDBbM%3D&reserved=0 are you able to share. am also looking to return additional columns.

thanks plenty

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FBergvca%2Fstring_grouper%2Fissues%2F8%23issuecomment-629611436&data=02%7C01%7C%7Cdf40c87ae6504b56dce408d7f975f86e%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252157378594194&sdata=UsroooIS0pHsWQJoDtoq9kBHTZ0HoruCaI1kf%2Bb%2B0Ps%3D&reserved=0, or unsubscribehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHNC6F2TOFVE5MEXS4HESYLRRZHPRANCNFSM4M77OWQA&data=02%7C01%7C%7Cdf40c87ae6504b56dce408d7f975f86e%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252157378594194&sdata=sdj%2BRQ5xoDRaxG5m2qNKrqJQ4xhQcbuMNprvmwsq6a4%3D&reserved=0.

zhihongchensg commented 4 years ago

am not sure, i suspect to get the additional column (or just the original index), requires tingling with sparse_dot_topn. whatever additional data required has to be "tagged along". And sparse_dot_topn only stores the top x number of results so cant just retrieve the index. Maybe the easier solution is simply to do another round of matching using the result returned and append the additional columns required - in pd.

dbcandle commented 4 years ago

I think the trick is to use “._matches_list” from the StringGrouper class. I need to do some testing, but it looks like it returns the indexes and similarity after .fit() is run on the StringGrouper class.

Phil

From: zhihongchensg notifications@github.com Sent: Sunday, May 17, 2020 1:26 AM To: Bergvca/string_grouper string_grouper@noreply.github.com Cc: dbcandle paperucci@outlook.com; Mention mention@noreply.github.com Subject: Re: [Bergvca/string_grouper] match_strings(): Any way to return additional columns? (#8)

am not sure, i suspect to get the additional column (or just the original index), requires tingling with sparse_dot_topn. whatever additional data required has to be "tagged along". And sparse_dot_topn only stores the top x number of results so cant just retrieve the index. Maybe the easier solution is simply to do another round of matching using the result returned and append the additional columns required - in pd.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FBergvca%2Fstring_grouper%2Fissues%2F8%23issuecomment-629745560&data=02%7C01%7C%7C4b4323d228834cebe13408d7fa22bc96%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252899399773407&sdata=W38Ulw49JIz15tF2GP1GPyzLEEekU%2FiugaSC%2BySLVVU%3D&reserved=0, or unsubscribehttps://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHNC6F6WPE4FHSKYW7UOYIDRR5YNHANCNFSM4M77OWQA&data=02%7C01%7C%7C4b4323d228834cebe13408d7fa22bc96%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252899399783408&sdata=mhxBn9mZkKKSTaA%2FmGZ0U%2FI4UbVk3WcRkWaX44vEzYY%3D&reserved=0.

dbcandle commented 4 years ago

I got it to work: including other columns, not just the similarity and two text columns being compared.

Source code enclosed.

It initially pulls data from SQL Server, so you can ignore that. Still, I enclose sql_server.py, should it amuse you.
Rather than installing the string_grouper package, I added string_grouper.py to my project, and import from it directly.
In the source code, I use the name dfREFERENCE rather than master, and dfSUBJECT, rather than duplicates. For example, dfREFERENCE can be a published high-quality data-set. dfSUBJECT is the “subject” of the comparison; messy data being matched against the “reference” data.
The “core” of this – our initial dialog below - can be found starting on line 171. The key is to use string_grouper._matches_list after calling string_grouper.fit(). It is a simple matrix with the index of the 2 tables being compared, and the similarity.
I ran this with about 1.1M records in dfREFERENCE, and 20K in dfSUBJECT:
- String_grouper.fit() ran in about 2 minutes.
- The main loop starting on line 198 ran in about 17 hours on a 5th generation i5 laptop, with good SSD and 16Gb RAM.

I’m new to Python. In fact, other than VBA, I’ve not programmed much in recent years, but I was a programmer for many years starting in about 1980. I switched to “paperwork” for the most part, but am still in “IT”.

I’ll be most interested in finding a way to make the main loop more efficient. Not sure if I will “fiddle” with Python, or load the matrix back to SQL Server, and process there.

==> If you find anything grossly inefficient in the main Python loop starting on line 198, “I’m all ears”! 😊 <==

Phil

From: zhihongchensg notifications@github.com Sent: Sunday, May 17, 2020 1:26 AM To: Bergvca/string_grouper string_grouper@noreply.github.com Cc: dbcandle paperucci@outlook.com; Mention mention@noreply.github.com Subject: Re: [Bergvca/string_grouper] match_strings(): Any way to return additional columns? (#8)

am not sure, i suspect to get the additional column (or just the original index), requires tingling with sparse_dot_topn. whatever additional data required has to be "tagged along". And sparse_dot_topn only stores the top x number of results so cant just retrieve the index. Maybe the easier solution is simply to do another round of matching using the result returned and append the additional columns required - in pd.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FBergvca%2Fstring_grouper%2Fissues%2F8%23issuecomment-629745560&data=02%7C01%7C%7C4b4323d228834cebe13408d7fa22bc96%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252899399773407&sdata=W38Ulw49JIz15tF2GP1GPyzLEEekU%2FiugaSC%2BySLVVU%3D&reserved=0, or unsubscribehttps://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHNC6F6WPE4FHSKYW7UOYIDRR5YNHANCNFSM4M77OWQA&data=02%7C01%7C%7C4b4323d228834cebe13408d7fa22bc96%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637252899399783408&sdata=mhxBn9mZkKKSTaA%2FmGZ0U%2FI4UbVk3WcRkWaX44vEzYY%3D&reserved=0.

Bergvca commented 4 years ago

Hi @dbcandle,

I don't see the code attached. Would love to see what you are trying to do and if it can be improved. Perhaps it's functionality that can be added to the library?

dbcandle commented 4 years ago

I placed the 2 files I sent in the prior email on github: https://github.com/dbcandle/Test

Can you see that?

A few notes:

Please don’t laugh at the code… I’m more of a paper-pusher these days (but was formerly a programmer since 1980).
The code that does what I was struggling with is in the main file: “main refactor (SQL Server Hybrid v4.0).py”. I also included “sql_server.py” should it amuse you (I’m an ex-Oracle DBA, and use SQL Server quite a bit).
Rather than the terms “master” and “duplicates”, my own code uses the terms:
- dfSUBJECT – This is the “subject” of the searching, for example a messy list of software products (e.g. “Microzoft Office”).
- dfREFERENCE – This is what I’m comparing the “subject” data with, for example a very precise and accurate list of software products (e.g. “Microsoft Office 365”).
The “main loop” starts on line 171. It is based on:
- Calling string_grouper._matches_list directly, after calling string_grouper.fit().
- Using pandas to concatenate rows from dfREFERENCE and dfSUBJECT. This allows returning any additional columns, not just the 2 columns used to generate string_grouper._matches_list.
Again, please forgive my ignorance of Python. For years, I’ve been looking for a fast and effective way to do “fuzzy matching” (a.k.a. record linking), but tools like OpenRefine and other free offerings were FAR TOO SLOW. I suspect the reason that the “main loop” takes 17 hours for my data (dfSUBJECT has 20K rows, dfREFERENCE has 1.1M rows) is my use of pandas for each loop iteration. I suspect that the main loop would be better using native Python data-types, but that is beyond my Python skill at this time. I’m struggling to learn, simultaneously:
- scikit-learn
- pandas
- python
- git
- github
- Pycharm

Anyway, it is an honor that you would even contact me. It looks like originally, another user of string_grouper contacted me, and CCed you, but I’m not sure. Anyway…

May the source be with you! 😊

Phil

From: Chris van den Berg notifications@github.com Sent: Saturday, May 23, 2020 5:23 AM To: Bergvca/string_grouper string_grouper@noreply.github.com Cc: dbcandle paperucci@outlook.com; Mention mention@noreply.github.com Subject: Re: [Bergvca/string_grouper] match_strings(): Any way to return additional columns? (#8)

Hi @dbcandlehttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdbcandle&data=02%7C01%7C%7Cd2728a0417144cef755208d7fefad250%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637258225523756769&sdata=Fz2V1er5PTd4CXRG14jGzInsGfpjfF54LoC%2BrPlpdSQ%3D&reserved=0,

I don't see the code attached. Would love to see what you are trying to do and if it can be improved. Perhaps it's functionality that can be added to the library?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FBergvca%2Fstring_grouper%2Fissues%2F8%23issuecomment-633013280&data=02%7C01%7C%7Cd2728a0417144cef755208d7fefad250%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637258225523766764&sdata=ME3C2qzv9ETtpe78NNRJmW8KCz5Inc59nJU9TgUtMJs%3D&reserved=0, or unsubscribehttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHNC6F3CQEUCAKG2GV6JZKLRS6IVPANCNFSM4M77OWQA&data=02%7C01%7C%7Cd2728a0417144cef755208d7fefad250%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637258225523776732&sdata=Qz6VGZujV1NtXvzACHuY%2BvK%2FLMFSDQgr4eWW6ihHnf4%3D&reserved=0.

Bergvca commented 4 years ago

Hi @dbcandle,

It seems to me you want to join the reference table to the subject table using the output of match_strings, is that correct? So dfREFERENCE --> matches --> dfSUBJECT? If that is the case you can just use the merge or join function from pandas.

Something like:

matches = match_strings(dfREFERENCE[dfREFERENCE_KeyColumn], dfSUBJECT[dfSUBJECT_KeyColumn])
df = dfSUBJECT.merge(matches, left_on=dfREFERENCE_KeyColumn, right_on='left_side') # defaults to left join - not sure if that is what you want
df = df.merge(matches, left_on='right_side', right_on='dfSUBJECT_KeyColumn')

I did not test this, but it should give you the general idea.

dbcandle commented 4 years ago

Awesome!

Get Outlook for iOShttps://aka.ms/o0ukef

From: Chris van den Berg notifications@github.com Sent: Monday, May 25, 2020 3:12:10 PM To: Bergvca/string_grouper string_grouper@noreply.github.com Cc: dbcandle paperucci@outlook.com; Mention mention@noreply.github.com Subject: Re: [Bergvca/string_grouper] match_strings(): Any way to return additional columns? (#8)

Hi @dbcandlehttps://eur04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdbcandle&data=02%7C01%7C%7C7fdf1ea097cd4013a28008d800df86b2%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637260307317049342&sdata=S%2F4b8UaRNQmNz0FBop6CKRkx4W2x%2BZ3UeBAe%2BjiS2V8%3D&reserved=0,

It seems to me you want to join the reference table to the subject table using the output of match_strings, is that correct? So dfREFERENCE --> matches --> dfSUBJECT? If that is the case you can just use the merge or join function from pandas.

Something like:

matches = match_strings(dfREFERENCE[dfREFERENCE_KeyColumn], dfSUBJECT[dfSUBJECT_KeyColumn]) df = dfSUBJECT.merge(matches, left_on=dfREFERENCE_KeyColumn, right_on='left_side') # defaults to left join - not sure if that is what you want df = df.merge(matches, left_on='right_side', right_on='dfSUBJECT_KeyColumn')

I did not test this, but it should give you the general idea.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://eur04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FBergvca%2Fstring_grouper%2Fissues%2F8%23issuecomment-633688805&data=02%7C01%7C%7C7fdf1ea097cd4013a28008d800df86b2%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637260307317049342&sdata=y%2BY%2FQDkFC9MLM9zXM92AGnIefs%2BCwKIZpkyYkfv1F0E%3D&reserved=0, or unsubscribehttps://eur04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHNC6F5GAOMUKOLE6GUFK4DRTK7IVANCNFSM4M77OWQA&data=02%7C01%7C%7C7fdf1ea097cd4013a28008d800df86b2%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637260307317059338&sdata=P5GD%2BGvTShMaIvLc4T%2FHuoWmCnvIjdjbcqhXEXmVO1I%3D&reserved=0.

Bergvca commented 4 years ago

@dbcandle - out of curiosity, did that suggestion help?

dbcandle commented 4 years ago

PERFECTLY!

I’ve been happily using string_grouper ever since!!!

Phil

From: Chris van den Berg notifications@github.com Sent: Wednesday, July 15, 2020 2:48 PM To: Bergvca/string_grouper string_grouper@noreply.github.com Cc: dbcandle paperucci@outlook.com; Mention mention@noreply.github.com Subject: Re: [Bergvca/string_grouper] match_strings(): Any way to return additional columns? (#8)

@dbcandlehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdbcandle&data=02%7C01%7C%7C6d05f98868634805885908d828ef8cf7%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637304356603611139&sdata=2SJTNL%2BrLSgSDH88R16YjDBRuPAinymfQNqhVSuRNXo%3D&reserved=0 - out of curiosity, did that suggestion help?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FBergvca%2Fstring_grouper%2Fissues%2F8%23issuecomment-658939277&data=02%7C01%7C%7C6d05f98868634805885908d828ef8cf7%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637304356603611139&sdata=S33%2FhqEL9YNpgOOvl3Wmiijp0hDNT51nP%2F%2BHSH2jDbM%3D&reserved=0, or unsubscribehttps://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAHNC6F5FVPFKWLGYOCY2HXTR3X2UVANCNFSM4M77OWQA&data=02%7C01%7C%7C6d05f98868634805885908d828ef8cf7%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637304356603621133&sdata=Ec5pp0gAlOwllCKAKr9gPvwE5KsusKACH2d2eh4wCeI%3D&reserved=0.

Bergvca commented 3 years ago

This is now supported out of the box with the optional "id" variables.

dbcandle commented 3 years ago

A W E S O M E ! ! !

Sent from my iPad

On Feb 18, 2021, at 3:45 PM, Chris van den Berg notifications@github.com wrote:

This is now supported out of the box with the optional "id" variables.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

Bergvca / string_grouper

match_strings(): Any way to return additional columns? #8