For example, I have this code:

```python
matches = match_strings(df_chunks['common_name'], keywords, n_blocks=(1, 200), min_similarity=0.45)
```

which matches the rows of the `common_name` column of the `df_chunks` DataFrame against the `Series` object that I got from parsing keywords. But what I want is the match scores (if there are any) together with the source columns of the `df_chunks` DataFrame.
@campbellalex321
We may be in different time zones: I went to bed for the night. Sorry for the delayed response.
I'm not sure if I understood your question correctly. But I will say this: if you want any column(s) of your source `DataFrame`s to appear as keys in the output from `match_strings()`, simply use the `.set_index()` method of those source `DataFrame`s on those columns prior to calling `match_strings()`.
For example,
```python
import pandas as pd
from string_grouper import match_strings

# Create a small set of artificial customer names:
customers_df = pd.DataFrame(
    [
        ('BB016741P', 'Mega Enterprises Corporation'),
        ('CC082744L', 'Hyper Startup Incorporated'),
        ('AA098762D', 'Hyper Startup Inc.'),
        ('BB099931J', 'Hyper-Startup Inc.'),
        ('HH072982K', 'Hyper Hyper Inc.')
    ],
    columns=('Customer ID', 'Customer Name')
)

# Display the data:
customers_df
```
| | Customer ID | Customer Name |
|---|---|---|
| 0 | BB016741P | Mega Enterprises Corporation |
| 1 | CC082744L | Hyper Startup Incorporated |
| 2 | AA098762D | Hyper Startup Inc. |
| 3 | BB099931J | Hyper-Startup Inc. |
| 4 | HH072982K | Hyper Hyper Inc. |
```python
customers_df = customers_df.set_index('Customer ID')

keywords = pd.DataFrame(
    [
        ('KW_BB016741P', 'Mega Enterprises Corporation'),
        ('KW_CC082744L', 'Hyper Startup Incorporated'),
        ('KW_HH072982K', 'Hyper Hyper Inc.')
    ],
    columns=('Keyword ID', 'Keyword')
)
keywords = keywords.set_index('Keyword ID')

# Turn the `DataFrame` into a `Series`:
keywords = keywords.squeeze(1)

# Display the data:
keywords
```
```
Keyword ID
KW_BB016741P    Mega Enterprises Corporation
KW_CC082744L      Hyper Startup Incorporated
KW_HH072982K                Hyper Hyper Inc.
Name: Keyword, dtype: object
```
```python
match_strings(customers_df['Customer Name'], keywords, n_blocks=(1, 200), min_similarity=0.45)
```
| | left_Customer ID | left_Customer Name | similarity | right_Keyword | right_Keyword ID |
|---|---|---|---|---|---|
| 0 | BB016741P | Mega Enterprises Corporation | 1.000000 | Mega Enterprises Corporation | KW_BB016741P |
| 1 | CC082744L | Hyper Startup Incorporated | 1.000000 | Hyper Startup Incorporated | KW_CC082744L |
| 2 | AA098762D | Hyper Startup Inc. | 0.710885 | Hyper Startup Incorporated | KW_CC082744L |
| 3 | BB099931J | Hyper-Startup Inc. | 0.710885 | Hyper Startup Incorporated | KW_CC082744L |
| 4 | HH072982K | Hyper Hyper Inc. | 1.000000 | Hyper Hyper Inc. | KW_HH072982K |
A similar example can be found in the docs. I hope this answers your question.
@campbellalex321
Perhaps I misunderstood your question: see if Red String Grouper meets your needs:
> Record Equivalence Discoverer based on String Grouper (Red String Grouper) is a python package that finds similarities between rows/records of a table with multiple fields. It is an extension of String Grouper ...
Thanks @ParticularMiner. I tried to apply what you mentioned, but I am still running into issues. Do both DataFrames need to have `set_index` applied to them? The base DataFrame I'm comparing against only has one column. Here is my workflow: I parse a list of keywords from a column in an Excel file, and then I have a SQL query that goes through and tries to match these keywords. Here is the code I have; let me know if you need more.
print("connection received")
keywords = keywords.squeeze(1)
print(keywords)
df_chunks = pd.read_sql_query(query, conn)
df_chunks.set_index('ord_id')
# The below code will take the longest part of the domain and assign it as a common name
# I know this will have some pitfalls but hopefully not many, most domains are longer than
# the prefix and suffix.
df_chunks['common_name'] = df_chunks['common_name'].apply(lambda d: max(d.split('.'), key=len))
print(df_chunks)
matches = match_strings(df_chunks['common_name'], keywords, n_blocks=(1, 200), min_similarity = .15)
pd.set_option('max_columns', None)
print(matches)
`df_chunks` holds the query results, and `keywords` holds the results from the Excel file.
Hi @campbellalex321
What error do you get when you set the index of only one data set? It should work whether you set the index on one data set, on both, or on neither. (See also this link.)
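For instance, sticking with the artificial customer data from the earlier example, here is a minimal sketch (assuming `keywords` is left with its default integer index this time) in which only the left DataFrame has its index set:

```python
# keywords without a custom index this time:
keywords = pd.Series(
    ['Mega Enterprises Corporation', 'Hyper Startup Incorporated', 'Hyper Hyper Inc.'],
    name='Keyword'
)

# Only the left DataFrame has its index set (to 'Customer ID'):
matches = match_strings(customers_df['Customer Name'], keywords, min_similarity=0.45)

# The output should then have left_Customer ID, left_Customer Name, similarity,
# right_Keyword and right_index columns -- the right side falls back to plain
# integer positions because no index was set on it.
```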
By the way, for readability, if you want to post a Python code snippet, enclose it in triple backticks like this:

````
```python
# code here
```
````

to get:

```python
# code here
```
No error; it's just not pulling in the columns that I want. Also, let's say I want more than one column: would I just add that column to the index? Thanks for the hint on formatting the code, by the way; I was trying to figure that out but couldn't find it in the shortcuts. The code above has been updated so it's easier to read.
@campbellalex321
I'm not sure what the problem is here without seeing the data itself. All I can think of is that the resulting strings after applying the `lambda` function are not very similar to those in the `keywords` Series, or that your data sets are very small. Try a very small similarity threshold, say `min_similarity=0.001`.
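In other words, as a quick diagnostic (a sketch only, leaving the rest of your workflow unchanged), drop the threshold to near zero just to see whether any candidate pairs are found at all:

```python
# Diagnostic only: a near-zero threshold surfaces even very weak matches,
# which tells you whether the two vocabularies overlap at all.
matches = match_strings(df_chunks['common_name'], keywords, n_blocks=(1, 200), min_similarity=0.001)
print(matches.sort_values('similarity', ascending=False).head(20))
```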
I'm still getting matches with 0.15; I'm just not able to get the column that I need from the source table into the output. I'll keep tinkering with it. Thanks for the help.
@campbellalex321
Which column is that? I expect `ord_id` to be in the output.
Right now I'm getting `left_index` and `right_index` columns that look like just random numbers, plus `left_common_name`, `similarity`, and `right_keyword`. It looks like the only columns being returned are the ones being matched.
Oh geez :) I figured it out. I needed to do `df_chunks = df_chunks.set_index('ord_id')`. Geez :D
Thanks for the help!
Good. I was just about to tell you to do that. 😆
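For anyone else hitting this: the gotcha is that `DataFrame.set_index()` returns a new DataFrame rather than modifying the original in place, so the result must be assigned back (or `inplace=True` passed). It also accepts a list of columns, which covers the earlier question about putting more than one column in the index. A small sketch:

```python
# set_index() returns a NEW DataFrame; the original is left untouched:
df_chunks.set_index('ord_id')                # result discarded -- no effect
df_chunks = df_chunks.set_index('ord_id')    # correct: assign it back
# or, equivalently:
# df_chunks.set_index('ord_id', inplace=True)

# Several columns can go into the index at once; each level of the resulting
# MultiIndex should then come through as its own left_*/right_* column in the
# match_strings() output ('some_other_column' is a hypothetical name):
# df_chunks = df_chunks.set_index(['ord_id', 'some_other_column'])
```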
Sorry I keep bugging you about your library, but thanks for your quick responses. So I have a table that I'm searching through and a list of keywords that I'm searching against. I want to include some sort of identifier from the table in the resulting DataFrame so that, if I get any hits, I can take that primary key and pull more information from the table. Right now I just have the keyword match, and since that's not a primary key, joining it back to the original table may take a while and return many more records than I need. So is it possible to have the left matched word, the similarity score, the right matched word, and then the left table's primary key as a column? Or would I have to join using the search words I'm matching against? It may be easier to just include all the columns from the left table that I care about so I don't have to go back to the source, but I'm not sure if that will slow things down.
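Putting the thread together, a minimal sketch of the working pattern (with `query`, `conn` and the keyword-loading step as in the code above; the Excel filename is illustrative, and `ord_id` stands in for the table's primary key):

```python
import pandas as pd
from string_grouper import match_strings

# keywords: a Series parsed from the Excel file (loader shown here is illustrative):
keywords = pd.read_excel('keywords.xlsx').squeeze(1)

# df_chunks: the SQL query results, indexed by the primary key so that
# ord_id is carried into the output as left_ord_id:
df_chunks = pd.read_sql_query(query, conn)
df_chunks = df_chunks.set_index('ord_id')

matches = match_strings(df_chunks['common_name'], keywords, n_blocks=(1, 200), min_similarity=0.45)

# matches now contains left_ord_id, left_common_name, similarity and the
# right-hand keyword columns; left_ord_id can be joined back to the source
# table whenever more fields are needed for a hit.
```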