For example, I have this code:

```python
matches = match_strings(df_chunks['common_name'], keywords, n_blocks=(1, 200), min_similarity=0.45)
```

which matches the rows of the `common_name` column of the `df_chunks` DataFrame against the `Series` object that I got from parsing keywords. But what I want is the match scores (if there are any) together with the source columns of the `df_chunks` DataFrame.
@campbellalex321
We may be in different time zones: I went to bed for the night. Sorry for the delayed response.
I'm not sure if I understood your question correctly. But I will say this: if you want any column(s) of your source `DataFrame`s to appear as keys in the output from `match_strings()`, simply use the `.set_index()` method of those source `DataFrame`s on those columns prior to calling `match_strings()`.
For example,
```python
import pandas as pd
from string_grouper import match_strings

# Create a small set of artificial customer names:
customers_df = pd.DataFrame(
    [
        ('BB016741P', 'Mega Enterprises Corporation'),
        ('CC082744L', 'Hyper Startup Incorporated'),
        ('AA098762D', 'Hyper Startup Inc.'),
        ('BB099931J', 'Hyper-Startup Inc.'),
        ('HH072982K', 'Hyper Hyper Inc.')
    ],
    columns=('Customer ID', 'Customer Name')
)

# Display the data:
customers_df
```
| | Customer ID | Customer Name |
|---|---|---|
| 0 | BB016741P | Mega Enterprises Corporation |
| 1 | CC082744L | Hyper Startup Incorporated |
| 2 | AA098762D | Hyper Startup Inc. |
| 3 | BB099931J | Hyper-Startup Inc. |
| 4 | HH072982K | Hyper Hyper Inc. |
```python
customers_df = customers_df.set_index('Customer ID')

keywords = pd.DataFrame(
    [
        ('KW_BB016741P', 'Mega Enterprises Corporation'),
        ('KW_CC082744L', 'Hyper Startup Incorporated'),
        ('KW_HH072982K', 'Hyper Hyper Inc.')
    ],
    columns=('Keyword ID', 'Keyword')
)
keywords = keywords.set_index('Keyword ID')

# Turn the `DataFrame` into a `Series`:
keywords = keywords.squeeze(1)

# Display the data:
keywords
```
```
Keyword ID
KW_BB016741P    Mega Enterprises Corporation
KW_CC082744L      Hyper Startup Incorporated
KW_HH072982K                Hyper Hyper Inc.
Name: Keyword, dtype: object
```
```python
match_strings(customers_df['Customer Name'], keywords, n_blocks=(1, 200), min_similarity=0.45)
```
| | left_Customer ID | left_Customer Name | similarity | right_Keyword | right_Keyword ID |
|---|---|---|---|---|---|
| 0 | BB016741P | Mega Enterprises Corporation | 1.000000 | Mega Enterprises Corporation | KW_BB016741P |
| 1 | CC082744L | Hyper Startup Incorporated | 1.000000 | Hyper Startup Incorporated | KW_CC082744L |
| 2 | AA098762D | Hyper Startup Inc. | 0.710885 | Hyper Startup Incorporated | KW_CC082744L |
| 3 | BB099931J | Hyper-Startup Inc. | 0.710885 | Hyper Startup Incorporated | KW_CC082744L |
| 4 | HH072982K | Hyper Hyper Inc. | 1.000000 | Hyper Hyper Inc. | KW_HH072982K |
A similar example can be found in the docs. I hope this answers your question.
@campbellalex321
Perhaps I misunderstood your question: see if Red String Grouper meets your needs:
> Record Equivalence Discoverer based on String Grouper (Red String Grouper) is a python package that finds similarities between rows/records of a table with multiple fields. It is an extension of String Grouper ...
Thanks @ParticularMiner. I tried to apply what you mentioned, but I am still running into issues. Do both DataFrames need to have `set_index` applied to them? The base DataFrame I'm comparing against only has one column. Here is my workflow: I parse a list of keywords from a column in an Excel file, and then I have a SQL query that goes through and tries to match these keywords. Here is the code I have; let me know if you need more.
print("connection received")
keywords = keywords.squeeze(1)
print(keywords)
df_chunks = pd.read_sql_query(query, conn)
df_chunks.set_index('ord_id')
# The below code will take the longest part of the domain and assign it as a common name
# I know this will have some pitfalls but hopefully not many, most domains are longer than
# the prefix and suffix.
df_chunks['common_name'] = df_chunks['common_name'].apply(lambda d: max(d.split('.'), key=len))
print(df_chunks)
matches = match_strings(df_chunks['common_name'], keywords, n_blocks=(1, 200), min_similarity = .15)
pd.set_option('max_columns', None)
print(matches)
`df_chunks` holds the query results, and `keywords` holds the results from the Excel file.
Hi @campbellalex321
What error do you get when you set the index of only one data set? It should work whether you set the index on one data set, on both, or on neither. (See also this link.)
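For instance, sticking with the artificial customer data from the earlier example, here is a minimal sketch (assuming `keywords` is left with its default integer index this time) in which only the left DataFrame has its index set:

```python
# keywords without a custom index this time:
keywords = pd.Series(
    ['Mega Enterprises Corporation', 'Hyper Startup Incorporated', 'Hyper Hyper Inc.'],
    name='Keyword'
)

# Only the left DataFrame has its index set (to 'Customer ID'):
matches = match_strings(customers_df['Customer Name'], keywords, min_similarity=0.45)

# The output should then have left_Customer ID, left_Customer Name, similarity,
# right_Keyword and right_index columns -- the right side falls back to plain
# integer positions because no index was set on it.
```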
By the way, for readability, if you want to post a Python code snippet, enclose it in triple backticks like this:

````
```python
# code here
```
````

to get:

```python
# code here
```
No error; it's just not pulling in the columns that I want. Also, let's say I want more than one column: would I just add that column to the index? Thanks for the hint on formatting the code, by the way; I was trying to figure that out but couldn't find it in the shortcuts. The code above has been updated so it's easier to read.
@campbellalex321
I'm not sure what the problem is here without seeing the data itself. All I can think of is that the resulting strings after applying the `lambda` function are not very similar to those in the `keywords` Series, or that your data sets are very small. Try a very small similarity threshold, say `min_similarity=0.001`.
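In other words, as a quick diagnostic (a sketch only, leaving the rest of your workflow unchanged), drop the threshold to near zero just to see whether any candidate pairs are found at all:

```python
# Diagnostic only: a near-zero threshold surfaces even very weak matches,
# which tells you whether the two vocabularies overlap at all.
matches = match_strings(df_chunks['common_name'], keywords, n_blocks=(1, 200), min_similarity=0.001)
print(matches.sort_values('similarity', ascending=False).head(20))
```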
I'm still getting matches with 0.15; I'm just not able to get the column that I need from the source table into the output. I'll keep tinkering with it. Thanks for the help.
@campbellalex321
Which column is that? I expect `ord_id` to be in the output.
Right now I'm getting `left_index` and `right_index` columns that look like just random numbers, plus `left_common_name`, `similarity`, and `right_keyword`. It looks like the only columns being returned are the ones being matched.
Oh geez :) I figured it out. I needed to do `df_chunks = df_chunks.set_index('ord_id')`. Geez :D
Thanks for the help!
Good. I was just about to tell you to do that. 😆
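For anyone else hitting this: the gotcha is that `DataFrame.set_index()` returns a new DataFrame rather than modifying the original in place, so the result must be assigned back (or `inplace=True` passed). It also accepts a list of columns, which covers the earlier question about putting more than one column in the index. A small sketch:

```python
# set_index() returns a NEW DataFrame; the original is left untouched:
df_chunks.set_index('ord_id')                # result discarded -- no effect
df_chunks = df_chunks.set_index('ord_id')    # correct: assign it back
# or, equivalently:
# df_chunks.set_index('ord_id', inplace=True)

# Several columns can go into the index at once; each level of the resulting
# MultiIndex should then come through as its own left_*/right_* column in the
# match_strings() output ('some_other_column' is a hypothetical name):
# df_chunks = df_chunks.set_index(['ord_id', 'some_other_column'])
```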
Sorry I keep bugging you about your library, but thanks for your quick responses. So I have a table that I'm searching through and a list of keywords that I'm searching against. I want to include some sort of identifier from the table in the resulting DataFrame so that, if I get any hits, I can take that primary key and pull more information from the table. Right now I just have the keyword match, and since that's not a primary key, joining it back to the original table may take a while and return many more records than I need. So is it possible to have the left matched word, the similarity score, the right matched word, and then the left table's primary key as a column? Or would I have to join using the search words I'm matching against? It may be easier to just include all the columns from the left table that I care about so I don't have to go back to the source, but I'm not sure if that will slow things down.
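Putting the thread together, a minimal sketch of the working pattern (with `query`, `conn` and the keyword-loading step as in the code above; the Excel filename is illustrative, and `ord_id` stands in for the table's primary key):

```python
import pandas as pd
from string_grouper import match_strings

# keywords: a Series parsed from the Excel file (loader shown here is illustrative):
keywords = pd.read_excel('keywords.xlsx').squeeze(1)

# df_chunks: the SQL query results, indexed by the primary key so that
# ord_id is carried into the output as left_ord_id:
df_chunks = pd.read_sql_query(query, conn)
df_chunks = df_chunks.set_index('ord_id')

matches = match_strings(df_chunks['common_name'], keywords, n_blocks=(1, 200), min_similarity=0.45)

# matches now contains left_ord_id, left_common_name, similarity and the
# right-hand keyword columns; left_ord_id can be joined back to the source
# table whenever more fields are needed for a hit.
```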