Bergvca / string_grouper

Super Fast String Matching in Python
MIT License

Question: How to have a built StringGrouper corpus persist across multiple match_strings calls in a programming session #69

Open justasojourner opened 3 years ago

justasojourner commented 3 years ago

Hi, this is the logical next step in the clean-up process of data extracted from a database having an ID and name field. It's a follow-on from the use case documented in the first String Grouper tutorial, which I wrote.

Background

To recap the requirement: there was a reasonably large database of accounts with an ID and customer name. Browsing the database, one could see by eye that there were many, many duplicates. Using String Grouper (with the added functionality to include IDs in the matching process) it is possible to get, insanely quickly, a list of IDs of duplicate records, which are then imported back into the database to join with the original table. It's all in the tutorial.

The persons responsible are now starting the clean-up necessary to remove all the duplicates, but I now want to set up a solution to ensure that when bulk imports are done into this system in the future (typically from Excel), duplicates are proactively avoided. The web app does have functionality to identify duplicates, but it is very limited, and that limitation is what got the database into having so many duplicates in the first place.

Process/Requirement

I will have a Python application which will open the Excel file, walk through the rows, do a number of checks and validations (missing data, etc.), and reject rows that are not clean. One of the checks will be for potential duplicates, i.e. the customer name already exists in the database. I have access to the database in PostgreSQL.

PostgreSQL Search Options

PostgreSQL has a number of search options such as levenshtein, soundex and trigrams (pg_trgm), as well as tsvector. I have tried the first three, but one way or another they don't give me the same deduplication functionality I have seen with String Grouper using cosine similarity.

Possibility of Using String Grouper?

So I did a simple test in String Grouper, importing the existing data (id & name) into a pandas DataFrame accounts and then making a one-row DataFrame duplicates to hold the string being queried.

import pandas as pd
from string_grouper import match_strings

accounts = pd.read_csv('data/accounts.tsv', delimiter='\t', lineterminator='\n')

duplicates = pd.DataFrame(
    [
        ('1111', 'Some Company Ltd')
    ],
    columns=('duplicate_id', 'duplicate_name')
)

matches = match_strings(
    accounts['name'],
    duplicates['duplicate_name'],
    master_id=accounts['id'],
    duplicates_id=duplicates['duplicate_id'],
    ignore_index=True,
    max_n_matches=20,
    min_similarity=0.8
)

It works. However, I see that each time I call the String Grouper match_strings function it takes some time, because, I believe, it is rebuilding the corpus on every call, and I have to call the function for each row I step through in the tabular data being imported. Obviously this is not efficient.

Question

So, to the question: is it possible to run String Grouper such that the corpus is built only once for the duration of the running Python application? So:

  1. Initialise the Python program
  2. Load the DataFrame accounts with the many rows
  3. Build the corpus — once only
  4. Loop:
    1. Step through the rows, getting the candidate (potential duplicate) string name
    2. Run match_strings for each row
  5. End loop

Thanks in advance for any help/guidance.

p.s. This might be good functionality to build into the package; it is likely a fairly common use case.

Elfilali-Taoufiq commented 2 years ago

hello,

any update on this issue, please?

ParticularMiner commented 2 years ago

Hi @Elfilali-Taoufiq ,

Sorry, we forgot to close this issue. It has now been fixed, as you'll find in the docs (follow this link):

... New in version 0.6.0: each of the high-level functions listed above also has a StringGrouper method counterpart of the same name and parameters. Calling such a method of any instance of StringGrouper will not rebuild the instance's underlying corpus to make string-comparisons but rather use it to perform the string-comparisons. The input Series to the method (master, duplicates, and so on) will thus be encoded, or transformed, into tf-idf matrices, using this corpus. For example:

from string_grouper import StringGrouper

# Build a corpus using strings in the pandas Series master:
sg = StringGrouper(master)
# The following method-calls will compare strings first in
# pandas Series new_master_1 and next in new_master_2
# using the corpus already built above without rebuilding or
# changing it in any way:
matches1 = sg.match_strings(new_master_1)
matches2 = sg.match_strings(new_master_2)

ParticularMiner commented 2 years ago

@Elfilali-Taoufiq

I admit these new features are not sufficiently documented, so please feel free to ask for further clarification if you need it.

asnorthrup commented 2 years ago

Hi - awesome library,

@ParticularMiner - I'm having a little trouble understanding this functionality. It sounds a lot like what I was hoping for, but I have some doubts after playing with it. Let's say master has ~100k records: sg = StringGrouper(master)

I was envisioning functionality where, in the case that the series to find matches for has 100 records, you could then say something like: sg.match_strings(series_to_match_against_master)

to quickly find matches in master (assuming master is much larger than the series of strings to match against). This would be really helpful if you had only one string to match against master (say, user input).

It seems like this function, if you pass in a master and duplicates (e.g. sg.match_strings(master=master, duplicates=series_to_match_against)), rebuilds the tf-idf matrices using the new master, re-encodes both master and duplicates, and then finds the matches (though it does skip rebuilding the corpus off of master). Is there any way to avoid rebuilding the tf-idf matrix and re-encoding master, i.e. just encode duplicates and find matches? Maybe something like sg.match_strings(duplicates=series_to_match_against_master)?

ParticularMiner commented 2 years ago

Hi @asnorthrup !

Thank you for your interest in string_grouper.

I see what you mean: your use case involves multiple small queries of duplicate-strings against a large master dataset which remains constant across the queries.

I can also see how one might benefit from "caching" in RAM the entire "master matrix" during such queries. But then, obviously, that matrix would also need to be small enough to fit completely into RAM.

You are correct in your observation that we did not anticipate this use case. String Grouper's current performance is simply based on the observation that, for large datasets, it is faster (by orders of magnitude) to rebuild a matrix from an already built corpus than it is to rebuild the corpus from scratch and then rebuild the matrix from it. Still, I agree with you that in your use case (with a small enough master dataset) there may be still more performant paths to explore.
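The build-once/reuse-many-times idea behind this can be illustrated with a toy, stdlib-only index (this is not string_grouper's actual implementation, which uses a tf-idf vectorizer and sparse matrix multiplication): fit a character-trigram vocabulary and IDF weights once on the master list, then encode each incoming query against that fixed vocabulary without refitting anything:

```python
import math
from collections import Counter


def ngrams(s, n=3):
    """Lowercase, strip non-alphanumerics, and split into character n-grams."""
    s = ''.join(ch for ch in s.lower() if ch.isalnum())
    return [s[i:i + n] for i in range(max(len(s) - n + 1, 1))]


class TrigramIndex:
    """Toy tf-idf index: fit the corpus once, then encode every query
    against the *fixed* vocabulary and IDF weights."""

    def fit(self, master):
        self.master = list(master)
        docs = [Counter(ngrams(s)) for s in self.master]
        n_docs = len(docs)
        df = Counter(g for d in docs for g in d)  # document frequencies
        self.idf = {g: math.log((1 + n_docs) / (1 + c)) + 1
                    for g, c in df.items()}
        self.vectors = [self._encode(d) for d in docs]
        return self

    def _encode(self, counts):
        # tf-idf weights restricted to the fitted vocabulary, L2-normalised.
        v = {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        return {g: w / norm for g, w in v.items()}

    def query(self, s, min_similarity=0.8):
        # No refitting here: the query reuses self.idf built in fit().
        q = self._encode(Counter(ngrams(s)))
        hits = []
        for name, v in zip(self.master, self.vectors):
            sim = sum(w * v.get(g, 0.0) for g, w in q.items())  # cosine
            if sim >= min_similarity:
                hits.append((name, sim))
        return hits


index = TrigramIndex().fit(['Some Company Ltd', 'Another Business Inc'])
hits = index.query('Some Company Ltd.')  # each query reuses the fitted corpus
```

In this sketch only the small query string is encoded per call; the expensive fitting work happens once, which is exactly the saving being discussed for the large-master, many-small-queries case.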

See #83 or re-install string_grouper using

pip install git+https://github.com/ParticularMiner/string_grouper.git@cache

for a tentative fix.