justasojourner opened this issue 3 years ago (status: Open)
Hello,
any update on this issue, please?
Hi @Elfilali-Taoufiq,

Sorry, we forgot to close this issue. It has since been fixed, as you'll find in the docs (follow this link):

> … *New in version 0.6.0:* each of the high-level functions listed above also has a `StringGrouper` method counterpart of the same name and parameters. Calling such a method of any instance of `StringGrouper` will not rebuild the instance's underlying corpus to make string-comparisons but rather use it to perform the string-comparisons. The input Series to the method (`master`, `duplicates`, and so on) will thus be encoded, or transformed, into tf-idf matrices using this corpus. For example:

```python
# Build a corpus using strings in the pandas Series master:
sg = StringGrouper(master)

# The following method-calls will compare strings first in
# pandas Series new_master_1 and next in new_master_2,
# using the corpus already built above without rebuilding or
# changing it in any way:
matches1 = sg.match_strings(new_master_1)
matches2 = sg.match_strings(new_master_2)
```
@Elfilali-Taoufiq I admit these new features are not sufficiently documented, so please feel free to ask for further clarification if you need it.
Hi, awesome library!

@ParticularMiner, I'm having a little trouble understanding this functionality. It sounds a lot like what I was hoping for, but I have some doubts after playing with it. Let's say `master` has ~100k records:

```python
sg = StringGrouper(master)
```

I was envisioning functionality where you could then say something like the following, in the case that the series to find matches for has 100 records:

```python
sg.match_strings(series_to_match_against_master)
```

to quickly find matches in `master` (assuming `master` is much larger and the series of strings to match against is much smaller). This would be really helpful if you only had one string to match against `master` (say, user input).

It seems like this function, if you pass in a master and duplicates (e.g. `sg.match_strings(master=master, duplicates=series_to_match_against)`), rebuilds the tf-idf matrix using the new master, re-encodes `master` and `duplicates`, and then finds the matches (though it does skip rebuilding the corpus from `master`). Is there any way to use this without rebuilding the tf-idf matrix and re-encoding `master`, i.e. just encode `duplicates` and find the matches? Maybe something like `sg.match_strings(duplicates=series_to_match_against_master)`?
Hi @asnorthrup!

Thank you for your interest in `string_grouper`.

I see what you mean: your use case involves multiple small queries of duplicate-strings against a large `master` dataset which remains constant across the queries.

I can also see how one might benefit from "caching" the entire "master matrix" in RAM during such queries. But then, obviously, that matrix would also need to be small enough to fit completely into RAM.

You are correct in your observation that we had not anticipated this use case. String Grouper's current performance rests on the observation that, for large datasets, it is faster (by orders of magnitude) to rebuild a matrix from an already built corpus than it is to rebuild the corpus from scratch and then rebuild the matrix from that. Still, I agree with you that in your use case (with a small enough `master` dataset) there may be still more performant paths yet to be explored.
See #83, or re-install `string_grouper` using

```shell
pip install git+https://github.com/ParticularMiner/string_grouper.git@cache
```

for a tentative fix.
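To illustrate why reusing the corpus pays off, here is a minimal, self-contained sketch of the build-once, encode-many pattern. It uses plain-Python character trigrams and term-frequency vectors as a stand-in for string_grouper's actual tf-idf machinery; all function names and the similarity measure here are illustrative, not string_grouper's API:

```python
from collections import Counter
import math

def trigrams(s):
    """Split a string into overlapping character trigrams."""
    s = s.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]

def build_corpus(master):
    """Build the vocabulary once from the master strings (the expensive step)."""
    vocab = set()
    for s in master:
        vocab.update(trigrams(s))
    return sorted(vocab)

def encode(corpus, strings):
    """Encode strings as term-frequency vectors over the fixed corpus (cheap)."""
    index = {t: i for i, t in enumerate(corpus)}
    vectors = []
    for s in strings:
        v = [0.0] * len(corpus)
        for t, n in Counter(trigrams(s)).items():
            if t in index:          # trigrams outside the corpus are ignored
                v[index[t]] = n
        vectors.append(v)
    return vectors

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

master = ["McDonald's Corp", "Burger King Inc", "Wendy's Co"]
corpus = build_corpus(master)           # built once
master_vecs = encode(corpus, master)    # cached in RAM

# Each small query reuses the same corpus and master vectors:
query_vecs = encode(corpus, ["McDonalds Corp."])
scores = [cosine(query_vecs[0], mv) for mv in master_vecs]
best = scores.index(max(scores))        # index of the closest master string
```

The expensive step (building `corpus` and `master_vecs`) happens once; each subsequent query only encodes its own, much smaller, batch of strings against the fixed vocabulary.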
Hi. This is the logical progression, the next step, in the clean-up process for data extracted from a database having an ID and name field. It's a follow-on from the use case documented in the first String Grouper tutorial, which I wrote.
Background
To recap the requirement back then: there was a reasonably large database of accounts with an ID and customer name. Browsing the database, one could see by eye that there were many, many duplicates. Using String Grouper (with the added functionality to include IDs in the matching process) it is possible to get, insanely quickly, a list of IDs of duplicate records, which are then imported back into the database to do a join with the original table. It's all in the tutorial.
The persons responsible are now starting the clean-up necessary to remove all the duplicates, but I now want to set up a solution to ensure that, when bulk imports (typically from Excel) are done to this system in the future, duplicates are proactively avoided. The web app does have functionality to identify duplicates, but it is very limited, and that limitation is what got the database into having so many duplicates in the first place.
Process/Requirement
I will have a Python application which will open the Excel file, walk through the rows, do a number of checks and validations (missing data etc.), and reject rows that are not clean. One of the checks will be for potential duplicates, i.e. whether the customer name already exists in the database. I have access to the database in PostgreSQL.
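A rough sketch of that validation loop might look like the following. Everything here is illustrative: the row fields, the `normalize` helper, and the exact-match duplicate test are placeholders for wherever the real similarity check (e.g. a String Grouper call) would go:

```python
import re

def normalize(name):
    """Lowercase and strip punctuation/extra spaces for a crude comparison."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()

def validate_rows(rows, existing_names, required=("id", "name")):
    """Split imported rows into clean, rejected, and probable-duplicate lists.

    `rows` are dicts read from the spreadsheet; `existing_names` are the
    customer names already in the database. The duplicate test below is a
    placeholder exact match on normalized names.
    """
    known = {normalize(n) for n in existing_names}
    clean, rejected, duplicates = [], [], []
    for row in rows:
        if any(not row.get(field) for field in required):
            rejected.append(row)        # data missing: reject the row
        elif normalize(row["name"]) in known:
            duplicates.append(row)      # probable duplicate of an existing account
        else:
            clean.append(row)
    return clean, rejected, duplicates

existing = ["ACME Corp.", "Globex Inc"]
rows = [
    {"id": 1, "name": "Acme Corp"},   # duplicate after normalization
    {"id": 2, "name": ""},            # rejected: name missing
    {"id": 3, "name": "Initech"},     # clean
]
clean, rejected, dups = validate_rows(rows, existing)
```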
PostgreSQL Search Options
PostgreSQL has a number of search options, like `levenshtein`, `soundex` and trigrams, as well as `tsvector`. I have tried the first three, but one way or another they don't give me the same duplicate-detection quality I have seen with String Grouper using cosine similarities.

Possibility of Using String Grouper?
So I did a simple test in String Grouper, importing the existing data (id & name) into a pandas DataFrame `accounts` and then making a one-row DataFrame `duplicates` to hold the string being queried. It works; however, I see that each call to the String Grouper `match_strings` function takes some time, as I believe it rebuilds the corpus each time, and I have to call the function for every row as I step through the tabular data being imported. Obviously this is not efficient.

Question
So, to the question: is it possible to run String Grouper such that the corpus is built only once for the duration of the running Python application? That is: build the corpus from the `accounts` `name` column once, then call `match_strings` for each row.

Thanks in advance for any help/guidance.
P.S. This might be good functionality to build into the package; it is likely a fairly common use case.
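The build-once workflow asked about here can be sketched as follows. This is a self-contained stand-in using stdlib trigram cosine similarity rather than string_grouper's tf-idf; the class, its names, and the threshold are all illustrative assumptions:

```python
from collections import Counter
import math

class DuplicateChecker:
    """Builds its 'corpus' (trigram encodings of the account names) once,
    then answers per-row queries cheaply, i.e. the shape of keeping one
    matcher instance alive for the application's lifetime."""

    def __init__(self, ids, names, threshold=0.5):
        self.ids = list(ids)
        self.threshold = threshold
        self._grams = [Counter(self._trigrams(n)) for n in names]  # built once

    @staticmethod
    def _trigrams(s):
        s = s.lower()
        return [s[i:i + 3] for i in range(len(s) - 2)]

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def matches(self, name):
        """Return (id, score) pairs for accounts similar to `name`."""
        q = Counter(self._trigrams(name))
        return [(i, s) for i, g in zip(self.ids, self._grams)
                if (s := self._cosine(q, g)) >= self.threshold]

# Build once at application start-up, then query per imported row:
checker = DuplicateChecker([101, 102], ["ACME Corporation", "Globex Inc"])
hits = checker.matches("Acme Corp")   # reuses the cached encodings
```

With string_grouper >= 0.6.0, the docs quoted earlier in this thread suggest the same shape: construct one `StringGrouper(accounts['name'])` at application start-up and call its `match_strings` method per imported row, so the corpus is built exactly once.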