Bergvca / string_grouper

Super Fast String Matching in Python
MIT License

Remaining ID functionality added #25

Closed ParticularMiner closed 3 years ago

ParticularMiner commented 3 years ago

cc: @justasojourner

Hi @Bergvca,

As promised in Issue #22, I've added the IDs to the remaining 2 functions. I hope it matches your expectations.

Cheers!

justasojourner commented 3 years ago

The Problem

When I hear the word 'grouped' in the context of a database table, which is what we are considering here, that says to me that there will be data loss.

If we start from the explanation for the String Grouper group_similar_strings function:

For example the input series: [foooo, foooob, bar] will return [foooo, foooo, bar]. Here foooo and foooob are grouped together into group foooo because they are found to be similar.

That means the unique value 'foooob' disappears, it is 'grouped' into 'foooo' (it no longer exists). This is fine if we are talking about matching strings but we cannot, by definition, group two unique database IDs into one.

Example

I will use the simple data set from the tutorial I wrote, it has enough rows to illustrate the point.

id name
0 AA098762D hyper startup inc.
1 BB099931J hyper-startup inc.
2 CC082744L hyper startup incorporated
3 HH072982K hyper hyper inc.

So, let us assume a min_similarity setting of 0.7, and that indices 0 through 2 are matched as having similar strings in the name column. If the value 'hyper startup inc.' is considered to be the primary, then the other values are 'grouped' into it. But the primary record (index 0) has the (database) ID 'AA098762D' along with its value 'hyper startup inc.'. So if the two other values ('hyper-startup inc.', 'hyper startup incorporated') are grouped into it then what happens to their (database) IDs?

Result 1 — Do we get this?

id name
0 AA098762D hyper startup inc.
3 HH072982K hyper hyper inc.

Result 2 — Or this?

id name
0 AA098762D hyper startup inc.
1 BB099931J hyper startup inc.
2 CC082744L hyper startup inc.
3 HH072982K hyper hyper inc.

Result 3 — Or this?

id name
0 AA098762D hyper startup inc.
1 AA098762D hyper-startup inc.
2 AA098762D hyper startup incorporated
3 HH072982K hyper hyper inc.

If we follow the current logic of the group_similar_strings function I would assume we get Result 2. But then the database IDs 'BB099931J' & 'CC082744L' no longer match their original values.

Question

So my question is — if (database) IDs are included, and a grouping of similar strings is done, then in what form will rows be returned? Further, if Result 2 is returned and the original values (names) overwritten, then how will that resulting data set be utilised back in the original database? I have an idea how, but I am interested to hear how you both envisage it will work.

Also it is assumed that the row having ID 'AA098762D' is the lead, the most desired, record having the preferred name. But what if ID 'CC082744L' has the better (more desirable) name, or ID 'BB099931J'?

I think it needs to be clearly understood what is the expected/desired returned data — and the coding done to that goal (if it hasn't yet been).

ParticularMiner commented 3 years ago

Hi @justasojourner,

Just saw your message. Thanks for the clear illustration of the problem.

I'm assuming here that the idea of "deduplication" is to remove unnecessary pieces of data that may mislead the querier into thinking that those pieces are each unique when in fact they ought to refer to the same entity. For instance, the database maintainer may dislike the fact that there are multiple different IDs (as in your example input data) referring to the same customer, only because the one who input the data committed a typo and failed to recognize that his/her new customer entries were in fact not new but already existed in the database.

So the actual solution should be closer to Result 3 in your message, though, depending on how you use the String Grouper functions, data doesn't need to be lost. This is because the user may choose not to immediately replace the original ID column with the group IDs obtained from String Grouper, but instead create a new column, say "deduplicated CID", alongside the original ID column, like this:

transactions["deduplicated CID"] = group_similar_strings(transactions["customer name"], transactions["customer ID"])

rather than this:

transactions["customer ID"] = group_similar_strings(transactions["customer name"], transactions["customer ID"])

Next, the user may use his/her own means to check whether each of these new group IDs is truly appropriate for its row and replace it with a proper alternative if necessary. The particular form this check/correction would take is beyond the scope of String Grouper's objectives. But clearly, only after that check/correction is done will the user be confident enough to replace the original ID column with the group ID column, thus avoiding data loss.
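Purely as an illustration of that check step (a sketch only, assuming the two-argument call described above returns a Series of suggested group IDs; the exact grouping of course depends on the similarity settings, and none of this is part of String Grouper itself):

import pandas as pd
from string_grouper import group_similar_strings

transactions = pd.DataFrame({
    "customer ID": ["AA098762D", "BB099931J", "CC082744L", "HH072982K"],
    "customer name": ["hyper startup inc.", "hyper-startup inc.",
                      "hyper startup incorporated", "hyper hyper inc."],
})

# Suggested group IDs go into a *new* column; the original IDs stay untouched.
transactions["deduplicated CID"] = group_similar_strings(
    transactions["customer name"], transactions["customer ID"]
)

# Rows whose suggested group ID differs from the original database ID
# are the candidates for manual review before any real clean-up.
to_review = transactions[transactions["deduplicated CID"] != transactions["customer ID"]]
print(to_review)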

I suppose the power of using String Grouper here is to quickly provide the user with a starting point for data clean-up by quickly identifying groups of ID entries that probably should have been the same based on similarities in spelling between their corresponding company-name entries.

HTH

ParticularMiner commented 3 years ago

Furthermore, I'm pretty sure a similar result could be achieved by hand (that is, without String Grouper's grouping function) in a laborious roundabout way using the output of the match_strings function. But it is possible to miss certain matches as warned in README.md (now paraphrased here):

... for example, the output of match_strings may show that A matches B; and B matches C; but the same output may not show that A matches C; nor C matches A (where A, B and C are data values in the column of strings to be grouped).

Yet it is quite possible that A and C ought to match too.

Note that in this example, the "distance" between A and C is just one data value, namely B. The grouping function of String Grouper is meant to resolve all such missing associativities, no matter how distant (even by more than one data value), in one go, thus sparing the user time and effort.
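To make the missing-associativity point concrete, here is a toy sketch (deliberately not String Grouper's actual implementation) showing how pairwise matches such as A-B and B-C can be merged into one group even though A-C never appears as a direct match:

# Toy union-find over pairwise matches (illustration only).
matches = [("A", "B"), ("B", "C")]   # A-C is never reported directly

parent = {}

def find(x):
    # Follow parent links up to the group representative.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for left, right in matches:
    union(left, right)

print({s: find(s) for s in ("A", "B", "C")})   # all three share one representative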

justasojourner commented 3 years ago

Hi

Again I have to say that whenever I hear the word "change" with respect to IDs (meaning database IDs) I have a fundamental issue, because in the use case we are considering, where String Grouper is used to find duplicates in a database, database IDs are immutable. They absolutely cannot be changed, or created arbitrarily (outside of the source database system). So any point I make is predicated on that fundamental fact.

At the risk of labouring the point, and explaining the obvious, I will explain this graphically.

A database system (attempts to) match the real world (see the database-and-grouping diagram), so we have a primary key (PK) which is required to be not null and unique. When the database starts out nice and clean we have this:

id name
0 BB016741P Mega Enterprises Corporation
1 CC082744L Hyper Startup Incorporated
2 BB760431Y A.B.C. Enterprises Incorporated

But as time goes on and duplicates are introduced in error we end up with this:

id name
0 BB016741P Mega Enterprises Corporation
1 CC082744L Hyper Startup Incorporated
2 AA098762D Hyper Startup Inc.
3 BB099931J Hyper-Startup Inc.
4 HH072982K Hyper Hyper Inc.

The correct database entity matching the real-world entity is the record having the ID = 'CC082744L', all the rest ('AA098762D', 'BB099931J', 'HH072982K') are pollution, they have to go. But... they have to be removed (dealt with) in the database, not String Grouper. String Grouper's role is to highlight the fact that the name of the duplicated customer record is close to the name of the legitimate record while providing the ID of the duplicate record - it is not to touch (change/delete/add) the ID.

If String Grouper 'groups', that is removes, database IDs, then data is lost.

I have no problem with finding matches that are found via intermediate values, i.e. A matches B and B matches C, therefore there is a match between A and C.

... for example, the output of match_strings may show that A matches B; and B matches C; but the same output may not show that A matches C; nor C matches A (where A, B and C are data values in the column of strings to be grouped).

Great, the more duplicates found the better.

But this cannot be at the risk of touching IDs and/or unlinking IDs from their original strings (the name field).

Because it gets worse... the master data (Customers) is linked to by a transaction table, Orders, which has data like this:

id customer_id sales_amt
1 5672 CC082744L 12,345
2 5699 AA098762D 123,456
3 6038 BB099931J 1,234,567
4 6155 HH072982K 12,345,678

So the database record matching the real-world customer, 'CC082744L', has revenue of only € 12,345 whereas € 13,703,701 goes to all the incorrect duplicates.

So when de-duplication is done in String Grouper, all the unique IDs of the records having duplicate (invalid) names have to come across to the database, because there will be the need to do cleanup across multiple tables.

Whereas it appears when you say this:

This is because the user may choose not to immediately replace the original ID column with the group IDs obtained from String Grouper,

that String Grouper is creating IDs? IDs cannot be created in String Grouper, they are immutable and come from the source database.

So, having said all the above, my question still is (if after adding IDs to the remaining two functions) how will this be done, that is what will be the function calling convention, and what form will the returned data set have?

ParticularMiner commented 3 years ago

It seems you have just said in a different way what I mentioned before. Indeed String Grouper need not attempt to change any database keys by itself (in fact, it is generally safer not to let it do so), but rather it may suggest changes to keys.

In this case, String Grouper would be available to the database administrator himself/herself who has the rights to temporarily suspend key-immutability in order to perform the clean-up. (I’m speaking from my own experience as a database administrator many years ago before I quit to pursue a scientific career.)

The job of the administrator is to resolve all the dependencies between all the tables in the database as he/she makes the index changes during the clean-up. String Grouper is therefore for him/her a useful guiding tool.

With regards to your example transactional data, obviously after the administrator is doubly sure that customer IDs AA098762D, BB099931J, and HH072982K are actually the same as the customer with ID CC082744L (as suggested by String Grouper), he/she might find it expedient to temporarily suspend key-immutability and the inter-relationships between tables wherever necessary throughout the entire database in order to effect the ID changes wherever necessary in one fell swoop using his/her own tools. After this surgical operation, the customer IDs AA098762D, BB099931J, and HH072982K may be safely removed from the customers table. This is what “clean-up” sometimes means.

Obviously, such an operation must occur within a period during which no one else is allowed to make entries into the database (or it is simply taken offline), and more importantly, only after a backup is made of the database (just in case it later turns out that String Grouper gave a wrong suggestion during the cleanup).

But you know, it is not only the database administrator who might require String Grouper. A database developer too may find it useful in creating data reports, say one that shows the total revenue for a particular customer, taking into account all the possible IDs that the customer might have in the database (through no fault of his/hers). In this case, grouping the transactional data by String Grouper’s suggestions inserted in a new column in the transaction table would rather easily provide the necessary sums.
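To sketch that reporting use (illustrative only; the data and the suggested mapping are made up, and the suggestions are assumed to have already been checked by a human):

import pandas as pd

# Orders table from the earlier example.
orders = pd.DataFrame({
    "customer_id": ["CC082744L", "AA098762D", "BB099931J", "HH072982K"],
    "sales_amt": [12_345, 123_456, 1_234_567, 12_345_678],
})

# Verified suggestion: every variant ID points to the group representative's ID.
suggested_group = {
    "CC082744L": "CC082744L",
    "AA098762D": "CC082744L",
    "BB099931J": "CC082744L",
    "HH072982K": "CC082744L",
}

# Put the suggestion in a new column (the original IDs stay untouched)
# and total the revenue per real-world customer.
orders["group_id"] = orders["customer_id"].map(suggested_group)
print(orders.groupby("group_id")["sales_amt"].sum())   # CC082744L: 13,716,046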

ParticularMiner commented 3 years ago

You have already seen one example of a function call in one of my previous messages:

transactions["deduplicated CID"] = group_similar_strings(transactions["customer name"], transactions["customer ID"])

The function group_similar_strings may be called as previously described in README.md:

  1. output = group_similar_strings(master)

    Now, it may also be called as follows:

  2. output = group_similar_strings(master, master_id)

where parameters master and master_id have their usual meanings. The usual optional keyword arguments may be appended to these parameters in either call as desired. output is always a pandas Series with a default index and the same length as master, except that in the first case strings are returned while in the second case their corresponding IDs from master_id are returned.

A similar pattern exists for the match_most_similar function:

  1. output = match_most_similar(master, duplicates)
  2. output = match_most_similar(master, duplicates, master_id, duplicates_id)

In both cases output is a pandas Series the same length as duplicates. In the first case, output has values from both master and duplicates as previously described in README.md, while in the second case their corresponding IDs from master_id and duplicates_id are returned.
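A minimal sketch of the two match_most_similar forms (with made-up data; per the description above, the first form returns strings and the second the corresponding IDs):

import pandas as pd
from string_grouper import match_most_similar

master = pd.Series(["Hyper Startup Incorporated", "Mega Enterprises Corporation"])
master_id = pd.Series(["CC082744L", "BB016741P"])
duplicates = pd.Series(["Hyper Startup Inc.", "Mega Enterprises Corp.", "Unrelated Company"])
duplicates_id = pd.Series(["AA098762D", "XX000000X", "YY111111Y"])

# Form 1: for each duplicate, the most similar master string
# (or the duplicate itself if nothing in master is similar enough).
best_strings = match_most_similar(master, duplicates)

# Form 2: the same result, but expressed as IDs from master_id/duplicates_id.
best_ids = match_most_similar(master, duplicates, master_id, duplicates_id)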

Bergvca commented 3 years ago

Agreed with @ParticularMiner that the output of the grouping functions should be used as a mapping table.

The output would be something like:

id name name_deduped group-id
BB016741P Mega Enterprises Corporation Mega Enterprises Corporation BB016741P
CC082744L Hyper Startup Incorporated Hyper Startup Incorporated CC082744L
AA098762D Hyper Startup Inc. Hyper Startup Incorporated CC082744L
BB099931J Hyper-Startup Inc. Hyper Startup Incorporated CC082744L
HH072982K Hyper Hyper Inc. Hyper Startup Incorporated CC082744L

How the user then uses this mapping table is up to them of course.
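For instance, a mapping table of that shape could be assembled with the calls discussed so far (a sketch, assuming the Series-returning behaviour described above; the string call supplies name_deduped and the ID call supplies group-id):

import pandas as pd
from string_grouper import group_similar_strings

customers = pd.DataFrame({
    "id": ["BB016741P", "CC082744L", "AA098762D", "BB099931J", "HH072982K"],
    "name": ["Mega Enterprises Corporation", "Hyper Startup Incorporated",
             "Hyper Startup Inc.", "Hyper-Startup Inc.", "Hyper Hyper Inc."],
})

mapping = customers.copy()
mapping["name_deduped"] = group_similar_strings(customers["name"])                 # group strings
mapping["group-id"] = group_similar_strings(customers["name"], customers["id"])    # group IDs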

Of course the StringGrouper doesn't always return the correct results, which is why the add_match / remove_match functions exist.

justasojourner commented 3 years ago

The output would be something like:

id name name_deduped group-id
BB016741P Mega Enterprises Corporation Mega Enterprises Corporation BB016741P
CC082744L Hyper Startup Incorporated Hyper Startup Incorporated CC082744L
AA098762D Hyper Startup Inc. Hyper Startup Incorporated CC082744L
BB099931J Hyper-Startup Inc. Hyper Startup Incorporated CC082744L
HH072982K Hyper Hyper Inc. Hyper Startup Incorporated CC082744L

That's what I was looking for. As long as the (database) IDs are kept matched to their original strings and the IDs are not changed in any way, there is no problem. Of course in database terms we already have pretty much the same functionality in match_strings, but more choices are fine.

However @Bergvca I have noticed separately that in the commit comments my name and personal home email address were added. I don't know how they got in there. Can you remove them?

p.s. I just did some research and it appears that if someone pushes commits to GitHub with a local Git email address different from their GitHub account email, then GitHub treats that as a second person and won't hide the email address. I added my home email address to my GitHub account but my name and personal email are still showing in the commit message.

justasojourner commented 3 years ago

@ParticularMiner - your personal email address is also showing on the pull request you submitted.

ParticularMiner commented 3 years ago

Hi @Bergvca ,

Thank you for your email. Succinctly put!

I've just resolved the conflicts that arose between this branch and yours after your last PR acceptance. This branch is now ready to merge with your master branch.

Now all that's left is the documentation for this additional ID functionality -- assuming of course that you approve of these code changes. I wonder, is @justasojourner interested in writing the documentation or should I write it this time?

justasojourner commented 3 years ago

Hi @Bergvca - @ParticularMiner and I have done detailed investigation on this. Apart from our names and personal email addresses appearing on the latest commit messages we see them in every commit (metadata).

We (I) were under the impression that if we selected email privacy in GitHub settings our emails would not be pushed to the remote. It seems not... I did have some conversation with GitHub support, and though I can see what they are saying I did give them feedback that the system is rather ambiguous/misleading. But anyway...

We really would like to have our personal data removed; at the same time we would (apologetically) not wish to disrupt you too much.

There are two options:

  1. I have been in touch with GitHub support; they have given me a script using git filter-branch that can edit all commits to remove sensitive information. OK... I do read on the Git website that they say to be careful with that command. You would have to execute the command and other tasks as the repo admin.
  2. You run git reset --hard <commit> to rewind back to the commit of 12 Oct 2020 and we (ParticularMiner and I) do the task of redoing a fork and submitting things again (commit -> PR), which can also be done as one commit from each of us. This will have the added benefit of making commits into the master branch more 'pretty'.

We are perfectly happy to do option 2, which will be less work for you. If you were OK doing option 1, I can forward you the email from GitHub support with full instructions.

Please let us know what you think/would like to do.

ParticularMiner commented 3 years ago

Hi @Bergvca and @justasojourner,

So, at the moment, for the sake of backward compatibility, the functions group_similar_strings and match_most_similar each return a nameless index-less pandas Series containing either strings or IDs depending on how they are called (see my previous comment).

Is this sufficient? Or would you prefer both IDs and strings to be output together instead of only IDs?

One way for the functions to do this and still return a Series would be to have them index the returning Series of strings with the IDs. Then, to directly insert the output Series as 2 new columns in an existing DataFrame, a typical group_similar_strings call would look like the following:

...
# If you want to insert the output of the
# function call directly into new columns
# of an existing DataFrame, then first reset
# the index of the DataFrame:
transactions = transactions.reset_index()

# Next, reset the index of the function's output:
transactions[["group-id", "name_deduped"]]  = \
    group_similar_strings(transactions["customer name"], transactions["customer ID"]).reset_index()
...

Note the 2nd reset_index() method call at the end (used to convert the output Series into a DataFrame).
Here the variable transactions is a pandas DataFrame with columns "customer name" and "customer ID" before the call; and it acquires 2 more columns "group-id" and "name_deduped" after the call:

customer ID customer name group-id name_deduped
BB016741P Mega Enterprises Corporation BB016741P Mega Enterprises Corporation
CC082744L Hyper Startup Incorporated CC082744L Hyper Startup Incorporated
AA098762D Hyper Startup Inc. CC082744L Hyper Startup Incorporated
BB099931J Hyper-Startup Inc. CC082744L Hyper Startup Incorporated
HH072982K Hyper Hyper Inc. CC082744L Hyper Startup Incorporated

This is not how the functions work at the moment. But would the above usage be acceptable?

Bergvca commented 3 years ago

Hi @justasojourner,

I'm fine with doing a git reset --hard to remove the sensitive information, if you both could re-create the pull requests. One thing to keep in mind is that I already pushed a new version to PyPI. It would be nice if there was still a way to be able to go back to that version (0.2.2). I didn't add a label to the commit that was pushed to PyPI (I will do this in the future), but it was based on the first PR you created (commit: 1eb924bb243334c0772b9f3387e05c829e5f9672). Is it possible for you to create the PRs in the same order? If not, that is also fine. We will just create a new version, put it on PyPI, and label the commit that is linked to that new version.

Bergvca commented 3 years ago

Hi @ParticularMiner ,

Good point, I think for most users it would be best to have both the (group) ID and group string in the output, when using the optional ID column. From what I understand, your proposal is to add the IDs as the Series index. I think that makes sense.

What prevents us from just outputting a dataframe with two columns when the optional ID field is input? It feels less "clean", because we have two different output types, but it happens more often in the Pandas world. E.g. .loc sometimes outputs a Series and sometimes a DataFrame.

Funnily enough this relates to my proposal here, where I suggested setting the ID column as the index of the input, so that it would remain in the output.

ParticularMiner commented 3 years ago

Hi @Bergvca,

I see your previous comment indeed hints at this.

I've been thinking the C/C++ way all along -- that's my original background. Hence the strong typing ideals. It's taking me some time to get used to Python's dynamic typing.

Anyway, to return a Series in one case and a DataFrame in another shouldn't be too hard, I suppose. I'll try it.

justasojourner commented 3 years ago

Hi @Bergvca - thanks for understanding, no problem to do PRs in sequence. I notice though that the last two are not listed as PRs but just commits and you made us contributors. Were 67ac52b & 23806f7 PRs or straight commits??

Anyway we will do all the preparation (to make sure nothing is lost) and share it with you, and then you can do the reset and we post the relevant changes (PRs). We'll do the prep tomorrow our time so it's ready for you in your time zone tomorrow. Will confirm tomorrow that we are ready to proceed.

justasojourner commented 3 years ago

Basically the plan would be to save the code at the commit/PR snapshot points as OS files. Then delete our forked repos, then you do git reset --hard to go back to your commit of 12 October.

Then we 're-fork' from your repo, make a branch, copy the source files back over to the branch, do just one commit (no need for many) and raise a PR. Then the other two PRs can be done similarly, one by one.

ParticularMiner commented 3 years ago

Hi @Bergvca and @justasojourner,

It's done.

As discussed, the functions group_similar_strings and match_most_similar can now (depending on how they are called) each return either a nameless index-less pandas Series containing strings or a DataFrame containing both strings and IDs.

To directly insert the output DataFrame as 2 new columns in an existing DataFrame, a typical group_similar_strings call would look like the following:

...
# If you want to insert the output of the
# function call directly into new columns
# of an existing DataFrame, then first reset
# the index of the DataFrame, if its index has
# been set:
transactions = transactions.reset_index()

# Next, call the function:
transactions[["group-id", "name_deduped"]]  = \
    group_similar_strings(transactions["customer name"], transactions["customer ID"])
...

Here the variable transactions is a pandas DataFrame with columns "customer name" and "customer ID" before the call; and it acquires 2 more columns "group-id" and "name_deduped" after the call:

customer ID customer name group-id name_deduped
BB016741P Mega Enterprises Corporation BB016741P Mega Enterprises Corporation
CC082744L Hyper Startup Incorporated CC082744L Hyper Startup Incorporated
AA098762D Hyper Startup Inc. CC082744L Hyper Startup Incorporated
BB099931J Hyper-Startup Inc. CC082744L Hyper Startup Incorporated
HH072982K Hyper Hyper Inc. CC082744L Hyper Startup Incorporated
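For comparison (a minimal sketch, and the new column name here is just for illustration), calling the function without the ID column keeps the old backward-compatible behaviour and still returns a plain Series of group-representative strings:

# Without an ID column the return type stays a Series, as before.
transactions["name_deduped_only"] = group_similar_strings(transactions["customer name"])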

Cheers!

justasojourner commented 3 years ago

Cleanup / reset of repo

Preparation

The goal of the preparation will be to ensure that no code/documentation is lost. The preparation part needs only to be done by @justasojourner and @ParticularMiner. The preparation steps should ensure that code/documents that were committed via various PRs can be submitted again in the same 'sequence'.

  1. @justasojourner will save to the OS the files relevant to his first and second PRs (which, arguably, could be reduced to one PR)
  2. @ParticularMiner will save to the OS the files relevant to his PR, but please note he has made changes after the commit merged in the PR by @Bergvca. If @Bergvca agrees with @ParticularMiner's later changes then @ParticularMiner can submit them as one commit/PR. If @Bergvca would like to still review the later changes then @ParticularMiner can save two versions of the files, submit the files as at the accepted PR, and then submit the later changes in another PR.
  3. @justasojourner and @ParticularMiner will then delete their forks of @Bergvca's repo. We will also rename our local directories to keep as backups until the task is finished.

@Bergvca - can you advise if you want me to do two commits/PRs, or if I can fold the documentation changes into the same PR as the initial coding change.

Procedure

In the procedure @justasojourner will commit/PR first. We will avoid clashes because we will not be touching the files each other is changing.

During the process @justasojourner and @ParticularMiner will double check that their Git name & email config are correct. Ha ha.

  1. @justasojourner and @ParticularMiner confirm they have done preparation.
  2. @Bergvca does a hard reset (or rolls back in whichever way he wishes) of the repo back to the 12 October 2020 commit (making sure no sensitive data can be put back).
  3. @justasojourner only (at this point) will fork the repo again and clone to local Git repo
  4. @justasojourner will create a branch 'add_id' and push branch to remote.
  5. @justasojourner copies files for commit/PR that takes the repo up to the previous first PR (and includes the document/tutorial files from the second PR if @Bergvca agrees) and commits and pushes to repo/branch
  6. @justasojourner raises a PR
  7. @Bergvca to review the changes and merge into master if they are satisfactory. At this point:
    1. @Bergvca decides how he wishes to approach the PyPi question.
    2. The repo is at the point that @ParticularMiner can add his changes.
  8. IF @Bergvca wants @justasojourner to do two commits then @justasojourner will do a second commit/PR for the document/tutorial changes. @Bergvca will then review/accept/merge. Either way it will not affect @ParticularMiner (we are not touching the files each other is changing), who can start his process.
  9. @ParticularMiner forks the repo which now has the first PR merged.
  10. @ParticularMiner clones to local repo
  11. @ParticularMiner creates branch and pushes to remote
  12. @ParticularMiner layers his changed files (code + testing) into his local repo branch, commits and pushes to remote branch
  13. @ParticularMiner opens PR with his changes
  14. @Bergvca reviews/accepts/merges
  15. IF @Bergvca would like @ParticularMiner to follow the same process (same set of changed files) as with the original commit/PR, then @ParticularMiner will submit a second PR for the changes made after his first PR.

Then we should be good! So, now first a Bavaria (beer).

justasojourner commented 3 years ago

@Bergvca - I've done my prep. Copies of the files have been made and both local & remote repos are gone.

ParticularMiner commented 3 years ago

Hi @justasojourner, @Bergvca,

Whenever you are ready!

I have saved OS file snapshots of the project at

  1. The current stage of this very Pull Request #25 ("Remaining ID functionality added").
  2. The stage of Pull Request #23 ("Updated routine for increased performance")

I'll repeat the 'Open Pull Request' procedure to restore the above wherever required.

For completeness, I reiterate that @justasojourner is also able to

  1. restore the project to its last stage ("Add functionality to include an ID column to the string matching.": #19) just before @Bergvca last uploaded it to PyPi.
  2. restore all the new documentation and tutorials committed after the PyPi upload ("Update documentation for added ID parameter & tutorial written": #24).

Bergvca commented 3 years ago

Haha - wow, now that I see this I think the script option might have been easier... Anyway, the repo has been restored to the October commit! Go ahead and create the PRs.

I'm fine with putting the documentation in one of the other PRs, whatever is easiest for you.

justasojourner commented 3 years ago

OK, mine is all double checked and clean and the master branch of my fork

https://github.com/justasojourner/string_grouper

has been merged/synced back to

https://github.com/Bergvca/string_grouper

using an upstream remote.

Bergvca commented 3 years ago

Ok, all is done now I believe. Thanks for setting it up!

ParticularMiner commented 3 years ago

Thanks @Bergvca. You've been quite patient with us!