Closed ParticularMiner closed 3 years ago
When I hear the word 'grouped' considered in the context of a database table, which is what we are considering, then in this case that says to me that there will be data loss.
If we start from the explanation for the String Grouper group_similar_strings function:
For example the input series:
[foooo, foooob, bar]
will return[foooo, foooo, bar]
. Herefoooo
andfoooob
are grouped together into groupfoooo
because they are found to be similar.
That means the unique value 'foooob' disappears, it is 'grouped' into 'foooo' (it no longer exists). This is fine if we are talking about matching strings but we cannot, by definition, group two unique database IDs into one.
I will use the simple data set from the tutorial I wrote, it has enough rows to illustrate the point.
id | name | |
---|---|---|
0 | AA098762D | hyper startup inc. |
1 | BB099931J | hyper-startup inc. |
2 | CC082744L | hyper startup incorporated |
3 | HH072982K | hyper hyper inc. |
So, let us assume a min_similarity setting of 0.7, and that indices 0 through 2 are matched as having similar strings in the name column. If the value 'hyper startup inc.' is considered to be the primary, then the other values are 'grouped' into it. But the primary record (index 0) has the (database) ID 'AA098762D' along with its value 'hyper startup inc.'. So if the two other values ('hyper-startup inc.', 'hyper startup incorporated') are grouped into it then what happens to their (database) IDs?
Result 1 — Do we get this?
id | name | |
---|---|---|
0 | AA098762D | hyper startup inc. |
3 | HH072982K | hyper hyper inc. |
Result 2 — Or this?
id | name | |
---|---|---|
0 | AA098762D | hyper startup inc. |
1 | BB099931J | hyper startup inc. |
2 | CC082744L | hyper startup inc. |
3 | HH072982K | hyper hyper inc. |
Result 3 — Or this?
id | name | |
---|---|---|
0 | AA098762D | hyper startup inc. |
1 | AA098762D | hyper-startup inc. |
2 | AA098762D | hyper startup incorporated |
3 | HH072982K | hyper hyper inc. |
If we follow the current logic of the group_similar_strings function I would assume we get Result 2. But then the database IDs 'BB099931J' & 'CC082744L' no longer match their original values.
So my question is — if (database) IDs are included, and a grouping of similar strings is done, then in what form will rows be returned? Further, if Result 2 is returned and the original values (names) overwritten, then how will that resulting data set be utilised back in the original database? I have an idea how, but I am interested to hear how you both envisage it will work.
Also it is assumed that the row having ID 'AA098762D' is the lead, the most desired, record having the preferred name. But what if ID 'CC082744L' has the better (more desirable) name, or ID 'BB099931J'?
I think it needs to be clearly understood what is the expected/desired returned data — and the coding done to that goal (if it hasn't yet been).
Hi @justasojourner,
Just saw your message. Thanks for the clear illustration of the problem.
I'm assuming here that the idea of "deduplication" is to remove unnecessary pieces of data that may mislead the querier into thinking that those pieces are each unique when in fact they ought to be referring to the same entity. For instance, the database maintainer may dislike the fact that there are multiple different IDs (as in your example input data) referring to the same customer only because the one who input the data committed a typo and failed to recognize that his/her new customer entries were in fact not new to the database but already existed in the database.
So the actual solution should be closer to Result 3 in your message, though, depending on how you use the String Grouper functions, data doesn't need to be lost. This is because the user may choose not to immediately replace the original ID column with the group IDs obtained from String Grouper, but instead create a new column, say "deduplicated CID", alongside the original ID column, like this:
transactions["deduplicated CID"] = group_similar_strings(transactions["customer name"], transactions["customer ID"])
rather than this:
transactions["customer ID"] = group_similar_strings(transactions["customer name"], transactions["customer ID"])
Next, the user may use his/her own means to check whether each of these new group IDs is truly appropriate for its row and replace it with a proper alternative if necessary. The particular form this check/correction would take is beyond the scope of String Grouper's objectives. But clearly, only after that check/correction is done will the user be confident enough to replace the original ID column with the group ID column, thus avoiding data loss.
I suppose the power of using String Grouper here is to quickly provide the user with a starting point for data clean-up by quickly identifying groups of ID entries that probably should have been the same based on similarities in spelling between their corresponding company-name entries.
HTH
Furthermore, I'm pretty sure a similar result could be achieved by hand (that is, without String Grouper's grouping function) in a laborious roundabout way using the output of the match_strings
function. But it is possible to miss certain matches as warned in README.md (now paraphrased here):
... for example, the output of
match_strings
may show that A matches B; and B matches C; but the same output may not show that A matches C; nor C matches A (where A, B and C are data values in the column of strings to be grouped).
Yet it is quite possible that A and C ought to match too.
Note that in this example, the "distance" between A and C is just one data value, namely B. The grouping function of String Grouper is meant to resolve all such missing associativities, no matter how distant (even by more than one data value), in one go, thus sparing the user time and effort.
Hi
Again I have to say that whenever I hear the word "change" with respect to IDs (which means database ids) I have a fundamental issue because, if we consider that we are looking at the use case where we are using String Grouper to find duplicates in a database, then the point is database ids are immutable. They absolutely cannot be changed, or created arbitrarily (outside of the source database system). So any point I make is predicated on that fundamental fact.
At the risk of labouring the point, and explaining the obvious, I will explain this graphically.
A database system (attempts to) match the real world: so we have a primary key (PK) which is required to be not null and unique. When the database starts out nice and clean we have this:
id | name | |
---|---|---|
0 | BB016741P | Mega Enterprises Corporation |
1 | CC082744L | Hyper Startup Incorporated |
2 | BB760431Y | A.B.C. Enterprises Incorporated |
But as time goes on and duplicates are introduced in error we end up with this:
id | name | |
---|---|---|
0 | BB016741P | Mega Enterprises Corporation |
1 | CC082744L | Hyper Startup Incorporated |
2 | AA098762D | Hyper Startup Inc. |
3 | BB099931J | Hyper-Startup Inc. |
4 | HH072982K | Hyper Hyper Inc. |
The correct database entity matching the real-world entity is the record having the ID = 'CC082744L', all the rest ('AA098762D', 'BB099931J', 'HH072982K') are pollution, they have to go. But... they have to be removed (dealt with) in the database, not String Grouper. String Grouper's role is to highlight the fact that the name of the duplicated customer record is close to the name of the legitimate record while providing the ID of the duplicate record - it is not to touch (change/delete/add) the ID.
If String Grouper 'groups', that is removes, database IDs then data is lost.
I have no problem with finding matches that are found via intermediate values, so A...B <-> B...C therefore there is a match between A and C.
... for example, the output of match_strings may show that A matches B; and B matches C; but the same output may not show that A matches C; nor C matches A (where A, B and C are data values in the column of strings to be grouped).
Great, the more duplicates found the better.
But this cannot be at the risk of touching IDs and/or unlinking IDs from their original strings (the name field).
Because it gets worse... the master data (Customers) is linked to by a transaction table, Orders, which has data like this:
id | customer_id | sales_amt | |
---|---|---|---|
1 | 5672 | CC082744L | 12,345 |
2 | 5699 | AA098762D | 123,456 |
3 | 6038 | BB099931J | 1,234,567 |
4 | 6155 | HH072982K | 12,345,678 |
So the database record matching the real-world customer, 'CC082744L', has revenue of only € 12,345 whereas € 13,703,701 goes to all the incorrect duplicates.
So when de-duplication is done in String Grouper all the unique IDs of the records having duplicate, invalid, names have to then come across to the database because there will be the need to do cleanup across multiple tables.
Whereas it appears when you say this:
This is because the user may choose not to immediately replace the original ID column with the group IDs obtained from String Grouper,
that String Grouper is creating IDs? IDs cannot be created in String Grouper, they are immutable and come from the source database.
So, having said all the above, my question still is (if after adding IDs to the remaining two functions) how will this be done, that is what will be the function calling convention, and what form will the returned data set have?
It seems you have just said in a different way what I mentioned before. Indeed String Grouper need not attempt to change any database keys by itself (in fact, it is generally safer not to let it do so), but rather it may suggest changes to keys.
In this case, String Grouper would be available to the database administrator himself/herself who has the rights to temporarily suspend key-immutability in order to perform the clean-up. (I’m speaking from my own experience as a database administrator many years ago before I quit to pursue a scientific career.)
The job of the administrator is to resolve all the dependencies between all the tables in the database as he/she makes the index changes during the clean-up. String Grouper is therefore for him/her a useful guiding tool.
With regards to your example transactional data, obviously after the administrator is doubly sure that customer IDs AA098762D, BB099931J, and HH072982K are actually the same as the customer with ID CC082744L (as suggested by String Grouper), he/she might find it expedient to temporarily suspend key-immutability and the inter-relationships between tables wherever necessary throughout the entire database in order to effect the ID changes wherever necessary in one fell swoop using his/her own tools. After this surgical operation, the customer IDs AA098762D, BB099931J, and HH072982K may be safely removed from the customers table. This is what “clean-up” sometimes means.
Obviously, such an operation must occur within a period during which no one else is allowed to make entries into the database (or it is simply taken offline), and more importantly, only after a backup is made of the database (just in case it later turns out that String Grouper gave a wrong suggestion during the cleanup).
But you know, it is not only the database administrator who might require String Grouper. A database developer too may find it useful in creating data reports, say one that shows the total revenue for a particular customer, taking into account all the possible IDs that the customer might have in the database (through no fault of his/hers). In this case, grouping the transactional data by String Grouper’s suggestions inserted in a new column in the transaction table would rather easily provide the necessary sums.
You have already seen one example of a function call in one of my previous messages:
transactions["deduplicated CID"] = group_similar_strings(transactions["customer name"], transactions["customer ID"])
The function group_similar_strings
may be called as previously described in README.md:
output = group_similar_strings(master)
Now, it may also be called as follows:
output = group_similar_strings(master, master_id)
where parameters master
and master_id
have their usual meanings. The usual optional keyword arguments may be appended to these parameters in either call as desired. output
is always a pandas Series with default index the same length as master
, except that in the first case strings are returned while in the second case their corresponding IDs from master_id
are returned.
A similar pattern exists for the match_most_similar
function:
output = match_most_similar(master, duplicates)
output = match_most_similar(master, duplicates, master_id, duplicates_id)
In both cases output
is a pandas Series the same length as duplicates
. In the first case, output
has values from both master
and duplicates
as previously described in README.md, while in the second case their corresponding IDs from master_id
and duplicates_id
are returned.
Agreed with @ParticularMiner that the output of the grouping functions should be used as a mapping table.
The output would be something like:
id | name | name_deduped | goup-id |
---|---|---|---|
BB016741P | Mega Enterprises Corporation | Mega Enterprises Corporation | BB016741P |
CC082744L | Hyper Startup Incorporated | Hyper Startup Incorporated | CC082744L |
AA098762D | Hyper Startup Inc. | Hyper Startup Incorporated | CC082744L |
BB099931J | Hyper-Startup Inc. | Hyper Startup Incorporated | CC082744L |
HH072982K | Hyper Hyper Inc. | Hyper Startup Incorporated | CC082744L |
How the user then uses this mapping table is up to them of course.
Ofcourse the StringGrouper doesn't always return the correct results, which is why the add_match / remove_match functions exist.
The output would be something like:
id name name_deduped goup-id BB016741P Mega Enterprises Corporation Mega Enterprises Corporation BB016741P CC082744L Hyper Startup Incorporated Hyper Startup Incorporated CC082744L AA098762D Hyper Startup Inc. Hyper Startup Incorporated CC082744L BB099931J Hyper-Startup Inc. Hyper Startup Incorporated CC082744L HH072982K Hyper Hyper Inc. Hyper Startup Incorporated CC082744L
That's what I was looking for, as long as the (database) IDs are kept matched to their original strings and the IDs are not changed in any way then there is no problem. Of course in database terms we already have pretty well the same functionality in match_strings, but more choices are fine.
However @Bergvca I have noticed separately in the commit comments my name and personal home email address were added. I don't know how it got in there. Can you remove it??
p.s. I just did some research and it appears that if someone commits up to Github with a different local Git email address to their Github account email then Github treats that as a second person and won't hide the email address. I added my home email address to my Github account but my name and personal email is still showing in the commit message.
@ParticularMiner - your personal email address is also showing on the pull request you submitted.
Hi @Bergvca ,
Thank you for your email. Succinctly put!
I've just resolved the conflicts that arose between this branch and yours after your last PR acceptance. This branch is now ready to merge with your master branch.
Now all that's left is the documentation for this additional ID functionality -- assuming of course that you approve of these code changes. I wonder, is @justasojourner interested in writing the documentation or should I write it this time?
Hi @Bergvca - @ParticularMiner and I have done detailed investigation on this. Apart from our names and personal email addresses appearing on the latest commit messages we see them in every commit (metadata).
We (I) were under the impression that if we selected email security in GitHub settings that our emails would not be pushed to remote. It seems not... I did have some conversation with GitHub support, and though I can see what they are saying I did give them feedback that the system is quite a bit ambiguous/misleading. But anyway...
We really would like to have our personal data removed, at the same time we would (apologetically) not wish to disrupt you too much.
There are two options:
git filter-branch
that can edit all commits to remove sensitive information. OK... I do read on the Git website that they say be careful with that command. You would have to execute the command and other tasks as the repo admin.git reset --hard HEAD
to rewind back to the commit of 12 Oct 2020 and we (ParticularMiner and I) do the task of redoing a fork and submitting things again (commit -> PR) which can also be done as one commit from each of us. This will have the added benefit of making commits into the master branch more 'pretty'.We are perfectly happy to do option 2, which will be the lesser work for you. If were OK doing option 1 I can forward you the email from GitHub support with full instructions.
Please let us know what you think/would like to do.
Hi @Bergvca and @justasojourner,
So, at the moment, for the sake of backward compatibility, the functions group_similar_strings
and match_most_similar
each return a nameless index-less pandas Series containing either strings or IDs depending on how they are called (see my previous comment).
Is this sufficient? Or would you prefer both IDs and strings to be output together instead of only IDs?
One way for the functions to do this and still return a Series would be to have them index the returning Series of strings with the IDs. Then, to directly insert the output Series as 2 new columns in an existing DataFrame, a typical group_similar_strings
call would look like the following:
...
# If you want to directly insert the output of
# the function call directly into new columns
# of an existing DataFrame, then first reset
# the index of the DataFrame:
transactions = transactions.reset_index()
# Next, reset the index of the calling function:
transactions[["group-id", "name_deduped"]] = \
group_similar_strings(transactions["customer name"], transactions["customer ID"]).reset_index()
...
Note the 2nd reset_index()
method call at the end (used to convert the output Series into a DataFrame).
Here the variable transactions
is a pandas DataFrame with columns "customer name" and "customer ID" before the call; and it acquires 2 more columns "group-id" and "name_deduped" after the call:
customer ID | customer name | group-id | name_deduped |
---|---|---|---|
BB016741P | Mega Enterprises Corporation | BB016741P | Mega Enterprises Corporation |
CC082744L | Hyper Startup Incorporated | CC082744L | Hyper Startup Incorporated |
AA098762D | Hyper Startup Inc. | CC082744L | Hyper Startup Incorporated |
BB099931J | Hyper-Startup Inc. | CC082744L | Hyper Startup Incorporated |
HH072982K | Hyper Hyper Inc. | CC082744L | Hyper Startup Incorporated |
This is not how the functions work at the moment. But would the above usage be acceptable?
Hi @justasojourner,
I'm fine with doing a git reset --hard to remove the sensitive information, if you both could re-create the pull requests. One thing to keep in mind is that I already pushed a new version to pypi. It would be nice if there was still a way to be able to go back to that version (0.2.2). I didn't add a label to the commit (will do this in the future) that was pushed to pypi, but it was based on the first PR you created (commit: 1eb924bb243334c0772b9f3387e05c829e5f9672). Is it possible for you to create the PR's in the same order? If not - that is also fine. We will just create a new version, but it on pypi, and label the commit that is linked to that new version.
Hi @ParticularMiner ,
Good point, I think for most users it would be best to have both the (group) ID and group string in the output, when using the optional ID column. From what I understand your proposal is to add the ID's as the Series Index. I think that makes sense.
What prevents us from just outputting a dataframe with two columns when the optional ID field is input? It feels less "clean", because we have two different output types, but it happens more often in the Pandas world. E.g. .loc
sometimes outputs a Series and sometimes a DataFrame.
Funny enough this relates to my proposal here, were i suggested to set the ID column as index to input, and it would remain in the output.
Hi @Bergvca,
I see your previous comment indeed hints at this.
I've been thinking the C/C++ way all along -- that's my original background. Hence the strong typing ideals. It's taking me some time getting used to Python's weaker typing system.
Anyway, to return a Series in one case and a DataFrame in another shouldn't be too hard, I suppose. I'll try it.
Hi @Bergvca - thanks for understanding, no problem to do PRs in sequence. I notice though that the last two are not listed as PRs but just commits and you made us contributors. Were 67ac52b & 23806f7 PRs or straight commits??
Anyway we will do all the preparation (to make sure nothing is lost) and share with you and then you can do reset and we post relevant changes (PRs). We’ll do prep tomorrow our time so it’s ready for you your time tomorrow. Will confirm tomorrow that we are ready to proceed.
Basically thinking the plan would be to save the code at the commit/PRs snapshot points as OS files. Then delete our forked repos, then you do reset -HARD HEAD to go back to your commit of 12 October.
Then we 're-fork' from your repo, make branch, copy source files back over to branch, do just one commit (no need for many) and raise PR. Then the other two PRs can be done similarly one by one.
Hi @Bergvca and @justasojourner,
It's done.
As discussed, the functions group_similar_strings
and match_most_similar
can now (depending on how they are called) each return either a nameless index-less pandas Series containing strings or a DataFrame containing both strings and IDs.
To directly insert the output DataFrame as 2 new columns in an existing DataFrame, a typical group_similar_strings
call would look like the following:
...
# If you want to directly insert the output of
# the function call directly into new columns
# of an existing DataFrame, then first reset
# the index of the DataFrame, if its index has
# been set:
transactions = transactions.reset_index()
# Next, call the function:
transactions[["group-id", "name_deduped"]] = \
group_similar_strings(transactions["customer name"], transactions["customer ID"])
...
Here the variable transactions
is a pandas DataFrame with columns "customer name" and "customer ID" before the call; and it acquires 2 more columns "group-id" and "name_deduped" after the call:
customer ID | customer name | group-id | name_deduped |
---|---|---|---|
BB016741P | Mega Enterprises Corporation | BB016741P | Mega Enterprises Corporation |
CC082744L | Hyper Startup Incorporated | CC082744L | Hyper Startup Incorporated |
AA098762D | Hyper Startup Inc. | CC082744L | Hyper Startup Incorporated |
BB099931J | Hyper-Startup Inc. | CC082744L | Hyper Startup Incorporated |
HH072982K | Hyper Hyper Inc. | CC082744L | Hyper Startup Incorporated |
Cheers!
The goal of the preparation will be to ensure that no code/documentation is lost. The preparation part needs only to be done by @justasojourner and @ParticularMiner. The preparation steps should ensure that code/documents that were committed via various PRs can be submitted again in the same 'sequence'.
@Bergvca - can you advise if you want me to do two commits/PRs, or if I can fold the documentation changes into the same PR as the initial PR, that is the initial coding change.
In the procedure @justasojourner will commit/PR first. We will avoid clashes because:
During the process @justasojourner and @ParticularMiner will double check that their Git name & email config are correct. Ha ha.
master
if they are satisfactory. At this point:
Then we should be good! Zo. Nu eerst een Bavaria.
@Bergvca - I've done my prep. copies of files made and both local & remote repos gone.
Hi @justasojourner, @Bergvca,
Whenever you are ready!
I have saved OS file snapshots of the project at
I'll repeat the 'Open Pull Request' procedure to restore the above wherever required.
For completeness, I reiterate that @justasojourner is also able to
Haha - wow, now I see this I think the script option might have been easier.. Anyway, the repo has been restored to the October commit! Go ahead to create the PR's.
I'm fine with puting the documentation in one of the other PRs, whatever is easiest for you.
OK, mine is all double checked and clean and the master branch of my fork
https://github.com/justasojourner/string_grouper
has been merge/synched back to
https://github.com/Bergvca/string_grouper
using an upstream remote.
Ok, all is done now I believe. Thanks for setting it up!
Thanks @Bergvca. You've been quite patient with us!
cc: @justasojourner Hi @Bergvca,
As promised in Issue #22, I've added the the IDs to the remaining 2 functions. I hope it matches your expectations.
Cheers!