Hi @ParticularMiner,
Yes, currently it picks the first, but this is not "guaranteed". One option would be to give a weight to each string. The highest weight gets to become the group representative. For example the database record that is the "cleanest" by some sort of metric (for example: has most fields filled in, or has highest timestamp).
It would be really interesting if such a weight could be calculated within a group, based on a custom function. For example the string with the highest aggregated similarity within the group (the centroid if you will).
cc: @justasojourner
Hi @Bergvca,
I like the idea of 'weights' because it generalizes the problem quite well and naturally, giving the user the freedom to choose how to define the weights. Moreover, pandas allows custom functions to be applied to groups, so that may very well be possible.
Luckily, the 'centroid' example you gave is directly achievable without additional data input, and thus could be provided to the user as an option.
Incidentally, the pandas docs specify that `groupby` preserves the order of members of a group:

> Note that groupby will preserve the order in which observations are sorted within each group. For example, the groups created by `groupby()` below are in the order they appeared in the original DataFrame:

```python
In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby(["X"]).get_group("A")
Out[25]:
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby(["X"]).get_group("B")
Out[26]:
   X  Y
1  B  4
3  B  2
```

(_copied from the pandas docs: "Group by: split-apply-combine"_)
So as long as this feature of `groupby` is not deprecated in pandas, we can exploit it by sorting the input data according to 'weight' before grouping and then choosing the first (or last) member of each group. Nevertheless, you are right: we cannot guarantee that this feature will always exist. So we might have to use `idxmin()` or `idxmax()` to be safe (and they are perhaps faster):
**Keep other columns when using min() with groupby**

```python
In [60]: df = pd.DataFrame(
   ....:     {"AAA": [1, 1, 1, 2, 2, 2, 3, 3], "BBB": [2, 1, 3, 4, 5, 1, 2, 3]}
   ....: )

In [61]: df
Out[61]:
   AAA  BBB
0    1    2
1    1    1
2    1    3
3    2    4
4    2    5
5    2    1
6    3    2
7    3    3
```

Method 1: `idxmin()` to get the index of the minimums

```python
In [62]: df.loc[df.groupby("AAA")["BBB"].idxmin()]
Out[62]:
   AAA  BBB
1    1    1
5    2    1
6    3    2
```

Method 2: sort then take first of each

```python
In [63]: df.sort_values(by="BBB").groupby("AAA", as_index=False).first()
Out[63]:
   AAA  BBB
0    1    1
1    2    1
2    3    2
```

Notice the same results, with the exception of the index.

(_copied from the pandas docs: "Cookbook"_)
So maybe the "centroid" (highest cumulative weight) should ideally be the default. However, I'm afraid it could add a lot of computational cost: for each string within each group you'd have to sum all the weights and then pick the highest sum.
* The weights could then be given as an optional Series.
* Or we add a parameter (e.g. `group_rep`) that could be either a string, `'first'` (default, the current method) or `'centroid'`, or a Series with weights.
* Or a parameter that could be a string only: `'first'`, `'centroid'`, or `'weight_based'`. For the `weight_based` option an additional Series with weights needs to be given.
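To illustrate, here is a rough sketch of how the second variant might dispatch on such a `group_rep` parameter (the function name, signature, and logic are hypothetical, not the actual string_grouper API):

```python
from typing import Union

import pandas as pd


def choose_group_rep(strings: pd.Series,
                     groups: pd.Series,
                     group_rep: Union[str, pd.Series] = "first") -> pd.Series:
    """Hypothetical helper: pick one representative string per group."""
    if isinstance(group_rep, pd.Series):    # user-supplied weights
        rep_idx = group_rep.groupby(groups).idxmax()
    elif group_rep == "first":              # current (unguaranteed) behaviour
        rep_idx = strings.groupby(groups).apply(lambda s: s.index[0])
    elif group_rep == "centroid":           # highest aggregate similarity
        raise NotImplementedError("requires the similarity matrix")
    else:
        raise ValueError(f"unknown group_rep: {group_rep!r}")
    return strings.loc[rep_idx]
```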
What do you think?
Great. Exactly what I thought. My plan to compute the "centroid" as you've defined it while reducing computational costs is to:

1. use `scipy.sparse.csr_matrix.sum` to obtain a vector of similarity aggregates, each corresponding to a string;
2. use `idxmax()` within each group to pick the string with the highest aggregate (see the sketch below).

The rest follows as usual.
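For concreteness, here is a minimal sketch of that plan (the variable names are hypothetical; `similarity` stands in for the square sparse cosine-similarity matrix String Grouper already computes, and `group_of` for the group label of each string):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# toy stand-in for the sparse cosine-similarity matrix over 4 strings
similarity = csr_matrix(np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.7],
    [0.0, 0.0, 0.7, 1.0],
]))
group_of = pd.Series([0, 0, 1, 1])  # group label of each string

# step 1: sum along the rows to get each string's aggregate similarity
aggregate = pd.Series(np.asarray(similarity.sum(axis=1)).ravel())

# step 2: within each group, idxmax() picks the "centroid" string
centroids = aggregate.groupby(group_of).idxmax()
print(centroids)  # maps each group label to the index of its centroid
```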
Note of caution: while I like the idea of the "centroid" very much, I feel we must at least ensure that it gives what we really want. That may mean taking a closer look at the structure of a few sample groups (for example, by visualizing their connectivity). But perhaps you already have a better idea of what these "graphs" look like and you are confident that the centroid is the appropriate measure?
Hello - can I just add some practical, real-world examples/benefits since you are talking about grouping. If the consideration is finding the 'primary'/lead value to group on, then two ways to do this are to:

1. choose the oldest record, i.e. the one with the earliest creation timestamp (e.g. a `_createdate` field), on the premise that duplicates tend to be created after the original record; or
2. choose the "cleanest"/"fullest" record, i.e. the one with the most ancillary fields filled in.
cc: @Bergvca
Hi @justasojourner,
Thanks for your contributions. These will certainly be included as builtin options available to the user, as @Bergvca suggested earlier:
> One option would be to give a weight to each string. The highest weight gets to become the group representative. For example the database record that is the "cleanest" by some sort of metric (for example: has most fields filled in, or has highest timestamp).
Let me now summarize the list of options we have gathered so far (just so we are all on the same page):
1. The default: `group_rep='first'` will be provided as a keyword argument. This is the fastest option, and it is assumed here that the user doesn't care much which string is chosen as group representative. With the current version of pandas, the final result is determined by the sort order of the input strings, and no additional input data is required.
2. `group_rep='oldest'`. This is the case of the timestamp (or `_createdate`) field. We can define

   `weight (or importance) = <age of record> = <current date> - <create_date>`

   Here an optional Series argument named, say, `timestamp`, will be provided to contain the timestamp data.
3. `group_rep='fullest'` (or `'cleanest'`). This is the case of the "filled-in status" of the "ancillary fields", and we may define

   `weight = <number of non-null fields>`

   Here an optional DataFrame parameter named, say, `other_fields`, will be provided to contain the ancillary field data. (This parameter seems somewhat loaded to me though --- I hope it doesn't require copying excessively large amounts of data. We should consider passing this parameter by reference if possible.)
4. In the case of combining the two latter cases, I'm not sure. On the one hand, if multiple rows in a group are equally "well-filled", the user may prefer the one with the oldest timestamp. Yet, on the other hand, an older row that is not so well-filled may still be preferred. I suppose we cannot speculate for the user whether the one case is more (or less) important than the other, which is why we may need to leave it up to the user to decide by allowing him/her to provide an input Series of weights precomputed according to his/her own discretion, as @Bergvca suggested. This is the most versatile option, as it permits any definition of weight as yet unknown to us. Here again, an optional Series argument named, say, `weights`, will be provided with/without the keyword argument value `group_rep='heaviest'` (or `'weight-based'`).
5. `group_rep='centroid'` (or the string with the highest similarity "aggregate"), which is the most interesting to me at the moment, because it hints at identifying the correct group representative by intuiting that the majority of the duplicate strings will be matched to the original string with the highest cosine similarities. But whether this is actually true (or true most of the time) may require further study/observation, as I tried to explain in my previous message. Here the weights will be determined by simply summing the similarity sparse square matrix along the rows (or columns), and no additional input data will be required.
As you can see, in all these cases (except for the default) appropriate weights can be defined, so that `idxmax()` can be used to identify the group member with the highest weight as the group representative, as in the sketch below.
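All the weight-based options above reduce to the same pattern; here is a minimal sketch (the column names and data are hypothetical):

```python
import pandas as pd

# hypothetical frame: one row per string, with its group label and a
# precomputed weight (record age, non-null field count, similarity
# aggregate, or a user-supplied value)
df = pd.DataFrame({
    "string": ["Acme Inc", "ACME Incorporated", "Foo Ltd", "Foo Limited"],
    "group":  [0, 0, 1, 1],
    "weight": [2.5, 1.0, 0.3, 1.8],
})

# idxmax() yields, per group, the row index of its heaviest member;
# those rows hold the group representatives
reps = df.loc[df.groupby("group")["weight"].idxmax()]
print(reps)
```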
If I've missed anything, do let me know please.
Hi @Bergvca,
@justasojourner and I have been discussing ways to enable String Grouper to choose other candidates to represent the groups it finds.
At the moment String Grouper’s choices are arbitrary — they likely depend on the sort order of the input string data in the call to the function `group_similar_strings`. One prudent alternative we could think of was this: choose the earliest record in the group, assuming of course that timestamp metadata is available. This suggestion is based on the premise that unnecessary duplicate records are often created after the original record.
As the code stands now, it may be possible (I’m not yet sure; I have to verify this) to achieve the above result by simply sorting the input Series by timestamp in ascending order before applying `group_similar_strings` to it. But perhaps there’s a way to re-code `group_similar_strings` to accept any given specification for the group representatives. Your thoughts on this are welcome!