Bergvca / string_grouper

Super Fast String Matching in Python

On Group Representatives #26

Closed ParticularMiner closed 3 years ago

ParticularMiner commented 3 years ago

Hi @Bergvca,

@justasojourner and I have been discussing ways to enable String Grouper to choose other candidates to represent the groups it finds.

At the moment String Grouper’s choices are arbitrary — they likely depend on the sorting order of the input string data in the call to function group_similar_strings.

One prudent alternative we could think of was this: choose the earliest record in the group, assuming, of course, that timestamp metadata is available. This suggestion is based on the premise that unnecessary duplicate records are often created after the original record.

As the code stands now, it may be possible (I’m not yet sure; I have to verify this) to achieve the above result by simply sorting the input Series by timestamp in ascending order before applying group_similar_strings on them.
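
For what it's worth, a minimal (untested) sketch of that idea, assuming a DataFrame df with a string column 'name' and a timestamp column 'create_date' (both names made up for illustration):

import pandas as pd
from string_grouper import group_similar_strings

# assumption: df has a string column 'name' and a timestamp column 'create_date'
df = df.sort_values('create_date')  # oldest records first

# if group_similar_strings indeed picks the first member it encounters (and the
# returned Series aligns with the input), each group representative should now
# be the group's earliest record
df['group_rep'] = group_similar_strings(df['name'])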

But perhaps there’s a way to re-code group_similar_strings to accept any given specification for the group representatives.

Your thoughts on this are welcome!

Bergvca commented 3 years ago

Hi @ParticularMiner ,

Yes, currently it picks the first, but this is not "guaranteed". One option would be to give a weight to each string: the string with the highest weight becomes the group representative. For example, the database record that is the "cleanest" by some sort of metric (for example: has the most fields filled in, or has the highest timestamp).

It would be really interesting if such a weight could be calculated within a group, based on a custom function. For example, the string with the highest aggregated similarity within the group (the centroid, if you will).

ParticularMiner commented 3 years ago

cc: @justasojourner

Hi @Bergvca ,

I like the idea of 'weights' because it generalizes the problem quite well and naturally, giving the user the freedom to choose how to define the weights. Moreover, pandas allows custom functions to be applied to groups, so that may very well be possible.

Luckily, the 'centroid' example you gave is directly achievable without additional data input, and thus could be provided to the user as an option.

Incidentally, the pandas docs specify that groupby preserves the order of members of a group:

Note that groupby will preserve the order in which observations are sorted within each group. For example, the groups created by groupby() below are in the order they appeared in the original DataFrame:

In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby(["X"]).get_group("A")
Out[25]: 
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby(["X"]).get_group("B")
Out[26]: 
   X  Y
1  B  4
3  B  2

(_copied from pandas docs: "Group by: split-apply-combine"_)

So as long as this feature of groupby is not deprecated in pandas, we can exploit it by sorting the input data according to 'weight' before grouping and then choosing the first (or last) member. Nevertheless, you are right, we cannot guarantee that this feature will always exist. So we might have to use idxmin() or idxmax() to be safe (and perhaps faster):

Keep other columns when using min() with groupby

In [60]: df = pd.DataFrame(
   ....:     {"AAA": [1, 1, 1, 2, 2, 2, 3, 3], "BBB": [2, 1, 3, 4, 5, 1, 2, 3]}
   ....: )
   ....: 

In [61]: df
Out[61]: 
   AAA  BBB
0    1    2
1    1    1
2    1    3
3    2    4
4    2    5
5    2    1
6    3    2
7    3    3

Method 1 : idxmin() to get the index of the minimums

In [62]: df.loc[df.groupby("AAA")["BBB"].idxmin()]
Out[62]: 
   AAA  BBB
1    1    1
5    2    1
6    3    2

Method 2 : sort then take first of each

In [63]: df.sort_values(by="BBB").groupby("AAA", as_index=False).first()
Out[63]: 
   AAA  BBB
0    1    1
1    2    1
2    3    2

Notice the same results, with the exception of the index.

(_copied from pandas docs: "Cookbook"_)

Bergvca commented 3 years ago

So maybe the "centroid" (highest cumulative weight) should ideally be the default. However, I'm afraid it could add a lot of computational cost: for each string within each group you'd have to sum all the weights, and then pick the highest sum.

The weights could then be given as an optional Series.

Or we add a parameter (e.g. group_rep) that could either be a string: 'first' (default, current method), 'centroid', or a Series with weights.

Or a parameter that could be a string only: 'first', 'centroid', 'weight_based'. For the weight_based option an additional Series with weights needs to be given.
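
Roughly, just to make the idea concrete (not actual code, names purely illustrative):

import pandas as pd

def group_similar_strings(strings_to_group: pd.Series,
                          group_rep: str = 'first',   # 'first' | 'centroid' | 'weight_based'
                          weights: pd.Series = None,  # required only when group_rep='weight_based'
                          **kwargs):
    ...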

What do you think?

ParticularMiner commented 3 years ago

Great. Exactly what I thought. My plan to compute the "centroid" as you've defined it and reduce computational costs is to:

  1. sum the "similarity sparse matrix" (which is square) along the rows (or columns) using scipy.sparse.csr_matrix.sum to obtain a vector of similarity aggregates, one for each string.
  2. join the result in (1) with the (raw) group labels.
  3. group by the raw group labels and aggregate using idxmax()

The rest follows as usual.
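
In code, the sketch would be something like this (variable names are illustrative: similarity_matrix being the square scipy.sparse.csr_matrix of cosine similarities, group_labels the raw group label of each string):

import numpy as np
import pandas as pd

# step 1: sum the square similarity matrix along the rows to get one
# aggregate similarity per string
aggregate_similarity = np.asarray(similarity_matrix.sum(axis=1)).squeeze()

# step 2: join the aggregates with the raw group labels
members = pd.DataFrame({'group': group_labels, 'weight': aggregate_similarity})

# step 3: within each group, take the index of the string with the highest aggregate
group_rep_index = members.groupby('group')['weight'].idxmax()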

Note of caution: while I like the idea of the "centroid" very much, I feel we must at least ensure that it gives what we really want. That may mean taking a closer look at the structure of a few sample groups (for example, visualizing their connectivity). But perhaps you already have a better idea of what these "graphs" look like and you are confident that the centroid is the appropriate measure?

justasojourner commented 3 years ago

Hello - can I just add some practical, real-world examples/benefits, since you are talking about grouping? If the consideration is finding the ‘primary’/lead value to group on, then two ways to do this are to:

  1. Get a ‘create_date’ field (very common in a DB) added from the database into the DataFrame. For each group of similar values, find the row with the lowest date value. That will almost always be the first record; the others will be dupes. Group the other similar values into that one.
  2. Use ‘ancillary value’ fields to increase the importance of a row. For example, for a customer record, use a postal code field. If the field has a value, then more effort has been put into that row compared to a row that has not had the postal code (or other extra fields) filled in. Make the row with the most data the ‘primary’ row to group into (see the sketch after this list).
  3. Both of the above.
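
Something like this, just to illustrate (column names are made up):

import pandas as pd

# count how many ancillary fields are filled in for each row
df['filled_in'] = df[['postal_code', 'phone', 'email']].notna().sum(axis=1)

# the fullest row wins; among equally full rows the oldest (lowest create_date) wins
df = df.sort_values(['filled_in', 'create_date'], ascending=[False, True])
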
ParticularMiner commented 3 years ago

cc: @Bergvca

Hi @justasojourner ,

Thanks for your contributions. These will certainly be included as built-in options available to the user, as @Bergvca suggested earlier:

One option would be to give a weight to each string. The highest weight gets to become the group representative. For example the database record that is the "cleanest" by some sort of metric (for example: has most fields filled in, or has highest timestamp).

Let me now summarize the list of options we have gathered so far (just so we are all on the same page):

  1. The default. group_rep='first' will be provided as a keyword argument. This is the fastest option and it is assumed here that the user doesn't care much which string is chosen as group representative. With the current version of pandas, the final result is determined by the sort order of the input strings, and no additional input data is required.

  2. group_rep='oldest'. This is the case of the timestamp (or create_date) field. We can define

    weight (or importance) = <age of record> = <current date> - create_date

    Here an optional Series argument named, say, timestamp, will be provided to contain the timestamp data.

  3. group_rep='fullest' (or 'cleanest'). This is the case of the "filled-in status" of the "ancillary fields", and we may define

    weight = <number of non-null fields>

    Here an optional DataFrame parameter named, say, other_fields, will be provided to contain the ancillary field data. (This parameter seems somewhat loaded to me though --- I hope it doesn't require copying excessively large amounts of data. We should consider passing this parameter by reference if possible.)

  4. As for combining the latter two cases, I'm not sure. On the one hand, if multiple rows in a group are equally "well-filled", the user may prefer the one with the oldest timestamp. Yet, on the other hand, an older row that is not so well-filled may still be preferred.

    I suppose we cannot speculate for the user whether the one case is more (or less) important than the other. Which is why we may need to leave it up to the user to decide by allowing him/her to provide an input series of weights precomputed according to his/her own discretion, as @Bergvca suggested.

    This is the most versatile option, as it permits any definition of weight as yet unknown to us. Here again, an optional Series argument named, say, weights will be provided, with or without the keyword argument value group_rep='heaviest' (or 'weight-based').

  5. group_rep='centroid' (or the string with the highest similarity 'aggregate'), which is the most interesting to me at the moment, because it hints at identifying the correct group representative by intuiting that the majority of the duplicate strings will be matched to the original string with the highest cosine similarities. But whether this is actually true (or true most of the time) may require further study/observation, as I tried to explain in my previous message.

    Here the weights will be determined by simply summing the similarity sparse square matrix along the rows (or columns), and no additional input data will be required.

As you can see, in all these cases (except for the default) appropriate weights can be defined, so that idxmax() can be used to identify the group member with the highest weight as the group representative.
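
In all the weight-based cases, the selection itself would then boil down to the same pattern (names illustrative: strings is the input string Series, group_labels the raw group label of each string, weights whatever Series the chosen option defines; all assumed equally long and index-aligned):

import pandas as pd

members = pd.DataFrame({'string': strings, 'group': group_labels, 'weight': weights})

# index of the heaviest member of each group ...
rep_index = members.groupby('group')['weight'].idxmax()

# ... mapped back onto every member of its group
rep_by_group = members.loc[rep_index].set_index('group')['string']
members['group_rep'] = members['group'].map(rep_by_group)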

If I've missed anything, do let me know please.