I will have to take time to go through and digest this; there is a bit to read.
@ParticularMiner
Questions:
String Grouper can use datetime data in a Series to determine group representatives when grouping strings.
I find the various docs on datetime datatypes somewhat confusing, so I decided to play around with the datatype myself --- mostly for my own reference. Still, here are the results of my 'experiments'.
I'm here restricting myself to the pandas implementation of datetime: pandas.Timestamp.
NB. Currently, String Grouper's group_similar_strings function is naive to the pandas.Timestamp datatype. The optional parameter timestamps (required when group_rep='oldest') does not necessarily require pandas.Timestamp datatypes --- it can accept decimals, integers, numpy datetime datatypes, pandas.Timestamp datatypes that have all been localized to the same timezone (see below), and so on. But it does not accept strings. Strings should first be converted to an acceptable datatype.
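To make that concrete, here are a few hypothetical examples of Series that should be acceptable for the timestamps argument as described above, plus the kind of conversion a string column would need first (a sketch for illustration only):

```python
import numpy as np
import pandas as pd

# Any of these should be acceptable as `timestamps=` (decimals, integers, numpy datetimes):
as_floats = pd.Series([0.2, 1.5, 0.7])
as_ints = pd.Series([3, 1, 2])
as_numpy_datetimes = pd.Series(
    np.array(['2014-12-30', '1998-04-14', '2013-07-01'], dtype='datetime64[ns]')
)

# ...whereas a Series of strings is not accepted and would first need converting, e.g.:
as_strings = pd.Series(['2014-12-30', '1998-04-14', '2013-07-01'])
converted = pd.to_datetime(as_strings)  # -> dtype datetime64[ns]
```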
import pandas as pd
from string_grouper import group_similar_strings
The following function list_datatypes_of is just a utility to examine the datatypes of each column of a DataFrame:
def list_datatypes_of(dataframe):
    # Print the Python type of the value in the second row of each column
    # (the element-wise types, as opposed to the column dtypes):
    for col in range(len(dataframe.columns)):
        print(f"DataType of {dataframe.columns[col]}: ", type(dataframe.iloc[:, col][1]))
Here's some artificial data I pulled off the top of my head:
customers_df_timestamp_is_a_string = pd.DataFrame(
[
('BB016741P', 'Mega Enterprises Corporation', 'Address0', 'Tel0', 'Description0', 0.2, '2014-12-30 10:55:00-02:00'),
('CC082744L', 'Hyper Startup Incorporated', '', 'Tel1', '', 0.5, '2017-01-01 20:23:15-05:00'),
('AA098762D', 'Hyper Startup Inc.', 'Address2', 'Tel2', 'Description2', 0.3, '2020-10-20 15:29:30+02:00'),
('BB099931J', 'Hyper-Startup Inc.', 'Address3', 'Tel3', 'Description3', 0.1, '2013-07-01 03:34:45-05:00'),
('HH072982K', 'Hyper Hyper Inc.', 'Address4', '', 'Description4', 0.9, '2005-09-11 11:56:00-07:00'),
('EE059082Q', 'Mega Enterprises Corp.', 'Address5', 'Tel5', 'Description5', 1.0, '1998-04-14 09:21:11+00:00')
],
columns=('Customer ID', 'Customer Name', 'Address', 'Tel', 'Description', 'weight', 'timestamp')
)
Notice that each item of the 'timestamp' column is actually a string! And this is not apparent from the display:
customers_df_timestamp_is_a_string
|   | Customer ID | Customer Name | Address | Tel | Description | weight | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | BB016741P | Mega Enterprises Corporation | Address0 | Tel0 | Description0 | 0.2 | 2014-12-30 10:55:00-02:00 |
| 1 | CC082744L | Hyper Startup Incorporated |  | Tel1 |  | 0.5 | 2017-01-01 20:23:15-05:00 |
| 2 | AA098762D | Hyper Startup Inc. | Address2 | Tel2 | Description2 | 0.3 | 2020-10-20 15:29:30+02:00 |
| 3 | BB099931J | Hyper-Startup Inc. | Address3 | Tel3 | Description3 | 0.1 | 2013-07-01 03:34:45-05:00 |
| 4 | HH072982K | Hyper Hyper Inc. | Address4 |  | Description4 | 0.9 | 2005-09-11 11:56:00-07:00 |
| 5 | EE059082Q | Mega Enterprises Corp. | Address5 | Tel5 | Description5 | 1.0 | 1998-04-14 09:21:11+00:00 |
... nor from the DataFrame's dtypes attribute:
customers_df_timestamp_is_a_string.dtypes
Customer ID object
Customer Name object
Address object
Tel object
Description object
weight float64
timestamp object
dtype: object
But it is clear from the utility function:
list_datatypes_of(customers_df_timestamp_is_a_string)
DataType of Customer ID: <class 'str'>
DataType of Customer Name: <class 'str'>
DataType of Address: <class 'str'>
DataType of Tel: <class 'str'>
DataType of Description: <class 'str'>
DataType of weight: <class 'numpy.float64'>
DataType of timestamp: <class 'str'>
How does group_similar_strings fare with such data? Not well, as expected. That's because it has no way of determining the oldest of a group of strings. Try it yourself:
customers_df_timestamp_is_a_string[['group rep ID', 'group rep']] = \
group_similar_strings(
customers_df_timestamp_is_a_string['Customer Name'],
customers_df_timestamp_is_a_string['Customer ID'],
timestamps=customers_df_timestamp_is_a_string['timestamp'],
group_rep='oldest',
min_similarity=0.6)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
...
...
TypeError: reduction operation 'argmin' not allowed for this dtype
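(As an aside: my guess --- not a statement about String Grouper's internals --- is that the error comes from pandas refusing to take an argmin/idxmin over an object-dtype Series. A minimal sketch that should trigger the same kind of exception:)

```python
import pandas as pd

# Date-like strings give a Series of dtype 'object'...
s = pd.Series(['2014-12-30 10:55:00-02:00', '1998-04-14 09:21:11+00:00'])

# ...and 'argmin'-style reductions are not defined for object dtype:
try:
    s.idxmin()
except TypeError as err:
    print(err)
```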
Now I'm going to convert the 'timestamp' column into the pandas.Timestamp datatype. First, I'm just curious how pandas would save such data to a CSV file:
customers_df_timestamp_is_a_string.to_csv('customers_df_timestamp_is_a_string.csv')
Here are its contents:
,Customer ID,Customer Name,Address,Tel,Description,weight,timestamp
0,BB016741P,Mega Enterprises Corporation,Address0,Tel0,Description0,0.2,2014-12-30 10:55:00-02:00
1,CC082744L,Hyper Startup Incorporated,,Tel1,,0.5,2017-01-01 20:23:15-05:00
2,AA098762D,Hyper Startup Inc.,Address2,Tel2,Description2,0.3,2020-10-20 15:29:30+02:00
3,BB099931J,Hyper-Startup Inc.,Address3,Tel3,Description3,0.1,2013-07-01 03:34:45-05:00
4,HH072982K,Hyper Hyper Inc.,Address4,,Description4,0.9,2005-09-11 11:56:00-07:00
5,EE059082Q,Mega Enterprises Corp.,Address5,Tel5,Description5,1.0,1998-04-14 09:21:11+00:00
Let's get on with converting the 'timestamp' column into the pandas.Timestamp datatype. First, copy the DataFrame into a new one:
customers_df_timestamp_is_a_true_pandas_timestamp = customers_df_timestamp_is_a_string.copy()
Now do the conversion:
customers_df_timestamp_is_a_true_pandas_timestamp['timestamp']=\
customers_df_timestamp_is_a_true_pandas_timestamp['timestamp'].transform(lambda t: pd.Timestamp(t))
Nothing seems to have changed from the display:
customers_df_timestamp_is_a_true_pandas_timestamp
|   | Customer ID | Customer Name | Address | Tel | Description | weight | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | BB016741P | Mega Enterprises Corporation | Address0 | Tel0 | Description0 | 0.2 | 2014-12-30 10:55:00-02:00 |
| 1 | CC082744L | Hyper Startup Incorporated |  | Tel1 |  | 0.5 | 2017-01-01 20:23:15-05:00 |
| 2 | AA098762D | Hyper Startup Inc. | Address2 | Tel2 | Description2 | 0.3 | 2020-10-20 15:29:30+02:00 |
| 3 | BB099931J | Hyper-Startup Inc. | Address3 | Tel3 | Description3 | 0.1 | 2013-07-01 03:34:45-05:00 |
| 4 | HH072982K | Hyper Hyper Inc. | Address4 |  | Description4 | 0.9 | 2005-09-11 11:56:00-07:00 |
| 5 | EE059082Q | Mega Enterprises Corp. | Address5 | Tel5 | Description5 | 1.0 | 1998-04-14 09:21:11+00:00 |
... nor from the DataFrame's dtypes attribute:
customers_df_timestamp_is_a_true_pandas_timestamp.dtypes
Customer ID object
Customer Name object
Address object
Tel object
Description object
weight float64
timestamp object
dtype: object
But the utility function has noticed the change: pandas.Timestamp is here!
list_datatypes_of(customers_df_timestamp_is_a_true_pandas_timestamp)
DataType of Customer ID: <class 'str'>
DataType of Customer Name: <class 'str'>
DataType of Address: <class 'str'>
DataType of Tel: <class 'str'>
DataType of Description: <class 'str'>
DataType of weight: <class 'numpy.float64'>
DataType of timestamp: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
group_similar_strings still does not work. Why? Because the timestamp data are not all from the same timezone, so the column still has the generic object dtype (as seen above). Try it yourself:
customers_df_timestamp_is_a_true_pandas_timestamp[['group rep ID', 'group rep']] = \
group_similar_strings(
customers_df_timestamp_is_a_true_pandas_timestamp['Customer Name'],
customers_df_timestamp_is_a_true_pandas_timestamp['Customer ID'],
timestamps=customers_df_timestamp_is_a_true_pandas_timestamp['timestamp'],
group_rep='oldest',
min_similarity=0.6)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
...
...
TypeError: reduction operation 'argmin' not allowed for this dtype
To solve this problem, let's try one more conversion: this time to numpy.datetime64. We follow the same procedure as above:
customers_df_timestamp_is_a_numpy_datetime = customers_df_timestamp_is_a_true_pandas_timestamp.copy()
customers_df_timestamp_is_a_numpy_datetime['timestamp']=\
customers_df_timestamp_is_a_numpy_datetime['timestamp'].transform(lambda t: t.to_numpy())
This time, though, you do see a change in the displayed timestamps (interesting!):
customers_df_timestamp_is_a_numpy_datetime
|   | Customer ID | Customer Name | Address | Tel | Description | weight | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | BB016741P | Mega Enterprises Corporation | Address0 | Tel0 | Description0 | 0.2 | 2014-12-30 12:55:00 |
| 1 | CC082744L | Hyper Startup Incorporated |  | Tel1 |  | 0.5 | 2017-01-02 01:23:15 |
| 2 | AA098762D | Hyper Startup Inc. | Address2 | Tel2 | Description2 | 0.3 | 2020-10-20 13:29:30 |
| 3 | BB099931J | Hyper-Startup Inc. | Address3 | Tel3 | Description3 | 0.1 | 2013-07-01 08:34:45 |
| 4 | HH072982K | Hyper Hyper Inc. | Address4 |  | Description4 | 0.9 | 2005-09-11 18:56:00 |
| 5 | EE059082Q | Mega Enterprises Corp. | Address5 | Tel5 | Description5 | 1.0 | 1998-04-14 09:21:11 |
The DataFrame's dtypes attribute notices!
customers_df_timestamp_is_a_numpy_datetime.dtypes
Customer ID object
Customer Name object
Address object
Tel object
Description object
weight float64
timestamp datetime64[ns]
dtype: object
Strangely, the utility function does not register a change in datatype (interesting!). That is because pandas hands each individual value back as a pandas.Timestamp, even when the column's dtype is datetime64[ns]:
list_datatypes_of(customers_df_timestamp_is_a_numpy_datetime)
DataType of Customer ID: <class 'str'>
DataType of Customer Name: <class 'str'>
DataType of Address: <class 'str'>
DataType of Tel: <class 'str'>
DataType of Description: <class 'str'>
DataType of weight: <class 'numpy.float64'>
DataType of timestamp: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
But ... YAY! group_similar_strings works:
customers_df_timestamp_is_a_numpy_datetime[['group rep ID', 'group rep']] = \
group_similar_strings(
customers_df_timestamp_is_a_numpy_datetime['Customer Name'],
customers_df_timestamp_is_a_numpy_datetime['Customer ID'],
timestamps=customers_df_timestamp_is_a_numpy_datetime['timestamp'],
group_rep='oldest',
min_similarity=0.6)
... and here are the results:
customers_df_timestamp_is_a_numpy_datetime
|   | Customer ID | Customer Name | Address | Tel | Description | weight | timestamp | group rep ID | group rep |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BB016741P | Mega Enterprises Corporation | Address0 | Tel0 | Description0 | 0.2 | 2014-12-30 12:55:00 | EE059082Q | Mega Enterprises Corp. |
| 1 | CC082744L | Hyper Startup Incorporated |  | Tel1 |  | 0.5 | 2017-01-02 01:23:15 | BB099931J | Hyper-Startup Inc. |
| 2 | AA098762D | Hyper Startup Inc. | Address2 | Tel2 | Description2 | 0.3 | 2020-10-20 13:29:30 | BB099931J | Hyper-Startup Inc. |
| 3 | BB099931J | Hyper-Startup Inc. | Address3 | Tel3 | Description3 | 0.1 | 2013-07-01 08:34:45 | BB099931J | Hyper-Startup Inc. |
| 4 | HH072982K | Hyper Hyper Inc. | Address4 |  | Description4 | 0.9 | 2005-09-11 18:56:00 | HH072982K | Hyper Hyper Inc. |
| 5 | EE059082Q | Mega Enterprises Corp. | Address5 | Tel5 | Description5 | 1.0 | 1998-04-14 09:21:11 | EE059082Q | Mega Enterprises Corp. |
I'm curious to see how things changed from stage to stage. So I summarize the conversion results here using the 'timestamp' column data from all the stages of conversion:
pd.DataFrame(
{
'string datatype': customers_df_timestamp_is_a_string['timestamp'],
'pandas.Timestamp datatype (time zone)': customers_df_timestamp_is_a_true_pandas_timestamp['timestamp'],
'pandas.Timestamp datatype (localized)': customers_df_timestamp_is_a_numpy_datetime['timestamp']
}
)
|   | string datatype | pandas.Timestamp datatype (time zone) | pandas.Timestamp datatype (localized) |
|---|---|---|---|
| 0 | 2014-12-30 10:55:00-02:00 | 2014-12-30 10:55:00-02:00 | 2014-12-30 12:55:00 |
| 1 | 2017-01-01 20:23:15-05:00 | 2017-01-01 20:23:15-05:00 | 2017-01-02 01:23:15 |
| 2 | 2020-10-20 15:29:30+02:00 | 2020-10-20 15:29:30+02:00 | 2020-10-20 13:29:30 |
| 3 | 2013-07-01 03:34:45-05:00 | 2013-07-01 03:34:45-05:00 | 2013-07-01 08:34:45 |
| 4 | 2005-09-11 11:56:00-07:00 | 2005-09-11 11:56:00-07:00 | 2005-09-11 18:56:00 |
| 5 | 1998-04-14 09:21:11+00:00 | 1998-04-14 09:21:11+00:00 | 1998-04-14 09:21:11 |
The last conversion is also a "localization" (a normalization of the pandas Timestamp data to a single timezone). That's how String Grouper likes it. (Examine the values to assure yourself that the localization is indeed correct.)
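As a side note, this "convert, then localize" step can probably also be done in a single vectorized call on the whole column, rather than element by element (a sketch, assuming pd.to_datetime handles the mixed offsets as expected):

```python
# Convert the string column to UTC-aware Timestamps, then drop the timezone info,
# leaving a plain datetime64[ns] column (all values expressed in UTC):
localized = (
    pd.to_datetime(customers_df_timestamp_is_a_string['timestamp'], utc=True)
      .dt.tz_localize(None)
)
```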
So to answer your questions:
What if timestamps are not available? What if the database only has dates?
In that case, still use pandas Timestamps. (pandas simply sets their time portions to 00:00:00.)
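A quick check you can run yourself (the date-only string just gets a midnight time part):

```python
import pandas as pd

print(pd.Timestamp('2013-07-01'))             # 2013-07-01 00:00:00
print(pd.Timestamp('2013-07-01').to_numpy())  # 2013-07-01T00:00:00.000000000
```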
What about timezone aware timestamps? If a timestamp is TZ aware then that means the (absolute) datetime value would have to be judged with respect to UTC time, which will make the whole process more difficult.
The user need only use pandas Timestamp facilities to localize/convert all the different timezone data into a single timezone. to_numpy() is sufficient, as shown in my previous message.
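For example, a minimal sketch of what I mean by "localize/convert with pandas Timestamp facilities":

```python
import pandas as pd

ny = pd.Timestamp('2017-01-01 20:23:15-05:00')  # timezone-aware (UTC-05:00)

# Either express it explicitly in a single timezone...
print(ny.tz_convert('UTC'))  # 2017-01-02 01:23:15+00:00

# ...or let to_numpy() fold the offset in and return a timezone-naive numpy datetime:
print(ny.to_numpy())         # 2017-01-02T01:23:15.000000000
```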
It might be simpler to just use dates?
I don't think so. Dates are simply datetimes or Timestamps with the time part ignored (or set to 00:00:00). So datetimes can be used as dates. In fact, I do not know of any date-only datatype. Do you?
Where are you getting the column titles from?
I specify the column titles upon creation of the DataFrame customers_df. The additional columns (like 'group rep ID' and 'group rep') are obtained through assignment.
Vanilla grouping?
You are referring to the default? This is the same as setting the option group_rep='first'. String Grouper uses the first string in the group as the representative. It is therefore based on the order in which the strings appear.
'Centroid' — can you please, in non-mathematical, plain English, explain this option ;-)
I'll try ... so each string has a number of duplicates to which it has been matched, and each match has a similarity rating. Adding up the similarity ratings over all the matches for a particular string gives you the weight of that string. The string in the group with the highest weight is the centroid of the group and is then chosen as the representative.
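Put as a sketch (illustrating the idea only, not String Grouper's actual code): given a hypothetical table of matches with similarity scores, the weights would be computed roughly like this:

```python
import pandas as pd

# Hypothetical matches: each row says one string matched another with some similarity.
matches = pd.DataFrame({
    'name':       ['Hyper Startup Inc.', 'Hyper Startup Inc.', 'Hyper-Startup Inc.'],
    'other_name': ['Hyper-Startup Inc.', 'Hyper Startup Incorporated', 'Hyper Startup Incorporated'],
    'similarity': [0.93, 0.71, 0.65],
})

# Each string's "weight" is the sum of its similarity ratings over all its matches:
weights = matches.groupby('name')['similarity'].sum()

# The group member with the largest weight would be the group's "centroid":
print(weights.idxmax())  # -> 'Hyper Startup Inc.' in this toy example
```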
'Weight-Based' — the 'weight' column data comes via a database import? I don't think this will happen in a real-world scenario. I don't have any databases with a 'weight' column. Obviously to add such data manually is not an option, so if it is done programmatically then how are the weight values calculated?
You're quite right. In these examples, I just chose the easiest way to manufacture a weight Series: by extracting it from a column in the database. But that is only my choice. You can input any Series there as long as it has the same number of rows as the strings to group.
'Cleanest' — based on what criteria? Is this checking rows with respect to finding rows that have the most filled in fields? With the premise that a row that has the most data is the most likely to be a 'good' record?
Yes, it is exactly as you say.
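Purely as an illustration of that criterion (not the library's actual implementation), the "cleanliness" of each row could be scored by counting its filled-in fields:

```python
# Count the non-empty, non-null fields per row of the example data above;
# the row with the highest count would be considered the "cleanest".
df = customers_df_timestamp_is_a_string
completeness = df.replace('', pd.NA).notna().sum(axis=1)
print(completeness)
```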
import pandas as pd
from string_grouper import group_similar_strings
So I've got a CSV file "customers_df_timestamp_is_a_true_pandas_timestamp.csv" containing, among other things, timestamps in string format. I will use the following utility function to display its contents:
def display_file_contents(file):
    with open(file, 'r') as f:
        contents = f.read()
    print(contents)
display_file_contents("customers_df_timestamp_is_a_true_pandas_timestamp.csv")
,Customer ID,Customer Name,Address,Tel,Description,weight,timestamp
0,BB016741P,Mega Enterprises Corporation,Address0,Tel0,Description0,0.2,2014-12-30 10:55:00-02:00
1,CC082744L,Hyper Startup Incorporated,,Tel1,,0.5,2017-01-01 20:23:15-05:00
2,AA098762D,Hyper Startup Inc.,Address2,Tel2,Description2,0.3,2020-10-20 15:29:30+02:00
3,BB099931J,Hyper-Startup Inc.,Address3,Tel3,Description3,0.1,2013-07-01 03:34:45-05:00
4,HH072982K,Hyper Hyper Inc.,Address4,,Description4,0.9,2005-09-11 11:56:00-07:00
5,EE059082Q,Mega Enterprises Corp.,Address5,Tel5,Description5,1.0,1998-04-14 09:21:11+00:00
As you can see, the 'timestamp' column has datetime + timezone information stored as strings in the file.
The main lesson learned from the previous notebook session is this: datetime-timezone data in the form of strings must first be converted to pandas Timestamps and then localized to the same timezone. And the way to do this is to use the function pandas.Timestamp(string_value).to_numpy(). I've since learned that you could alternatively use pandas.to_datetime(string_value).to_numpy().
But is there a way to do all this while importing the data from the file into the pandas DataFrame? Fortunately, the answer is, "Yes!"
def my_parser(dt):
    return pd.Timestamp(dt).to_numpy()

customer_df_parsed_from_csv = \
    pd.read_csv(
        'customers_df_timestamp_is_a_true_pandas_timestamp.csv',
        parse_dates=['timestamp'],
        date_parser=my_parser
    )
Let's display the contents of our new DataFrame:
customer_df_parsed_from_csv
|   | Unnamed: 0 | Customer ID | Customer Name | Address | Tel | Description | weight | timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | BB016741P | Mega Enterprises Corporation | Address0 | Tel0 | Description0 | 0.2 | 2014-12-30 12:55:00 |
| 1 | 1 | CC082744L | Hyper Startup Incorporated | NaN | Tel1 | NaN | 0.5 | 2017-01-02 01:23:15 |
| 2 | 2 | AA098762D | Hyper Startup Inc. | Address2 | Tel2 | Description2 | 0.3 | 2020-10-20 13:29:30 |
| 3 | 3 | BB099931J | Hyper-Startup Inc. | Address3 | Tel3 | Description3 | 0.1 | 2013-07-01 08:34:45 |
| 4 | 4 | HH072982K | Hyper Hyper Inc. | Address4 | NaN | Description4 | 0.9 | 2005-09-11 18:56:00 |
| 5 | 5 | EE059082Q | Mega Enterprises Corp. | Address5 | Tel5 | Description5 | 1.0 | 1998-04-14 09:21:11 |
Et voilà! Importation, conversion and localization have all taken place in a single step.
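(Side note: newer pandas versions deprecate the date_parser argument, if I remember correctly. In that case a two-step equivalent should give the same end result --- read first, then parse and localize the column, e.g. with the vectorized call from the previous session:)

```python
df = pd.read_csv('customers_df_timestamp_is_a_true_pandas_timestamp.csv')
# Parse after reading: UTC-aware Timestamps first, then drop the timezone info.
df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True).dt.tz_localize(None)
```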
Once again, let's make use of the utility function list_datatypes_of to examine the datatypes of each column of the DataFrame:
def list_datatypes_of(dataframe):
    for col in range(len(dataframe.columns)):
        print(f"DataType of {dataframe.columns[col]}: ", type(dataframe.iloc[:, col][1]))
list_datatypes_of(customer_df_parsed_from_csv)
DataType of Unnamed: 0: <class 'numpy.int64'>
DataType of Customer ID: <class 'str'>
DataType of Customer Name: <class 'str'>
DataType of Address: <class 'float'>
DataType of Tel: <class 'str'>
DataType of Description: <class 'float'>
DataType of weight: <class 'numpy.float64'>
DataType of timestamp: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
customer_df_parsed_from_csv.dtypes
Unnamed: 0 int64
Customer ID object
Customer Name object
Address object
Tel object
Description object
weight float64
timestamp datetime64[ns]
dtype: object
group_similar_strings will therefore work with it nicely:
customer_df_parsed_from_csv[['group rep ID', 'group rep']] = \
group_similar_strings(
customer_df_parsed_from_csv['Customer Name'],
customer_df_parsed_from_csv['Customer ID'],
timestamps=customer_df_parsed_from_csv['timestamp'],
group_rep='oldest',
min_similarity=0.6)
So it gives the following result:
customer_df_parsed_from_csv
|   | Unnamed: 0 | Customer ID | Customer Name | Address | Tel | Description | weight | timestamp | group rep ID | group rep |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | BB016741P | Mega Enterprises Corporation | Address0 | Tel0 | Description0 | 0.2 | 2014-12-30 12:55:00 | EE059082Q | Mega Enterprises Corp. |
| 1 | 1 | CC082744L | Hyper Startup Incorporated | NaN | Tel1 | NaN | 0.5 | 2017-01-02 01:23:15 | BB099931J | Hyper-Startup Inc. |
| 2 | 2 | AA098762D | Hyper Startup Inc. | Address2 | Tel2 | Description2 | 0.3 | 2020-10-20 13:29:30 | BB099931J | Hyper-Startup Inc. |
| 3 | 3 | BB099931J | Hyper-Startup Inc. | Address3 | Tel3 | Description3 | 0.1 | 2013-07-01 08:34:45 | BB099931J | Hyper-Startup Inc. |
| 4 | 4 | HH072982K | Hyper Hyper Inc. | Address4 | NaN | Description4 | 0.9 | 2005-09-11 18:56:00 | HH072982K | Hyper Hyper Inc. |
| 5 | 5 | EE059082Q | Mega Enterprises Corp. | Address5 | Tel5 | Description5 | 1.0 | 1998-04-14 09:21:11 | EE059082Q | Mega Enterprises Corp. |
The same procedure can be used even if the strings in the CSV data file have date-only information. pandas merely treats them as pandas Timestamps with the time part set to '00:00:00'. Try it just to be sure!
Hi @Bergvca, @justasojourner,
I've amended the timestamps argument of group_similar_strings to enable it to parse strings that are date- or datetime-like.
So now there is no need to go through the hassle of converting date/datetime-like strings or localizing timezone-aware datetimes yourself. You can simply input strings. However, the strings should represent valid dates or datetimes, with or without timezone information.
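So, for example, the call that failed earlier in this thread should now work directly on the string column, without any manual conversion (a sketch of the intended usage):

```python
customers_df_timestamp_is_a_string[['group rep ID', 'group rep']] = \
    group_similar_strings(
        customers_df_timestamp_is_a_string['Customer Name'],
        customers_df_timestamp_is_a_string['Customer ID'],
        timestamps=customers_df_timestamp_is_a_string['timestamp'],  # plain date/datetime-like strings
        group_rep='oldest',
        min_similarity=0.6)
```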
Cheers!
Hi @ParticularMiner - I think the code itself looks good; however, I feel like it is getting too complex and large.
In the end, the timestamp, cleanest and weight solutions are essentially the same thing (unless I'm misunderstanding); we are just providing the code to calculate the weights. Even the weight solution could be removed by simply ordering the input data by the weights and making the "first" string option explicit. I don't think code that parses timestamps, for example, belongs in the StringGrouper class. Neither does the code that checks which row in a dataframe is the cleanest.
I don't want to see the code wasted though; maybe it can be put into a separate file with utility functions?
So:
What do you think?
Hi @Bergvca,
Thanks.
... maybe it can be put into a separate file with utility functions?
Sure, I could do this. I see your point. Perhaps you could give me some guidance on how to do that. Would this new file be a new module that can be imported together with string_grouper?
Hi @Bergvca
I think I now fully understand your proposal to restructure/modularize the code.
You might have already noticed from the current code that I avoid sorting the input data throughout the entire deduplication and group-representative selection process. So before implementing your proposal, let me first confer with you about the following:
While at its very core the process of selecting a group representative is a mere weighting or sorting process, there are a few pros and cons to consider when choosing one over the other:
Weighting preserves the original order of the input data. This could be an attractive feature to those users who may not want the order of their data to be disturbed, and helps them avoid the hassle of taking measures to somehow stash the preferred order of the data before using String Grouper and then restoring it afterwards.
Performance: sorting has a complexity that scales at least as O(N log N) where N is the size of the data, while finding a maximum weight within a group has a complexity of at least O(G) where G is the size of the group. But because the 'group by' algorithm itself is also O(N log N), the sorting method wins by performance.
As you mentioned in an earlier message (and I believe you were right), the current ability of pandas 'group by' to preserve the original order of group members is more of a side-effect than an essential aspect of it. Since we cannot guarantee that it will not be deprecated in the future, perhaps it would be more prudent to choose functionality that does not depend critically on this side-effect, namely, weighting.
How about this:
A separate module string_grouper_utils.py containing three functions:
new_group_rep_by_timestamp
new_group_rep_by_completeness
new_group_rep_by_weight
that each have two required arguments:
grouped_data: the input DataFrame already grouped (by String Grouper), and thus already having a 'group' column.
group_col: the name (of str type) or index (of int type) of the column in grouped_data having the groups.
Meanwhile, our original module string_grouper.py retains only the two group_rep options: 'first' and 'centroid' (with 'centroid' as the default) and discards the other three options.
Thus any user that is not satisfied with String Grouper's choice of group representatives may then do a standard import:
from string_grouper_utils import new_group_rep_by_timestamp, new_group_rep_by_completeness, new_group_rep_by_weight
and change the group representatives as desired.
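For concreteness, here is a rough sketch of how one of these utility functions might look (hypothetical signature and logic --- including an extra weight_col argument --- just to make sure we mean the same thing):

```python
import pandas as pd

def new_group_rep_by_weight(grouped_data: pd.DataFrame, group_col, weight_col):
    """Within each group, pick the row with the largest weight as the new group representative."""
    # group_col / weight_col may be a column name (str) or a positional index (int):
    groups = grouped_data[group_col] if isinstance(group_col, str) else grouped_data.iloc[:, group_col]
    weights = grouped_data[weight_col] if isinstance(weight_col, str) else grouped_data.iloc[:, weight_col]

    # For each group, find the (row-index) label of its heaviest member...
    rep_index = weights.groupby(groups).idxmax()

    # ...and broadcast that choice back onto every member of the group:
    return groups.map(rep_index)
```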
Does that fully capture your proposal?
Hi @Bergvca, @justasojourner,
As discussed, I've extended and unit-tested the group-representative selection functionality with five options for the new keyword argument group_rep: 'first', 'oldest', 'cleanest', 'weight-based' and 'centroid'. Please use the following (hopefully self-explanatory) code snippets to test the added functionality after merging this branch, and let me know what you think: