Bergvca / string_grouper

Super Fast String Matching in Python
MIT License

Group representative functionality added #33

Closed · ParticularMiner closed 3 years ago

ParticularMiner commented 3 years ago

Hi @Bergvca, @justasojourner ,

As discussed, I've extended and unit-tested group-representative selection functionality with five options for the new keyword argument group_rep: 'first', 'oldest', 'cleanest', 'weight-based' and 'centroid'.

Please use the following (hopefully self-explanatory) code snippets to test the added functionality after merging this branch, and let me know what you think:

import pandas as pd
from string_grouper import group_similar_strings
customers_df = pd.DataFrame(
   [
      ('BB016741P', 'Mega Enterprises Corporation', 'Address0', 'Tel0', 'Description0', 0.2),
      ('CC082744L', 'Hyper Startup Incorporated', '', 'Tel1', '', 0.5),
      ('AA098762D', 'Hyper Startup Inc.', 'Address2', 'Tel2', 'Description2', 0.3),
      ('BB099931J', 'Hyper-Startup Inc.', 'Address3', 'Tel3', 'Description3', 0.1),
      ('HH072982K', 'Hyper Hyper Inc.', 'Address4', '', 'Description4', 0.9),
      ('EE059082Q', 'Mega Enterprises Corp.', 'Address5', 'Tel5', 'Description5', 1.0)
   ],
   columns=('Customer ID', 'Customer Name', 'Address', 'Tel', 'Description', 'weight')
)
now = pd.Timestamp.now()
customers_df['timestamp'] = \
    pd.Series(
        pd.date_range(
            start=now - pd.Timedelta(len(customers_df) - 1, unit='d'), 
            end=now, 
            periods=len(customers_df)
        )
    ).iloc[::-1].reset_index(drop=True)
customers_df
Customer ID Customer Name Address Tel Description weight timestamp
0 BB016741P Mega Enterprises Corporation Address0 Tel0 Description0 0.2 2021-03-02 23:02:45.884978
1 CC082744L Hyper Startup Incorporated Tel1 0.5 2021-03-01 23:02:45.884978
2 AA098762D Hyper Startup Inc. Address2 Tel2 Description2 0.3 2021-02-28 23:02:45.884978
3 BB099931J Hyper-Startup Inc. Address3 Tel3 Description3 0.1 2021-02-27 23:02:45.884978
4 HH072982K Hyper Hyper Inc. Address4 Description4 0.9 2021-02-26 23:02:45.884978
5 EE059082Q Mega Enterprises Corp. Address5 Tel5 Description5 1.0 2021-02-25 23:02:45.884978

customers_df[['group rep ID', 'group rep']] = \
    group_similar_strings(
        customers_df['Customer Name'], 
        customers_df['Customer ID'], 
        min_similarity=0.6)
customers_df
Customer ID Customer Name Address Tel Description weight timestamp group rep ID group rep
0 BB016741P Mega Enterprises Corporation Address0 Tel0 Description0 0.2 2021-03-02 23:02:45.884978 BB016741P Mega Enterprises Corporation
1 CC082744L Hyper Startup Incorporated Tel1 0.5 2021-03-01 23:02:45.884978 CC082744L Hyper Startup Incorporated
2 AA098762D Hyper Startup Inc. Address2 Tel2 Description2 0.3 2021-02-28 23:02:45.884978 CC082744L Hyper Startup Incorporated
3 BB099931J Hyper-Startup Inc. Address3 Tel3 Description3 0.1 2021-02-27 23:02:45.884978 CC082744L Hyper Startup Incorporated
4 HH072982K Hyper Hyper Inc. Address4 Description4 0.9 2021-02-26 23:02:45.884978 HH072982K Hyper Hyper Inc.
5 EE059082Q Mega Enterprises Corp. Address5 Tel5 Description5 1.0 2021-02-25 23:02:45.884978 BB016741P Mega Enterprises Corporation
customers_df[['group rep ID', 'group rep']] = \
    group_similar_strings(
        customers_df['Customer Name'], 
        customers_df['Customer ID'], 
        group_rep='centroid', 
        min_similarity=0.6)
customers_df
Customer ID Customer Name Address Tel Description weight timestamp group rep ID group rep
0 BB016741P Mega Enterprises Corporation Address0 Tel0 Description0 0.2 2021-03-02 23:02:45.884978 EE059082Q Mega Enterprises Corp.
1 CC082744L Hyper Startup Incorporated Tel1 0.5 2021-03-01 23:02:45.884978 AA098762D Hyper Startup Inc.
2 AA098762D Hyper Startup Inc. Address2 Tel2 Description2 0.3 2021-02-28 23:02:45.884978 AA098762D Hyper Startup Inc.
3 BB099931J Hyper-Startup Inc. Address3 Tel3 Description3 0.1 2021-02-27 23:02:45.884978 AA098762D Hyper Startup Inc.
4 HH072982K Hyper Hyper Inc. Address4 Description4 0.9 2021-02-26 23:02:45.884978 HH072982K Hyper Hyper Inc.
5 EE059082Q Mega Enterprises Corp. Address5 Tel5 Description5 1.0 2021-02-25 23:02:45.884978 EE059082Q Mega Enterprises Corp.
customers_df[['group rep ID', 'group rep']] = \
    group_similar_strings(
        customers_df['Customer Name'], 
        customers_df['Customer ID'], 
        weights=customers_df['weight'], 
        group_rep='weight-based', 
        min_similarity=0.6)
customers_df
Customer ID Customer Name Address Tel Description weight timestamp group rep ID group rep
0 BB016741P Mega Enterprises Corporation Address0 Tel0 Description0 0.2 2021-03-02 23:02:45.884978 EE059082Q Mega Enterprises Corp.
1 CC082744L Hyper Startup Incorporated Tel1 0.5 2021-03-01 23:02:45.884978 CC082744L Hyper Startup Incorporated
2 AA098762D Hyper Startup Inc. Address2 Tel2 Description2 0.3 2021-02-28 23:02:45.884978 CC082744L Hyper Startup Incorporated
3 BB099931J Hyper-Startup Inc. Address3 Tel3 Description3 0.1 2021-02-27 23:02:45.884978 CC082744L Hyper Startup Incorporated
4 HH072982K Hyper Hyper Inc. Address4 Description4 0.9 2021-02-26 23:02:45.884978 HH072982K Hyper Hyper Inc.
5 EE059082Q Mega Enterprises Corp. Address5 Tel5 Description5 1.0 2021-02-25 23:02:45.884978 EE059082Q Mega Enterprises Corp.
customers_df[['group rep ID', 'group rep']] = \
    group_similar_strings(
        customers_df['Customer Name'], 
        customers_df['Customer ID'], 
        other_fields=customers_df, 
        group_rep='cleanest', 
        min_similarity=0.6)
customers_df
Customer ID Customer Name Address Tel Description weight timestamp group rep ID group rep
0 BB016741P Mega Enterprises Corporation Address0 Tel0 Description0 0.2 2021-03-02 23:02:45.884978 BB016741P Mega Enterprises Corporation
1 CC082744L Hyper Startup Incorporated Tel1 0.5 2021-03-01 23:02:45.884978 AA098762D Hyper Startup Inc.
2 AA098762D Hyper Startup Inc. Address2 Tel2 Description2 0.3 2021-02-28 23:02:45.884978 AA098762D Hyper Startup Inc.
3 BB099931J Hyper-Startup Inc. Address3 Tel3 Description3 0.1 2021-02-27 23:02:45.884978 AA098762D Hyper Startup Inc.
4 HH072982K Hyper Hyper Inc. Address4 Description4 0.9 2021-02-26 23:02:45.884978 HH072982K Hyper Hyper Inc.
5 EE059082Q Mega Enterprises Corp. Address5 Tel5 Description5 1.0 2021-02-25 23:02:45.884978 BB016741P Mega Enterprises Corporation
customers_df[['group rep ID', 'group rep']] = \
    group_similar_strings(
        customers_df['Customer Name'], 
        customers_df['Customer ID'], 
        timestamps=customers_df['timestamp'], 
        group_rep='oldest', 
        min_similarity=0.6)
customers_df
Customer ID Customer Name Address Tel Description weight timestamp group rep ID group rep
0 BB016741P Mega Enterprises Corporation Address0 Tel0 Description0 0.2 2021-03-02 23:02:45.884978 EE059082Q Mega Enterprises Corp.
1 CC082744L Hyper Startup Incorporated Tel1 0.5 2021-03-01 23:02:45.884978 BB099931J Hyper-Startup Inc.
2 AA098762D Hyper Startup Inc. Address2 Tel2 Description2 0.3 2021-02-28 23:02:45.884978 BB099931J Hyper-Startup Inc.
3 BB099931J Hyper-Startup Inc. Address3 Tel3 Description3 0.1 2021-02-27 23:02:45.884978 BB099931J Hyper-Startup Inc.
4 HH072982K Hyper Hyper Inc. Address4 Description4 0.9 2021-02-26 23:02:45.884978 HH072982K Hyper Hyper Inc.
5 EE059082Q Mega Enterprises Corp. Address5 Tel5 Description5 1.0 2021-02-25 23:02:45.884978 EE059082Q Mega Enterprises Corp.
justasojourner commented 3 years ago

I will have to take time to go through and digest this; there is a bit to read.

justasojourner commented 3 years ago

@ParticularMiner

Questions:

  1. You did some 'playing with pandas' to get the 'timestamp' column, yes? I didn't see it in the first df, 'customers_df'. I would have just picked some dates out of a hat ;-) You'll be writing a pandas book soon :-)
  2. What if timestamps are not available? What if the database only has dates?
  3. What about timezone-aware timestamps? If a timestamp is TZ-aware, then the (absolute) datetime value would have to be judged with respect to UTC time, and that will make the whole process more difficult.
  4. It might be simpler to just use dates?
  5. Where are you getting the column titles from?
  6. You didn't number the examples, so I will number them 'i.' to 'v.' from the top. So we have:
    i. Vanilla grouping?
    ii. 'Centroid' — can you please, in non-mathematical, plain English, explain this option ;-)
    iii. 'Weight-Based' — the 'weight' column data comes via a database import? I don't think this will happen in a real-world scenario. I don't have any databases with a 'weight' column. Obviously, adding such data manually is not an option, so if it is done programmatically, then how are the weight values calculated?
    iv. 'Cleanest' — based on what criteria? Is this checking rows to find those with the most filled-in fields, on the premise that a row with the most data is the most likely to be a 'good' record?
    v. Timestamp — looks promising! Please see, though, my comments in points 2-4.
ParticularMiner commented 3 years ago

Experimenting With Timestamps

String Grouper can use datetime data in a Series to determine group representatives when grouping strings.

I find the various docs on datetime datatypes somewhat confusing, so I decided to play around with the datatype myself --- mostly for my own reference. Still, here are the results of my 'experiments'.

I'm here restricting myself to the pandas implementation of datetime: pandas.Timestamp.

NB. Currently, String Grouper's group_similar_strings function is agnostic about the pandas.Timestamp datatype. The optional parameter timestamps (required when group_rep='oldest') does not require pandas.Timestamp data --- it can accept decimals, integers, numpy datetime datatypes, pandas.Timestamp data that have all been localized to the same timezone (see below), and so on. But it does not accept strings: strings should first be converted to an acceptable datatype.
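For example (a minimal sketch): pd.to_datetime with utc=True parses strings bearing different UTC offsets into one common timezone, yielding a Series that can then be passed as timestamps:

ts_strings = pd.Series(['2014-12-30 10:55:00-02:00', '1998-04-14 09:21:11+00:00'])
# utc=True normalizes the mixed offsets to a single (UTC) timezone
timestamps = pd.to_datetime(ts_strings, utc=True)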

import pandas as pd
from string_grouper import group_similar_strings

The following function list_datatypes_of is just a utility to examine the datatypes of each column of a DataFrame:

def list_datatypes_of(dataframe):
    # Print the Python type of each column's second element (the row at index 1)
    for col in dataframe.columns:
        print(f"DataType of {col}: ", type(dataframe[col].iloc[1]))

Here's some artificial data I made up off the top of my head:

customers_df_timestamp_is_a_string = pd.DataFrame(
   [
      ('BB016741P', 'Mega Enterprises Corporation', 'Address0', 'Tel0', 'Description0', 0.2, '2014-12-30 10:55:00-02:00'),
      ('CC082744L', 'Hyper Startup Incorporated', '', 'Tel1', '', 0.5, '2017-01-01 20:23:15-05:00'),
      ('AA098762D', 'Hyper Startup Inc.', 'Address2', 'Tel2', 'Description2', 0.3, '2020-10-20 15:29:30+02:00'),
      ('BB099931J', 'Hyper-Startup Inc.', 'Address3', 'Tel3', 'Description3', 0.1, '2013-07-01 03:34:45-05:00'),
      ('HH072982K', 'Hyper Hyper Inc.', 'Address4', '', 'Description4', 0.9, '2005-09-11 11:56:00-07:00'),
      ('EE059082Q', 'Mega Enterprises Corp.', 'Address5', 'Tel5', 'Description5', 1.0, '1998-04-14 09:21:11+00:00')
   ],
   columns=('Customer ID', 'Customer Name', 'Address', 'Tel', 'Description', 'weight', 'timestamp')
)

Notice that each item of the 'timestamp' column is actually a string! And this is not apparent from the display:

customers_df_timestamp_is_a_string
Customer ID Customer Name Address Tel Description weight timestamp
0 BB016741P Mega Enterprises Corporation Address0 Tel0 Description0 0.2 2014-12-30 10:55:00-02:00
1 CC082744L Hyper Startup Incorporated Tel1 0.5 2017-01-01 20:23:15-05:00
2 AA098762D Hyper Startup Inc. Address2 Tel2 Description2 0.3 2020-10-20 15:29:30+02:00
3 BB099931J Hyper-Startup Inc. Address3 Tel3 Description3 0.1 2013-07-01 03:34:45-05:00
4 HH072982K Hyper Hyper Inc. Address4 Description4 0.9 2005-09-11 11:56:00-07:00
5 EE059082Q Mega Enterprises Corp. Address5 Tel5 Description5 1.0 1998-04-14 09:21:11+00:00

... nor from the DataFrame's dtypes attribute:

customers_df_timestamp_is_a_string.dtypes
Customer ID       object
Customer Name     object
Address           object
Tel               object
Description       object
weight           float64
timestamp         object
dtype: object

But it is clear from the utility function:

list_datatypes_of(customers_df_timestamp_is_a_string)
DataType of Customer ID:  <class 'str'>
DataType of Customer Name:  <class 'str'>
DataType of Address:  <class 'str'>
DataType of Tel:  <class 'str'>
DataType of Description:  <class 'str'>
DataType of weight:  <class 'numpy.float64'>
DataType of timestamp:  <class 'str'>

How does group_similar_strings fare with such data? Not well, as expected. That's because it has no way of determining the minimum (that is, the oldest) of a group of strings. Try it yourself:

customers_df_timestamp_is_a_string[['group rep ID', 'group rep']] = \
    group_similar_strings(
        customers_df_timestamp_is_a_string['Customer Name'], 
        customers_df_timestamp_is_a_string['Customer ID'], 
        timestamps=customers_df_timestamp_is_a_string['timestamp'], 
        group_rep='oldest', 
        min_similarity=0.6)
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

...
...
TypeError: reduction operation 'argmin' not allowed for this dtype

Now I'm going to convert the 'timestamp' column into pandas.Timestamp datatype. First, I'm just curious how pandas would save such data to a CSV file:

customers_df_timestamp_is_a_string.to_csv('customers_df_timestamp_is_a_string.csv')

Here are its contents:

,Customer ID,Customer Name,Address,Tel,Description,weight,timestamp
0,BB016741P,Mega Enterprises Corporation,Address0,Tel0,Description0,0.2,2014-12-30 10:55:00-02:00
1,CC082744L,Hyper Startup Incorporated,,Tel1,,0.5,2017-01-01 20:23:15-05:00
2,AA098762D,Hyper Startup Inc.,Address2,Tel2,Description2,0.3,2020-10-20 15:29:30+02:00
3,BB099931J,Hyper-Startup Inc.,Address3,Tel3,Description3,0.1,2013-07-01 03:34:45-05:00
4,HH072982K,Hyper Hyper Inc.,Address4,,Description4,0.9,2005-09-11 11:56:00-07:00
5,EE059082Q,Mega Enterprises Corp.,Address5,Tel5,Description5,1.0,1998-04-14 09:21:11+00:00

Let's get on with converting the 'timestamp' column into pandas.Timestamp datatype:

First, copy the DataFrame into a new one:

customers_df_timestamp_is_a_true_pandas_timestamp = customers_df_timestamp_is_a_string.copy()

Now do the conversion:

customers_df_timestamp_is_a_true_pandas_timestamp['timestamp']=\
    customers_df_timestamp_is_a_true_pandas_timestamp['timestamp'].transform(lambda t: pd.Timestamp(t))

Nothing seems to have changed from the display:

customers_df_timestamp_is_a_true_pandas_timestamp
Customer ID Customer Name Address Tel Description weight timestamp
0 BB016741P Mega Enterprises Corporation Address0 Tel0 Description0 0.2 2014-12-30 10:55:00-02:00
1 CC082744L Hyper Startup Incorporated Tel1 0.5 2017-01-01 20:23:15-05:00
2 AA098762D Hyper Startup Inc. Address2 Tel2 Description2 0.3 2020-10-20 15:29:30+02:00
3 BB099931J Hyper-Startup Inc. Address3 Tel3 Description3 0.1 2013-07-01 03:34:45-05:00
4 HH072982K Hyper Hyper Inc. Address4 Description4 0.9 2005-09-11 11:56:00-07:00
5 EE059082Q Mega Enterprises Corp. Address5 Tel5 Description5 1.0 1998-04-14 09:21:11+00:00

... nor from the DataFrame's dtypes attribute:

customers_df_timestamp_is_a_true_pandas_timestamp.dtypes
Customer ID       object
Customer Name     object
Address           object
Tel               object
Description       object
weight           float64
timestamp         object
dtype: object

But the utility function has noticed the change: pandas.Timestamp is here!

list_datatypes_of(customers_df_timestamp_is_a_true_pandas_timestamp)
DataType of Customer ID:  <class 'str'>
DataType of Customer Name:  <class 'str'>
DataType of Address:  <class 'str'>
DataType of Tel:  <class 'str'>
DataType of Description:  <class 'str'>
DataType of weight:  <class 'numpy.float64'>
DataType of timestamp:  <class 'pandas._libs.tslibs.timestamps.Timestamp'>

group_similar_strings still does not work. Why? Because the timestamp data are not all from the same timezone. Try it yourself:

customers_df_timestamp_is_a_true_pandas_timestamp[['group rep ID', 'group rep']] = \
    group_similar_strings(
        customers_df_timestamp_is_a_true_pandas_timestamp['Customer Name'], 
        customers_df_timestamp_is_a_true_pandas_timestamp['Customer ID'], 
        timestamps=customers_df_timestamp_is_a_true_pandas_timestamp['timestamp'], 
        group_rep='oldest', 
        min_similarity=0.6)
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

...
...    

TypeError: reduction operation 'argmin' not allowed for this dtype
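(As an aside: assuming the goal is simply to get all values into one common timezone, a single pd.to_datetime call can also normalize the mixed-offset column in one step; a sketch:)

# Convert the object column of mixed-timezone Timestamps to one (UTC) timezone:
same_tz = pd.to_datetime(customers_df_timestamp_is_a_true_pandas_timestamp['timestamp'], utc=True)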

To solve this problem, let's try one more conversion: convert to numpy.datetime64. We follow the same procedure as above:

customers_df_timestamp_is_a_numpy_datetime = customers_df_timestamp_is_a_true_pandas_timestamp.copy()
customers_df_timestamp_is_a_numpy_datetime['timestamp']=\
    customers_df_timestamp_is_a_numpy_datetime['timestamp'].transform(lambda t: t.to_numpy())

This time, though, you see a change in the displayed times (interesting!):

customers_df_timestamp_is_a_numpy_datetime
Customer ID Customer Name Address Tel Description weight timestamp
0 BB016741P Mega Enterprises Corporation Address0 Tel0 Description0 0.2 2014-12-30 12:55:00
1 CC082744L Hyper Startup Incorporated Tel1 0.5 2017-01-02 01:23:15
2 AA098762D Hyper Startup Inc. Address2 Tel2 Description2 0.3 2020-10-20 13:29:30
3 BB099931J Hyper-Startup Inc. Address3 Tel3 Description3 0.1 2013-07-01 08:34:45
4 HH072982K Hyper Hyper Inc. Address4 Description4 0.9 2005-09-11 18:56:00
5 EE059082Q Mega Enterprises Corp. Address5 Tel5 Description5 1.0 1998-04-14 09:21:11

The DataFrame's dtypes attribute notices!

customers_df_timestamp_is_a_numpy_datetime.dtypes
Customer ID              object
Customer Name            object
Address                  object
Tel                      object
Description              object
weight                  float64
timestamp        datetime64[ns]
dtype: object

Strangely, the utility function does not register a change in datatype (interesting!):

list_datatypes_of(customers_df_timestamp_is_a_numpy_datetime)
DataType of Customer ID:  <class 'str'>
DataType of Customer Name:  <class 'str'>
DataType of Address:  <class 'str'>
DataType of Tel:  <class 'str'>
DataType of Description:  <class 'str'>
DataType of weight:  <class 'numpy.float64'>
DataType of timestamp:  <class 'pandas._libs.tslibs.timestamps.Timestamp'>
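This is expected behaviour, though: pandas stores the column as a datetime64[ns] array but "boxes" each value back into a pandas.Timestamp whenever a single element is accessed, which is exactly what list_datatypes_of does. A quick way to see both views:

col = customers_df_timestamp_is_a_numpy_datetime['timestamp']
print(col.dtype)          # datetime64[ns] --- the underlying array dtype
print(type(col.iloc[0]))  # pandas Timestamp --- the boxed scalar on element access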

But ... YAY! group_similar_strings works:

customers_df_timestamp_is_a_numpy_datetime[['group rep ID', 'group rep']] = \
    group_similar_strings(
        customers_df_timestamp_is_a_numpy_datetime['Customer Name'], 
        customers_df_timestamp_is_a_numpy_datetime['Customer ID'], 
        timestamps=customers_df_timestamp_is_a_numpy_datetime['timestamp'], 
        group_rep='oldest', 
        min_similarity=0.6)

... and here are the results:

customers_df_timestamp_is_a_numpy_datetime
Customer ID Customer Name Address Tel Description weight timestamp group rep ID group rep
0 BB016741P Mega Enterprises Corporation Address0 Tel0 Description0 0.2 2014-12-30 12:55:00 EE059082Q Mega Enterprises Corp.
1 CC082744L Hyper Startup Incorporated Tel1 0.5 2017-01-02 01:23:15 BB099931J Hyper-Startup Inc.
2 AA098762D Hyper Startup Inc. Address2 Tel2 Description2 0.3 2020-10-20 13:29:30 BB099931J Hyper-Startup Inc.
3 BB099931J Hyper-Startup Inc. Address3 Tel3 Description3 0.1 2013-07-01 08:34:45 BB099931J Hyper-Startup Inc.
4 HH072982K Hyper Hyper Inc. Address4 Description4 0.9 2005-09-11 18:56:00 HH072982K Hyper Hyper Inc.
5 EE059082Q Mega Enterprises Corp. Address5 Tel5 Description5 1.0 1998-04-14 09:21:11 EE059082Q Mega Enterprises Corp.

I'm curious to see how things changed from stage to stage. So I summarize the conversion results here using the 'timestamp' column data from all the stages of conversion:

pd.DataFrame(
    {
        'string datatype': customers_df_timestamp_is_a_string['timestamp'], 
        'pandas.Timestamp datatype (time zone)': customers_df_timestamp_is_a_true_pandas_timestamp['timestamp'], 
        'pandas.Timestamp datatype (localized)': customers_df_timestamp_is_a_numpy_datetime['timestamp']
    }
)
string datatype pandas.Timestamp datatype (time zone) pandas.Timestamp datatype (localized)
0 2014-12-30 10:55:00-02:00 2014-12-30 10:55:00-02:00 2014-12-30 12:55:00
1 2017-01-01 20:23:15-05:00 2017-01-01 20:23:15-05:00 2017-01-02 01:23:15
2 2020-10-20 15:29:30+02:00 2020-10-20 15:29:30+02:00 2020-10-20 13:29:30
3 2013-07-01 03:34:45-05:00 2013-07-01 03:34:45-05:00 2013-07-01 08:34:45
4 2005-09-11 11:56:00-07:00 2005-09-11 11:56:00-07:00 2005-09-11 18:56:00
5 1998-04-14 09:21:11+00:00 1998-04-14 09:21:11+00:00 1998-04-14 09:21:11

The last conversion is also a "localization": it normalizes the pandas Timestamp data to a single timezone (UTC) and then drops the timezone information. That's how String Grouper likes it. (Examine the values to assure yourself that the localization is indeed correct.)
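A quick sanity check (a sketch) that the localized values really are the original instants expressed in UTC:

utc = pd.to_datetime(customers_df_timestamp_is_a_string['timestamp'], utc=True)
# Drop the timezone info and compare with the localized column:
print((utc.dt.tz_localize(None) == customers_df_timestamp_is_a_numpy_datetime['timestamp']).all())  # True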

ParticularMiner commented 3 years ago

So to answer your questions:

What if timestamps are not available? What if the database only has dates?

In that case, still use pandas Timestamps. (pandas simply sets their time portions to 00:00:00.)
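For instance:

pd.Timestamp('2020-10-20')   # Timestamp('2020-10-20 00:00:00')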

What about timezone-aware timestamps? If a timestamp is TZ-aware, then the (absolute) datetime value would have to be judged with respect to UTC time, and that will make the whole process more difficult.

The user need only use pandas Timestamp facilities to localize/convert all the different timezone data into a single timezone. to_numpy() is sufficient, as shown in my previous message.

It might be simpler to just use dates?

I don't think so. Dates are simply datetimes or Timestamps with the time part ignored (or set to 00:00:00). So datetimes can be used as dates. In fact, I do not know of any date-only datatype. Do you?

Where are you getting the column titles from?

I specify the column titles upon creation of the DataFrame customers_df. The additional columns (like 'group rep ID' and 'group rep') are obtained through assignment.

Vanilla grouping?

You are referring to the default? This is the same as setting the option group_rep='first': String Grouper uses the first string in each group as its representative. It is therefore based on the order in which the strings appear.

'Centroid' — can you please, in non-mathematical, plain English, explain this option ;-)

I'll try ... so each string has a number of duplicates to which it has been matched, and each match has a similarity rating. Adding up the similarity ratings over all the matches for a particular string gives you the weight of that string. The string in the group with the highest weight is the centroid of the group, and is then chosen as the representative.
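A toy sketch of the idea (illustrative numbers only, not String Grouper's internal code):

import pandas as pd

# Each row is a pair of matched strings with its similarity rating:
matches = pd.DataFrame({
    'left':  ['Hyper Startup Inc.', 'Hyper Startup Inc.', 'Hyper-Startup Inc.'],
    'right': ['Hyper-Startup Inc.', 'Hyper Startup Incorporated', 'Hyper Startup Incorporated'],
    'similarity': [0.9, 0.7, 0.6]
})
# A string's weight is the sum of the similarities of all its matches:
weights = pd.concat([
    matches[['left', 'similarity']].rename(columns={'left': 'string'}),
    matches[['right', 'similarity']].rename(columns={'right': 'string'})
]).groupby('string')['similarity'].sum()
# The heaviest string is the group's centroid, hence its representative:
weights.idxmax()   # 'Hyper Startup Inc.' (consistent with the 'centroid' example above)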

'Weight-Based' — the 'weight' column data comes via a database import? I don't think this will happen in a real-world scenario. I don't have any databases with a 'weight' column. Obviously, adding such data manually is not an option, so if it is done programmatically, then how are the weight values calculated?

You're quite right. In these examples, I just chose the easiest way to manufacture a weight Series: extracting it from a column in the database. But that was only my choice. You can input any Series there, as long as it has the same number of rows as the strings to group.
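For example, weights could be computed programmatically; here's one hypothetical recipe (using the customers_df from the examples above) that weighs each record by how many of its fields are non-empty:

# Hypothetical: more complete records get larger weights
weights = (customers_df[['Address', 'Tel', 'Description']] != '').sum(axis=1)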

'Cleanest' — based on what criteria? Is this checking rows to find those with the most filled-in fields, on the premise that a row with the most data is the most likely to be a 'good' record?

Yes, it is exactly as you say.

ParticularMiner commented 3 years ago

Importing Timestamps or Dates from CSV files

import pandas as pd
from string_grouper import group_similar_strings

So I've got a CSV file "customers_df_timestamp_is_a_true_pandas_timestamp.csv" containing, among other things, timestamps in a string format. I will use the following utility function to display its contents:

def display_file_contents(file):
    with open(file, 'r') as f:
        contents = f.read()
    print(contents)
display_file_contents("customers_df_timestamp_is_a_true_pandas_timestamp.csv")
,Customer ID,Customer Name,Address,Tel,Description,weight,timestamp
0,BB016741P,Mega Enterprises Corporation,Address0,Tel0,Description0,0.2,2014-12-30 10:55:00-02:00
1,CC082744L,Hyper Startup Incorporated,,Tel1,,0.5,2017-01-01 20:23:15-05:00
2,AA098762D,Hyper Startup Inc.,Address2,Tel2,Description2,0.3,2020-10-20 15:29:30+02:00
3,BB099931J,Hyper-Startup Inc.,Address3,Tel3,Description3,0.1,2013-07-01 03:34:45-05:00
4,HH072982K,Hyper Hyper Inc.,Address4,,Description4,0.9,2005-09-11 11:56:00-07:00
5,EE059082Q,Mega Enterprises Corp.,Address5,Tel5,Description5,1.0,1998-04-14 09:21:11+00:00

As you can see, the 'timestamp' column has datetime + timezone information stored as strings in the file.

The main lesson learned from the previous notebook session is this: datetime-timezone data in the form of strings must first be converted to pandas Timestamps and then localized to the same timezone. One way to do this is to use pandas.Timestamp(string_value).to_numpy(). I've since learned that you could alternatively use pandas.to_datetime(string_value).to_numpy().
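In other words, either of these one-liners (a sketch) turns an offset-bearing string into a timezone-naive numpy.datetime64 normalized to UTC:

pd.Timestamp('2014-12-30 10:55:00-02:00').to_numpy()    # numpy.datetime64('2014-12-30T12:55:00.000000000')
pd.to_datetime('2014-12-30 10:55:00-02:00').to_numpy()  # the same value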

But is there a way to do all this while importing the data from the file into the pandas DataFrame? Fortunately, the answer is, "Yes!"

def my_parser(dt):
    # Parse each string and normalize it to a timezone-naive (UTC) numpy.datetime64
    return pd.Timestamp(dt).to_numpy()

customer_df_parsed_from_csv = \
    pd.read_csv(
        'customers_df_timestamp_is_a_true_pandas_timestamp.csv', 
        parse_dates=['timestamp'],   # apply the date parser to this column
        date_parser=my_parser
    )

Let's display the contents of our new DataFrame:

customer_df_parsed_from_csv
Unnamed: 0 Customer ID Customer Name Address Tel Description weight timestamp
0 0 BB016741P Mega Enterprises Corporation Address0 Tel0 Description0 0.2 2014-12-30 12:55:00
1 1 CC082744L Hyper Startup Incorporated NaN Tel1 NaN 0.5 2017-01-02 01:23:15
2 2 AA098762D Hyper Startup Inc. Address2 Tel2 Description2 0.3 2020-10-20 13:29:30
3 3 BB099931J Hyper-Startup Inc. Address3 Tel3 Description3 0.1 2013-07-01 08:34:45
4 4 HH072982K Hyper Hyper Inc. Address4 NaN Description4 0.9 2005-09-11 18:56:00
5 5 EE059082Q Mega Enterprises Corp. Address5 Tel5 Description5 1.0 1998-04-14 09:21:11

Et Voilà! Importation, conversion and localization have all taken place in a single step.

Once again, let's make use of the utility function list_datatypes_of to examine the datatypes of each column of the DataFrame:

def list_datatypes_of(dataframe):
    # Print the Python type of each column's second element (the row at index 1)
    for col in dataframe.columns:
        print(f"DataType of {col}: ", type(dataframe[col].iloc[1]))
list_datatypes_of(customer_df_parsed_from_csv)
DataType of Unnamed: 0:  <class 'numpy.int64'>
DataType of Customer ID:  <class 'str'>
DataType of Customer Name:  <class 'str'>
DataType of Address:  <class 'float'>
DataType of Tel:  <class 'str'>
DataType of Description:  <class 'float'>
DataType of weight:  <class 'numpy.float64'>
DataType of timestamp:  <class 'pandas._libs.tslibs.timestamps.Timestamp'>
customer_df_parsed_from_csv.dtypes
Unnamed: 0                int64
Customer ID              object
Customer Name            object
Address                  object
Tel                      object
Description              object
weight                  float64
timestamp        datetime64[ns]
dtype: object

group_similar_strings will therefore work with it nicely:

customer_df_parsed_from_csv[['group rep ID', 'group rep']] = \
    group_similar_strings(
        customer_df_parsed_from_csv['Customer Name'], 
        customer_df_parsed_from_csv['Customer ID'], 
        timestamps=customer_df_parsed_from_csv['timestamp'], 
        group_rep='oldest', 
        min_similarity=0.6)

So it gives the following result:

customer_df_parsed_from_csv
Unnamed: 0 Customer ID Customer Name Address Tel Description weight timestamp group rep ID group rep
0 0 BB016741P Mega Enterprises Corporation Address0 Tel0 Description0 0.2 2014-12-30 12:55:00 EE059082Q Mega Enterprises Corp.
1 1 CC082744L Hyper Startup Incorporated NaN Tel1 NaN 0.5 2017-01-02 01:23:15 BB099931J Hyper-Startup Inc.
2 2 AA098762D Hyper Startup Inc. Address2 Tel2 Description2 0.3 2020-10-20 13:29:30 BB099931J Hyper-Startup Inc.
3 3 BB099931J Hyper-Startup Inc. Address3 Tel3 Description3 0.1 2013-07-01 08:34:45 BB099931J Hyper-Startup Inc.
4 4 HH072982K Hyper Hyper Inc. Address4 NaN Description4 0.9 2005-09-11 18:56:00 HH072982K Hyper Hyper Inc.
5 5 EE059082Q Mega Enterprises Corp. Address5 Tel5 Description5 1.0 1998-04-14 09:21:11 EE059082Q Mega Enterprises Corp.

The same procedure can be used even if the strings in the CSV data file have date-only information. pandas merely treats them as pandas Timestamps with time part '00:00:00'. Try it just to be sure!
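Here's a quick check (a sketch) with an in-memory CSV:

import io

# Date-only strings parse straight to midnight Timestamps:
df = pd.read_csv(io.StringIO('d\n2020-10-20\n2013-07-01\n'), parse_dates=['d'])
print(df['d'].iloc[0])   # 2020-10-20 00:00:00
print(df.dtypes['d'])    # datetime64[ns]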

ParticularMiner commented 3 years ago

Hi @Bergvca , @justasojourner ,

I've amended the timestamps argument of group_similar_strings to enable it to parse strings that are date- or datetime-like.

So now there is no need to go through the hassle of converting date/datetime-like strings or localizing timezone-aware datetimes yourself. You can simply input strings. However, the strings should represent valid dates or datetimes, with or without timezone information.
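Assuming this amendment, the call that failed earlier should now work directly on the raw string column; a sketch:

customers_df_timestamp_is_a_string[['group rep ID', 'group rep']] = \
    group_similar_strings(
        customers_df_timestamp_is_a_string['Customer Name'], 
        customers_df_timestamp_is_a_string['Customer ID'], 
        timestamps=customers_df_timestamp_is_a_string['timestamp'],  # strings now accepted
        group_rep='oldest', 
        min_similarity=0.6)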

Cheers!

Bergvca commented 3 years ago

Hi @ParticularMiner - I think the code itself looks good; however, I feel it is getting too complex and large.

In the end, the timestamp, cleanest and weight solutions are essentially the same thing (unless I'm misunderstanding); only we are providing the code to calculate the weights. Even the weight solution could be removed by just ordering the input data by the weights and making the "first" string option explicit (see the sketch below). I don't think code that parses timestamps, for example, belongs in the StringGrouper class. Neither does the code that checks which row in a DataFrame is the cleanest.
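In code, that reordering idea might look something like this (a sketch only, reusing the customers_df from the first comment):

# Sort by weight (descending), group with group_rep='first',
# then restore the original row order afterwards.
by_weight = customers_df.sort_values('weight', ascending=False).reset_index()
by_weight[['group rep ID', 'group rep']] = \
    group_similar_strings(
        by_weight['Customer Name'], 
        by_weight['Customer ID'], 
        group_rep='first', 
        min_similarity=0.6)
restored = by_weight.set_index('index').sort_index()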

I don't want to see the code wasted though; maybe it can be put into a separate file with utility functions?

So: what do you think?

ParticularMiner commented 3 years ago

Hi @Bergvca,

Thanks.

... maybe it can be put into a separate file with utility functions?

Sure, I could do this. I see your point. Perhaps you could give me some guidance on how to do that. Would this new file be a new module that can be imported together with string_grouper?

ParticularMiner commented 3 years ago

Hi @Bergvca

I think I now fully understand your proposal to restructure/modularize the code.

You might have already noticed from the current code that I avoid sorting the input data throughout the entire deduplication and group-representative selection process. So, before implementing your proposal, let me first confer with you about the following:

While at its very core the process of selecting a group representative is a mere weighting or sorting process, there are a few pros and cons to consider when choosing one over the other:

  1. Weighting preserves the original order of the input data. This could be an attractive feature for users who do not want the order of their data disturbed, and it spares them the hassle of stashing the preferred order of their data before using String Grouper and restoring it afterwards.

  2. Performance: sorting has a complexity that scales at least as O(N log N), where N is the size of the data, while finding a maximum weight within a group has a complexity of only O(G), where G is the size of the group. But because the 'group by' algorithm that both approaches rely on is itself O(N log N), both end up O(N log N) overall, so neither method clearly wins on performance (see the sketch after this list).

  3. As you mentioned in an earlier message (and I believe you were right), the current ability of pandas 'group by' to preserve the original order of group members is more of a side-effect than an essential aspect of it. Since we cannot guarantee that it will not be deprecated in the future, perhaps it would be more prudent to choose functionality that does not depend critically on this side-effect, namely, weighting.
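Here are the two approaches in miniature (a sketch with toy data):

# Both approaches need the O(N log N) 'group by'; they differ in what follows it.
df = pd.DataFrame({
    'group': [0, 0, 1, 1],
    'ts': pd.to_datetime(['2014-01-02', '2014-01-01', '2013-05-05', '2013-05-04'])
})
# Weighting-style: per group, find the index of the minimum timestamp (O(G) per group);
# the original row order of df is never disturbed.
oldest_idx = df.groupby('group')['ts'].idxmin()
# Sorting-style: sort everything first (an extra O(N log N)), then take each group's first row.
oldest_rows = df.sort_values('ts').groupby('group').head(1)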

ParticularMiner commented 3 years ago

How about this:

a separate module string_grouper_utils.py containing three functions:

  1. new_group_rep_by_timestamp
  2. new_group_rep_by_completeness
  3. new_group_rep_by_weight

that each have two required arguments:

  1. grouped_data: the input DataFrame already grouped (by String Grouper), and thus already has a 'group' column.
  2. group_col: the name (of str type) or index (of int type) of the column in grouped_data having the groups.

Meanwhile, our original module string_grouper.py retains only the two group_rep options: 'first' and 'centroid' (with 'centroid' as the default) and discards the other three options.

Thus any user who is not satisfied with String Grouper's choice of group representatives may then do a standard import:

from string_grouper_utils import new_group_rep_by_timestamp, new_group_rep_by_completeness, new_group_rep_by_weight

and change the group representatives as desired.

Does that fully capture your proposal?