Lyonk71 / pandas-dedupe

Simplifies use of the Dedupe library via Pandas
135 stars 30 forks source link

Deduping removes '@' from email records. #22

Closed soliverc closed 3 years ago

soliverc commented 5 years ago

Customer records now have undeliverable addresses. Is there any way to avoid this?

Lyonk71 commented 5 years ago

Yes. Instead of installing from PYPI, download the package directly from GitHub. In the download, navigate to pandas_dedupe>utility_functions.py.

Change the following function so that df[i] = df[i].str.replace('[^\w\s.-()\,\:\/\\]','') also does not remove @ symbol:

def clean_punctuation(df):
    for i in df.columns:
        df[i] = df[i].astype(str) 
    df = df.applymap(lambda x: x.lower())
    for i in df.columns:
        df[i] = df[i].str.replace('[^\w\s\.\-\(\)\,\:\/\\\\]','')
    df = df.applymap(lambda x: trim(x))
    df = df.applymap(lambda x: unidecode(x))
    for i in df.columns:
        df[i] = df[i].replace({'nan': None, 'none': None, 'nat': None})
    return df

Then you can follow online instruction on how to install a package manually. make sure you pip uninstall pandas-dedupe first.

Let me know if that doesn't make sense.

I meant to provide a parameter in this package to allow users to change the default string cleaning - feel free to put in a pull request if you end up making a more permanent fix!

soliverc commented 5 years ago

OK thanks. So I should change the regex from:

df[i] = df[i].str.replace('[^\w\s\.\-\(\)\,\:\/\\\\]','')

to:

df[i] = df[i].str.replace('[^\w\s.-(),:/\\]','')

? I am not great with regex so just double checking.

I will be happy to contribute if I get this working.

Lyonk71 commented 5 years ago

Sorry, my str.replace line was a copy/paste error. You mentioned you're new to regex; here's an explanation of the original regex:

df[i].str.replace('[^\w\s\.\-\(\)\,\:\/\\\\]','') The ^ stands for not. \w represents any letter or numeric character \s represents spaces

So , df[i].str.replace('[^\w\s]','') means, don't remove letters numbers or spaces, but remove everything else.

You can also call out special characters by putting a \ followed by the character, so\. means periods,\- for hyphens. So in order to keep the @ symbol, you would use \@

So your update line would look something like:

df[i].str.replace('[^\w\s\@\.\-\(\)\,\:\/\\\\]','')

soliverc commented 5 years ago

OK thanks .. i noticed all text is also converted to lower() during deduping. So I will use str.title() to revert all customer records back to the original form.

Lyonk71 commented 5 years ago

Great. After chatting with you, I'm thinking there definitely needs to be a parameter that allows people to toggle standardization. I'm picturing something like this: pandas_dedupe.dedupe_dataframe(df, ['field_1', 'field_2', ... ], standardize=True)

It could default to True, but allow user to set False for no automatic formatting/cleaning.

if standardize == True:
    clean_punctuation()
elif standardize == False:
    pass
else:
    Raise('standardize parameter must be True or False')

I'm adding a new feature request for this under the "issues" tab. Like I said, you're welcome to fork the project and send in a pull request - it's not terribly urgent, so let me know if that sounds interesting. I'd be happy to break the steps out more if you're new to projects like this. Otherwise, I'll probably write the implementation in the next couple of weeks.

soliverc commented 5 years ago

So I've updated the package and manually installed it to test it out. My dataframe is now being returned in original titlecase and emails are still valid.

In utility_functions.py I have edited the regex:

def clean_punctuation(df):
    for i in df.columns:
        df[i] = df[i].astype(str) 
    df = df.applymap(lambda x: x.lower())
    for i in df.columns:
        df[i] = df[i].str.replace('[^\w\s\@\.\-\(\)\,\:\/\\\\]','')
    df = df.applymap(lambda x: trim(x))
    df = df.applymap(lambda x: unidecode(x))
    for i in df.columns:
        df[i] = df[i].replace({'nan': None, 'none': None, 'nat': None})
    return df

And for converting my dataframe back to titlecase I have added an applymap() operation to dedupe_dataframe.py.

It's working perfectly now, however I'm not sure if I set this up correctly. I added a choice for titlecase in the function def dedupe_dataframe. You can see by default I have return_titlecase=False.

def dedupe_dataframe(df, field_properties, canonicalize=False,
                     config_name="dedupe_dataframe", recall_weight=1,
                     sample_size=0.3,return_titlecase=False):

Then at the very end, I check if return_titlecase=True and if it is True, the titlecase operation is done with this line:

# if user chooses return_titlecase=True
# convert dataframe to titlecase
# converts strings only - this avoids converting
# ints such as confidence score and cluster ID

results = results.applymap(lambda x: str(x).title() if isinstance(x,str) and pd.notnull(x) else x)

Here is the def dedupe_dataframe() function in full:

def dedupe_dataframe(df, field_properties, canonicalize=False,
                     config_name="dedupe_dataframe", recall_weight=1,
                     sample_size=0.3,return_titlecase=False):
    """Deduplicates a dataframe given fields of interest.

        Parameters
        ----------
        df : pd.DataFrame
            The dataframe to deduplicate.
        field_properties : list
            A list specifying what fields to use for deduplicating records.
        canonicalize : bool or list, default False
            Option that provides the canonical records as additional columns.
            Specifying a list of column names only canonicalizes those columns.
        config_name : str, default dedupe_dataframe
            The configuration file name. Note that this will be used as 
            a prefix to save the settings and training files.
        recall_weight : int, default 1
            Find the threshold that will maximize a weighted average of our
            precision and recall.  When we set the recall weight to 2, we are
            saying we care twice as much about recall as we do precision.
        sample_size : float, default 0.3
            Specify the sample size used for training as a float from 0 to 1.
            By default it is 30% (0.3) of our data.
        return_titlecase : bool, default False
            Applies titlecase to all elements in the returned Dataframe.

        Returns
        -------
        pd.DataFrame
            A pandas dataframe that contains the cluster id and confidence
            score. Optionally, it will contain canonicalized columns for all
            attributes of the record.

    """
    # Import Data  
    config_name = config_name.replace(" ", "_")

    settings_file = config_name + '_learned_settings'
    training_file = config_name + '_training.json'

    print('importing data ...')

    df = clean_punctuation(df)

    specify_type(df, field_properties)                

    df['dictionary'] = df.apply(
        lambda x: dict(zip(df.columns, x.tolist())), axis=1)
    data_d = dict(zip(df.index, df.dictionary))

    # train or load the model
    deduper = _train(settings_file, training_file, data_d, field_properties,
                     sample_size)

    # ## Set threshold
    threshold = deduper.threshold(data_d, recall_weight=recall_weight)

    # cluster the records
    clustered_df = _cluster(deduper, data_d, threshold, canonicalize)
    results = df.join(clustered_df, how='left')
    results.drop(['dictionary'], axis=1, inplace=True)

    # reverting Nonetypes back to Nan
    results.fillna(value=pd.np.nan, inplace=True)

    # if user chooses return_titlecase=True
    # convert dataframe to titlecase
    # converts strings only - this avoids converting
    # ints such as confidence score and cluster ID
    if return_titlecase == True:
        results = results.applymap(lambda x: str(x).title() if isinstance(x,str) and pd.notnull(x) else x)

    return results
soliverc commented 5 years ago

Just created a pull request... my first one. Hopefully I've done it correctly!

cfatls commented 3 years ago

Thanks for this! this Library has been extremely helpful and this thread solved the same issue I had. Very clear and concise explanation even for a newbie like me. Greetings from Colombia.