RobinL / fuzzymatcher

Record linking package that fuzzy matches two Python pandas dataframes using sqlite3 fts4
MIT License

Deduping a single df #55

Open matt-erhart opened 4 years ago

matt-erhart commented 4 years ago

I'm looking to do this on a single df (and then across more than two). Is there a way to do this that I've missed in the docs?

andrewRowlinson commented 4 years ago

I posted a potential solution in #56

I hacked together a solution for my dataset. It joins the dataframe to a copy of itself and then excludes any links where the unique identifier matches. In my case the data was football players: one column to de-duplicate on ('Name') and an identifier ('player_id').

import pandas as pd
import numpy as np
import os
from fuzzymatcher.data_preprocessor_default import DataPreprocessor
from fuzzymatcher.data_getter_sqlite import DataGetter
from fuzzymatcher.scorer_default import Scorer
from fuzzymatcher.matcher import Matcher

# load a dataframe

df_all_shots = pd.read_parquet(os.path.join('..', 'data', 'shots.parquet'))

# split into a left and right copy
df_left = df_all_shots[['player_id', 'Name']].drop_duplicates('player_id').copy()
df_right = df_all_shots[['player_id', 'Name']].drop_duplicates('player_id').copy()

# create a match object
dp = DataPreprocessor()
dg = DataGetter()
s = Scorer()
m = Matcher(dp, dg, s)

# match the data with a copy of itself
m.add_data(df_left, df_right, left_on='Name', right_on='Name',
           left_id_col='player_id', right_id_col='player_id')
m.match_all()
df_duplicated = m.link_table

# subset the data where the ids don't match - these are the duplicates
df_duplicated = df_duplicated[df_duplicated.__id_left != df_duplicated.__id_right].copy()

# each duplicate pair is matched twice (once as left/right, once as right/left), so keep only one copy.
# I had a numeric id, so I created columns with the min/max of the two ids and
# de-duplicated on these new id columns
df_duplicated['id1'] = df_duplicated[['__id_left', '__id_right']].min(axis=1)
df_duplicated['id2'] = df_duplicated[['__id_left', '__id_right']].max(axis=1)
df_duplicated.drop(['__id_left', '__id_right', '__rank'], axis=1, inplace=True)
df_duplicated.drop_duplicates(inplace=True)
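
# aside (not in my original code): the min/max trick relies on the ids being orderable.
# If your id is a string and row-wise min/max misbehaves on that dtype, the same
# canonical ordering could be built explicitly (before dropping the columns above), e.g.:
#   df_duplicated['id1'] = np.where(df_duplicated.__id_left <= df_duplicated.__id_right,
#                                   df_duplicated.__id_left, df_duplicated.__id_right)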

# merge back on the columns you want to keep
df_duplicated = df_duplicated.merge(df_left, how='left', left_on='id1', right_on='player_id')
df_duplicated = df_duplicated.merge(df_left, how='left', left_on='id2', right_on='player_id', suffixes=['_1', '_2'])
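
If you'd rather not wire up the Matcher components by hand, the same self-join idea can probably be expressed with fuzzymatcher's top-level link_table helper. This is an untested sketch of that approach; it assumes link_table takes the same left_on/right_on/left_id_col/right_id_col arguments as fuzzy_left_join and returns the __id_left/__id_right columns used above.

import fuzzymatcher

df_players = df_all_shots[['player_id', 'Name']].drop_duplicates('player_id')

# link the player list against itself on the fuzzy 'Name' column
links = fuzzymatcher.link_table(
    df_players, df_players,
    left_on='Name', right_on='Name',
    left_id_col='player_id', right_id_col='player_id')

# drop self-matches, then apply the same min/max pair de-duplication as above
candidate_dupes = links[links.__id_left != links.__id_right].copy()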

Here's my result: [screenshot of the resulting dataframe of duplicate pairs]