berndnoll opened this issue 3 years ago
Hi @berndnoll
Thanks for the comment.
It seems to me that your request involves applying string_grouper's function match_strings to several fields of your dataset one after the other (using one Series for each field, preferably indexed) and merging the resulting similarity scores on the output fields left_id and right_id (that is, with how='outer' and on=['left_id', 'right_id']; see the pandas documentation).
Exact comparisons may be achieved by specifying a threshold similarity score very close to 1.0 (for example, min_similarity=0.9999), while fuzzy matches can be specified with a lower desired score.
Furthermore, note that before applying each merge operation, the output field similarity would need to be renamed in both operands to the name of the corresponding field that produced the similarity score.
I hope this answers your question.
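In case a concrete outline helps, here is a minimal sketch of that recipe on toy data (not your actual dataset; the 'id', 'name' and 'address' columns are hypothetical, and naming the id Series 'id' is what makes the output carry left_id/right_id columns):

import pandas as pd
from string_grouper import match_strings

# Toy data; 'id', 'name', 'address' are assumed column names.
df = pd.DataFrame({'id': ['1', '2', '3'],
                   'name': ['Acme Inc', 'Acme Incorporated', 'Beta LLC'],
                   'address': ['1 Main St', '1 Main Street', '2 Oak Ave']})

# One match_strings call per field; master_id supplies the record ids.
name_matches = match_strings(df['name'], master_id=df['id'], min_similarity=0.8)
addr_matches = match_strings(df['address'], master_id=df['id'], min_similarity=0.8)

# Rename each 'similarity' column after the field that produced it,
# then outer-merge on the id pair, as described above.
name_matches = name_matches.rename(columns={'similarity': 'name_sim'})
addr_matches = addr_matches.rename(columns={'similarity': 'address_sim'})
combined = name_matches.merge(addr_matches, how='outer', on=['left_id', 'right_id'])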
Hi @ParticularMiner, thank you very much for your helpful answer. I managed to implement the merge operation and got this to work according to your advice. Performance is fine with a small dataset (40k records).
However, I have a pretty large dataset (>3 million records) and matching takes foreeeever (already for one field). So my follow-up question: do you have any great ideas on how to improve performance for larger datasets? Some record linkage libs provide the ability of blocking (e.g. by country or zipcode), which can significantly reduce matching time.
Thanks
Hi @berndnoll
I’m glad to hear you succeeded in achieving your goal.
You could play around with the string_grouper option max_n_matches, which may be used to limit the number of matches found per record and thus speed things up a bit. Start small, say with max_n_matches=20, and increase until your output doesn't change anymore.
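If it helps, here is one rough way to automate that tuning advice (just a sketch; the doubling schedule and the stopping rule are my own assumptions, not part of string_grouper):

from string_grouper import match_strings

def tune_max_n_matches(series, start=20, min_similarity=0.85):
    # Keep doubling max_n_matches until the number of reported matches stops
    # growing, i.e. no matches are being truncated anymore.
    n, prev = start, -1
    while True:
        matches = match_strings(series, max_n_matches=n, min_similarity=min_similarity)
        if len(matches) == prev:
            return n, matches
        n, prev = n * 2, len(matches)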
Hi @berndnoll,
I also use this approach and found it very good and quick compared to recordlinkage, which you mentioned above. Since you want to use an exact match for some fields, why don't you break your whole dataset into subsets, e.g. by state and/or suburb, and run the match_strings procedure for each subset separately? That would save a LOT of computation. Then you would merge the resulting DataFrames.
@ParticularMiner, what do you think?
Thank you!
@iibarant, thanks for the advice. That's exactly what I implemented today. However, running multiple matches for small subsets actually takes way more time than running one match over a big set... I am using country and zip code to break the data into subsets. I think ideally blocking/exact matching should be part of the "inner" algorithm. Just my 2 cents.
@iibarant @berndnoll
Thanks for your contributions. If you don't mind, could you share your code (and perhaps some accompanying sample data) so I can properly examine/assess your methods? I'm quite interested to know.
Hi guys,
Unfortunately I am not allowed to share the data, but here's the experiment I ran with 7 states and the code I used. I merged the address fields into a 'full address' column and kept 'foad' (state) separately.
The existing column names are ['document_id','name','foad','full address'].
Please see the RunTime results:
The code for merging subsets is here:
import time
import pandas as pd
from string_grouper import match_strings

t0 = time.time()
states = ['MT','DE','HI','ID','RI','DC','ME']
match_list = []
for state in states:
    df = df_cleaned[df_cleaned['foad'] == state].copy()
    matches = match_strings(df['full address'], max_n_matches=300, min_similarity=0.85)
    match_list.append(matches)
final_match = pd.concat(match_list, ignore_index=True)
t = time.time()-t0
print("Ran for:", t, " seconds with match size = ", len(final_match))
Running the data separately by states and merging them afterwards takes significantly less time.
I have to try that for big states like California, NY, Florida and Texas. It will surely be faster to run separately and merge after.
Thank you.
Thank you @iibarant !
I suppose separating the data by state makes complete sense especially if any two addresses not in the same state never need to be matched (that is, if I have interpreted your data correctly). Then, certainly, separating the data will proceed much faster than not separating the data.
It seems to me that @berndnoll (correct me if I'm wrong) has a somewhat different task, which requires matching different fields for the same records and merging the results. (That is, he needs to merge along the horizontal axis rather than the vertical axis, as you do.) So he might proceed differently.
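Just to make the two directions concrete, here is a schematic sketch with made-up mini DataFrames (the column names are only illustrative):

import pandas as pd

# berndnoll's case: one result per *field* for the same id pairs, combined side by side.
name_sim = pd.DataFrame({'left_id': [1], 'right_id': [2], 'name_sim': [0.93]})
addr_sim = pd.DataFrame({'left_id': [1], 'right_id': [2], 'address_sim': [0.88]})
horizontal = name_sim.merge(addr_sim, on=['left_id', 'right_id'], how='outer')

# iibarant's case: one result per *group* (e.g. per state), simply stacked row-wise.
group_results = [pd.DataFrame({'left_id': [1], 'right_id': [2], 'similarity': [0.91]}),
                 pd.DataFrame({'left_id': [5], 'right_id': [7], 'similarity': [0.86]})]
vertical = pd.concat(group_results, ignore_index=True)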
Hi gents, unfortunately not able to share the original file, but the one attached will do for our purposes. My code below. Grouping/blocking can be switched on/off by setting attribute 'group' to True/False in line 13.
I did concatenate zipcode and city to be as close as possible to my original data. In the original data file I am doing this with country + zipcode, as country keys like NL, DE, US, ... are too short to deliver good results.
My observations are: running this grouped takes 27 seconds; ungrouped, 3.2 seconds.
However, this might be due to my poor py skills... Looking forward to some pro comments :) Ideally, grouping/blocking would be part of the actual string_grouper implementation, but I leave that up to @ParticularMiner ...
Cheers
us-cities-real-estate-sample-zenrows.zip
Removing scrambled code --- see below for correctly formatted version.
Hi @berndnoll
Many thanks for your code and sample data. Unfortunately important indentations/tabs in your code got lost in your message, making some portions of the code unreadable. It would be helpful to the reader if you would bracket your code-snippets with the required markdown directives (```python and ```) as follows:
Code such as the following:
if using_github == True:
    sample_code(goes, here)
should be coded as follows:
```python
if using_github == True:
    sample_code(goes, here)
```
In this way, the indentations are preserved. Follow this link for more information.
@ParticularMiner: Sorry for that, trying one more time...
import sys, csv, json, pandas as pd
import numpy
import string_grouper as sg
import time
print('Dedupe script')
t0 = time.time()
inputfilename = 'us-cities-real-estate-sample-zenrows.csv'
outputfilename = 'dupes.csv'
# match field definitions - groups have to go first
comps = [
{'fieldname':'addressZipcode+addressCity','threshold':0.99,'weight':1,'valsep':'|','group':False},
{'fieldname':'addressStreet','threshold':0.8,'weight':1,'valsep':'|'}
]
keyfield = 'id'
mintotalscore = 0.8
df = pd.read_csv(inputfilename, dtype = str)
print('#Records:',len(df)-1)
df.fillna('', inplace=True)
lkfld = 'left_'+keyfield
rkfld = 'right_'+keyfield
keyfields = [lkfld,rkfld]
allfields = keyfields.copy()
allfields.append('totalscore')
outmatches = pd.DataFrame()
first = True
gfield = ''
for comp in comps:
    sim = 0.8 if 'threshold' not in comp else comp['threshold']
    weight = 1 if 'weight' not in comp else comp['weight']
    group = False if 'group' not in comp else comp['group']
    valsep = '|' if 'valsep' not in comp else comp['valsep']
    field = comp['fieldname']
    flds = field.split('+')
    if len(flds)>1: # Multiple fields? Create concatenated field
        field = '_'.join(flds)
        df[field] = df[flds].apply(lambda row: valsep.join(row.values.astype(str)), axis=1)
    allfields.append('left_'+field)
    allfields.append('right_'+field)
    allfields.append(field+'_sim')
    simfn = field + '_sim'
    if group == True:
        gfield = field
        gweight = weight
        gvals = df[gfield].unique().tolist()
        continue # No matches for this field, as it's group
    print('Field: ',field)
    if gfield != '':
        print(' group field=',gfield)
        # Perform match per each group
        matches = pd.DataFrame()
        for gval in gvals:
            gdf = df[ df[gfield] == gval ]
            strs = gdf[field].astype(str)
            keys = gdf[keyfield].astype(str)
            gmatches = pd.DataFrame()
            if len(strs)>1:
                #print('group value:', gval)
                # Perform the match
                gmatches = sg.match_strings(strs, min_similarity=sim, master_id = keys, ignore_index=True, max_n_matches=20)
                gmatches = gmatches[gmatches[lkfld] != gmatches[rkfld]]
                gmatches['similarity'] = gmatches['similarity'] + 1 # Add a full score for the group
                gmatches['left_'+gfield] = gval
                gmatches['right_'+gfield] = gval
                gmatches[gfield+'_sim'] = 1
            matches = matches.append(gmatches)
    else:
        strs = df[field].astype(str)
        keys = df[keyfield].astype(str)
        # Perform the match
        matches = sg.match_strings(strs, min_similarity=sim, master_id = keys, ignore_index=True, max_n_matches=20)
    # Only keep matches that have different keys
    matches = matches[matches[lkfld] != matches[rkfld]]
    matches['similarity'] = matches['similarity']*weight
    matches.rename(columns={'similarity': simfn }, inplace=True )
    # Merge results
    if first == True: # First field
        outmatches = matches
        outmatches['totalscore'] = matches[simfn]
        first = False
    else:
        outmatches = outmatches.merge(matches,on=keyfields,how='outer')
        outmatches['totalscore'] = outmatches['totalscore'].add(outmatches[simfn])

def onetwo(row):
    if not hasattr(onetwo, "number"):
        onetwo.number = 0
    if onetwo.number == 1:
        onetwo.number = 2
    else:
        onetwo.number = 1
    return onetwo.number

# Filter
outmatches = outmatches[outmatches[lkfld] != outmatches[rkfld]]
outmatches = outmatches[outmatches.totalscore >= mintotalscore]
# Eliminate duplicates
outmatches.sort_values([lkfld,rkfld])
outmatches['onetwo'] = outmatches.apply(lambda row: onetwo(row), axis=1)
delkeys = outmatches[rkfld].unique().tolist()
outmatches = outmatches[(outmatches[lkfld].isin(delkeys)) & (outmatches['onetwo'] != 2)]
# Output
nummatches = len(outmatches)
if nummatches > 0:
    t = time.time()-t0
    print('duration:', t)
    print('output:', nummatches)
    outmatches.to_csv(outputfilename, sep='\t', index=False, columns=allfields)
print('Done.')
@berndnoll @iibarant
Thanks again for your contributions and suggestions. I have borrowed from the code snippets you both provided to produce the following code:
[Follow this link to obtain the most recent code.]
I'd appreciate it if you would test and review it, so as to determine if it would be useful enough to add to string_grouper's utility package string_grouper_utils.
Of course feel free to discuss any issues.
FYI, using the sample data that @berndnoll provided, I found that grouping the data for exact matches before computing the similarities does not necessarily lead to faster or slower computation. In fact, the speed is very much data-dependent.
@iibarant @berndnoll
CAUTION:
You may have already noticed that grouping the exact matches before similarity-matching often leads to results that are different from those obtained without grouping. This is because of the way similarity-matching works: similarity-scores are dependent on the vocabulary-set (or corpus, as it's called in Natural Language processing).
This means that calling match_strings() on only a subset, or group, of the data (for example, all records whose 'state' field has the value 'AL') would in general yield different similarity scores for those same records than if match_strings() were called on the entire dataset. That's because the corpus in each case is different: in the former, it is the group; in the latter, it is the entire dataset.
This is why it might be better not to group at all. Nevertheless, the choice is always up to the user, as long as he/she knows what he/she is doing.
Somehow, I had forgotten about this fact before. Sorry about that. I hope that hasn't caused you too much inconvenience.
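To illustrate the corpus effect outside of string_grouper itself, here is a small sketch with scikit-learn's TfidfVectorizer on character 3-grams (the kind of vectorization string_grouper uses under the hood); the addresses are made up:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

whole = pd.Series(['12 main st montgomery al', '12 main street montgomery al',
                   '99 oak ave juneau ak', '99 oak avenue juneau ak'])
subset = whole[:2]  # e.g. only the 'AL' group

vec_whole = TfidfVectorizer(analyzer='char', ngram_range=(3, 3)).fit(whole)
vec_subset = TfidfVectorizer(analyzer='char', ngram_range=(3, 3)).fit(subset)

# The same string is represented by different vectors (different vocabulary and
# idf weights), so cosine similarities computed within a group generally differ
# from those computed against the whole dataset.
print(vec_whole.transform(subset[:1]).shape, vec_subset.transform(subset[:1]).shape)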
Thank you @ParticularMiner,
This is a good point. That's why I decided to combine states with much lower data collection (in my case < 10000) all together. Hope this would be acceptable.
Thank you for your solution. I will explore it next week.
Have a great weekend!
@berndnoll @iibarant
By modifying the StringGrouper class a little bit, I have been able to fix the problem of inconsistent results between the two methods. So please note the update to the code in my previous post. Now you can obtain the same result whether or not you first group by exact fields.
@berndnoll @iibarant
Latest update now includes total similarity scores.
@ParticularMiner Thank you so much for looking into this and for collaborating on this. Very much appreciated!
I basically copied and pasted your code to test it before applying it to a bigger dataset. After 30 minutes I restarted the script ... same result. Reducing the input volume from 10,000 to 1,000 records makes it run through, but ehm ... I will not let it work on 3 million rows for sure.
Here is the output:
Dedupe script
T1: 1.159604787826538
T2: 55.857362031936646
Done.
So ... either I got something wrong or you have a quantum computer? What was the runtime of your examples?
Here is my code - based on the latest update you posted.
Cheers
import sys, csv, json, pandas as pd
import numpy as np
import string_grouper as sg
import time
import functools
from string_grouper import StringGrouper
def record_linkage(data_frame,
                   fields_2b_matched_fuzzily,
                   fields_2b_matched_exactly=None,
                   hierarchical=True,
                   max_n_matches=5000,
                   similarity_dtype=np.float32
                   ):
    '''
    Function that combines similarity-matching results of several fields of a DataFrame and returns them in another
    DataFrame.
    :param data_frame: pandas.DataFrame of strings.
    :param fields_2b_matched_fuzzily: List of tuples. Each tuple is a triple
    (<field name>, <threshold>, <ngram_size>).
    <field name> is the name of a field in data_frame which is to be matched using a threshold similarity score of
    <threshold> and an ngram size of <ngram_size>.
    :param fields_2b_matched_exactly: List of strings. Each string is a field name of data_frame. Defaults to None.
    :param hierarchical: bool. Determines if the output DataFrame will have a hierarchical column-structure (True) or
    not (False). Defaults to True.
    :param max_n_matches: int. Maximum number of matches allowed per string.
    :param similarity_dtype: numpy type. Either np.float32 (the default) or np.float64. A value of np.float32 allows
    for less memory overhead during computation but less numerical precision, while np.float64 allows for greater
    numerical precision but a larger memory overhead.
    :return: pandas.DataFrame containing matching results.
    '''
    class VariableStringSetStringGrouper(StringGrouper):
        # This class enables StringGrouper to apply matching on different successive master datasets without
        # resetting the underlying corpus, as long as each master dataset is contained in the corpus.
        # This class inherits from StringGrouper, overwriting only two of its methods:
        # __init__() and _get_tf_idf_matrices()
        def __init__(self, corpus, **kwargs):
            # initializer is the same as before except that it now also sets the corpus
            super().__init__(corpus, **kwargs)
            self._vectorizer = self._fit_vectorizer()

        def _get_tf_idf_matrices(self):
            # _get_tf_idf_matrices() now no longer sets the corpus but rather builds the matrices from
            # the existing corpus
            master_matrix = self._vectorizer.transform(self._master)
            return master_matrix, master_matrix

    def match_strings(master, sg):
        sg._master = master
        sg = sg.fit()
        return sg.get_matches()

    def get_only_field_names(fields_2b_matched_fuzzily):
        return list(list(zip(*fields_2b_matched_fuzzily))[0])

    def build_column_precursor_to(df, exact_field_value_pairs):
        exact_df = df.iloc[:, 0:0]
        for field_name, field_value in exact_field_value_pairs:
            exact_df[field_name] = field_value
        return exact_df

    def horizontal_linkage(df,
                           match_indexes,
                           fields_2b_matched_fuzzily,
                           exact_field_value_pairs=None,
                           hierarchical=True):
        horizontal_merger_list = []
        for field, sg in fields_2b_matched_fuzzily:
            matches = match_strings(df[field], sg)
            matches.set_index(match_indexes, inplace=True)
            matches = weed_out_trivial_matches(matches)
            if hierarchical:
                merger = matches[[f'left_{field}', 'similarity', f'right_{field}']]
                merger.rename(columns={f'left_{field}': 'left', f'right_{field}': 'right'}, inplace=True)
            else:
                merger = matches[['similarity']]
                merger.rename(columns={'similarity': field}, inplace=True)
            horizontal_merger_list += [merger]
        field_list = get_only_field_names(fields_2b_matched_fuzzily)
        key_list = None if not hierarchical else field_list
        merged_df = pd.concat(horizontal_merger_list, axis=1, keys=key_list, join='inner')
        title_exact = 'Exactly Matched Fields'
        title_fuzzy = 'Fuzzily Matched Fields'
        if exact_field_value_pairs:
            exact_df = build_column_precursor_to(merged_df, exact_field_value_pairs)
            merged_df = pd.concat([exact_df, merged_df],
                                  axis=1,
                                  keys=[title_exact, title_fuzzy],
                                  join='inner')
        totals = compute_totals(merged_df, field_list, hierarchical, exact_field_value_pairs, title_fuzzy)
        return pd.concat([totals, merged_df], axis=1)

    def weed_out_trivial_matches(matches):
        num_indexes = matches.index.nlevels//2
        return matches[
            functools.reduce(
                lambda a, b: a | b,
                [
                    (matches.index.get_level_values(i) != matches.index.get_level_values(i + num_indexes))
                    for i in range(num_indexes)
                ]
            )
        ]

    def compute_totals(merged_df, field_list, hierarchical, exact_field_value_pairs, title_fuzzy):
        title_total = 'Total Similarity Score'
        if hierarchical:
            if exact_field_value_pairs:
                totals = merged_df[
                    [(title_fuzzy, field, 'similarity') for field in field_list]
                ].sum(axis=1) + len(exact_field_value_pairs)
                totals = pd.concat([totals], axis=1, keys=[('', '', title_total)])
            else:
                totals = merged_df[
                    [(field, 'similarity') for field in field_list]
                ].sum(axis=1)
                totals = pd.concat([totals], axis=1, keys=[('', title_total)])
        else:
            if exact_field_value_pairs:
                totals = merged_df[
                    [(title_fuzzy, field) for field in field_list]
                ].sum(axis=1) + len(exact_field_value_pairs)
                totals = pd.concat([totals], axis=1, keys=[('', title_total)])
            else:
                totals = merged_df[field_list].sum(axis=1)
                totals.rename(title_total, inplace=True)
        return totals

    def get_index_names(df):
        empty_df = df.iloc[0:0]
        return [field for field in empty_df.reset_index().columns
                if field not in empty_df.columns]

    def prepend(strings, prefix):
        return [f'{prefix}{i}' for i in strings]

    def append_to_first_of_each_tuple(list_of_tuples, list_of_appendices):
        return [(tupl[0], ) + (x, ) for tupl, x in list(zip(list_of_tuples, list_of_appendices))]

    index_name_list = get_index_names(data_frame)
    match_indexes = prepend(index_name_list, prefix='left_') + prepend(index_name_list, prefix='right_')
    # set the corpus for each fuzzy field
    stringGroupers = []
    for field, threshold, ngram_sz in fields_2b_matched_fuzzily:
        stringGroupers += [VariableStringSetStringGrouper(data_frame[field],
                                                          min_similarity=threshold,
                                                          ngram_size=ngram_sz,
                                                          max_n_matches=max_n_matches,
                                                          tfidf_matrix_dtype=similarity_dtype)]
    fuzzy_bundles = append_to_first_of_each_tuple(fields_2b_matched_fuzzily, stringGroupers)
    if not fields_2b_matched_exactly:
        return horizontal_linkage(data_frame,
                                  match_indexes,
                                  fuzzy_bundles,
                                  hierarchical=hierarchical)
    else:
        groups = data_frame.groupby(fields_2b_matched_exactly)
        vertical_merger_list = []
        for group_value, group_df in groups:
            values_matched_exactly = [group_value] if not isinstance(group_value, tuple) else list(group_value)
            exact_field_value_pairs = list(zip(fields_2b_matched_exactly, values_matched_exactly))
            vertical_merger_list += [horizontal_linkage(group_df,
                                                        match_indexes,
                                                        fuzzy_bundles,
                                                        exact_field_value_pairs=exact_field_value_pairs,
                                                        hierarchical=hierarchical)]
        return pd.concat(vertical_merger_list)
# =========== MAIN ===========
print('Dedupe script')
t0 = time.time()
inputfilename = 'us-cities-real-estate-sample-zenrows.csv'
outputfilename1 = 'deduped_noneexact.csv'
outputfilename2 = 'deduped_someexact.csv'
df = pd.read_csv(inputfilename, dtype=str)
df = df[['zpid', 'statusText', 'addressZipcode', 'addressState', 'address', 'imgSrc', 'hasVideo']]
df.set_index('zpid', inplace=True)
record_matches_none_exact = record_linkage(df,
fields_2b_matched_fuzzily=[('statusText', 0.8, 3),
('address', 0.8, 3),
('addressState', 0.99999, 2),
('addressZipcode', 0.99999, 3),
('hasVideo', 0.99999, 3)],
fields_2b_matched_exactly=None,
hierarchical=True,
max_n_matches=100,
similarity_dtype=np.float32)
outmatches = \
record_matches_none_exact[
record_matches_none_exact.index.get_level_values(0) < record_matches_none_exact.index.get_level_values(1)
]
print('T1:',time.time()-t0)
t0 = time.time()
outmatches.to_csv(outputfilename1, sep='\t', index=False)
record_matches_some_exact = record_linkage(df,
fields_2b_matched_fuzzily=[('statusText', 0.8, 3),
('address', 0.8, 3)],
fields_2b_matched_exactly=['addressState', 'addressZipcode', 'hasVideo'],
hierarchical=True,
max_n_matches=100,
similarity_dtype=np.float32)
outmatches = record_matches_some_exact[record_matches_some_exact.index.get_level_values(0) < record_matches_some_exact.index.get_level_values(1)]
print('T2:',time.time()-t0)
t0 = time.time()
outmatches.to_csv(outputfilename2, sep='\t', index=False)
print('Done.')
Hi @berndnoll
Very curious indeed! I'm not sure what's going on here.
It doesn't look like you are doing anything different from what I did, except that you are using quite a small value (100) for max_n_matches (I used max_n_matches = 10000). As you can see, the results I posted before were from the same 10 000-row dataset you sent me in that csv file. And they took:
T1: 3m 52.98s
T2: 5m 18.61s
What are your machine specs, by the way? Mine are: Intel Core i7 CPU@2.20GHz; 16GB RAM; Cache Memory: 256 KB/6 MB; with a standard flash-memory solid state drive running the latest Microsoft Windows Home operating system. The code was run in a Jupyter notebook in the Microsoft Edge browser using Python 3.9.2. So no quantum computer here. 😃 But about six months ago I was using a hard disk drive, which made everything run extremely slowly, especially with large datasets. Upgrading to a solid state drive made a HUGE difference.
Hi @ParticularMiner, my machine specs are similar to yours...Intel(R) Core(TM) i7-8650U CPU @ 2.11 GHz, 16 GB RAM, SSD, Win 10 Prof.
Looking at the runtimes I had with my (way less elegant) hack vs. your beautiful code, I wonder what's causing the slower performance. Just the additional fields? I am not too familiar with code profiling in Python, though. Will do some more testing and digging starting Tuesday when the Labor Day weekend is over and let you know what I find.
Cheers Bernd
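In case it helps with the profiling: the standard library's cProfile is usually enough to see where the time goes. A minimal sketch (record_linkage and df are taken from the snippets above; the field choice is arbitrary):

import cProfile
import pstats

pr = cProfile.Profile()
pr.enable()
result = record_linkage(df, fields_2b_matched_fuzzily=[('address', 0.8, 3)])
pr.disable()
# Print the 20 functions with the largest cumulative time.
pstats.Stats(pr).sort_stats('cumulative').print_stats(20)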
Hi @iibarant @berndnoll
This can wait until whenever is convenient for you. No pressure! I'm on the other side of the Atlantic Ocean anyway, so I do not celebrate Labor Day today. :smiley:
Lessons learnt so far through experimentation:
1. Prior grouping by fields that are to be matched exactly leads to a performance boost ONLY IF the resulting number of groups is not too large. Recommendation: do not group by any field that has many unique data-values. For example, in @berndnoll's sample dataset, the field 'hasVideo' has only two unique data-values ('true' and 'false'), leading to only two groups. So here, yes, group by this field if you want to do exact comparisons on it; otherwise, calling match_strings() on this field would take so much longer, since so many matches will be found. On the other hand, for the same dataset, the concatenated field 'addressZipcode+addressCity' has more than 6000 unique values (df['addressZipcode+addressCity'].nunique() = 6579), so calling string_grouper's match_strings() function on each of these 6579 groups one after the other would result in an utter loss of efficiency. Somewhere between these two limits is a threshold number of groups (or unique field values) below which grouping leads to faster code; a short pre-check sketch follows this list. (I already mentioned this in a previous comment.)
2. Be aware of the fact that applying string_grouper's match_strings() function on different subsets of a single pandas Series of strings causes the underlying corpus to change from subset to subset, leading to inconsistent similarity scores across the subsets. Therefore one needs to modify the StringGrouper class slightly in order to fix the corpus between subsets. See my previous code posting for one way to do this.
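As a rough pre-check for the first lesson, something like the following can help decide whether a field is worth grouping by (a sketch; df stands for the sample dataset loaded earlier, and the 375-group crossover reported further below was measured on a 10 000-row dataset, so it will vary with your data):

MAX_GROUPS = 375  # assumption: tune this threshold for your own data

exact_candidates = ['hasVideo', 'addressState', 'addressZipcode']
for field in exact_candidates:
    n_groups = df[field].nunique()
    strategy = 'group by it' if n_groups <= MAX_GROUPS else 'match it fuzzily with min_similarity close to 1'
    print(f'{field}: {n_groups} unique values -> {strategy}')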
@iibarant @berndnoll
By way of illustration of the judicious and injudicious use of grouping, here are some runtimes of the examples I gave in my last post. The following function call with grouping on field 'hasVideo' (which has only two unique data-values: 'true' and 'false') takes just ~35 seconds:
record_matches_some_exact = record_linkage(df.iloc[:, :],
fields_2b_matched_fuzzily=[('statusText', 0.8, 3),
('address', 0.8, 3),
('addressState', 0.999999, 2),
('addressZipcode', 0.999999, 3)],
fields_2b_matched_exactly=['hasVideo'],
hierarchical=True,
max_n_matches=10000,
similarity_dtype=np.float32)
whereas the following call without grouping takes ~4.5 minutes to produce the same result:
record_matches_none_exact = record_linkage(df.iloc[:, :],
fields_2b_matched_fuzzily=[('statusText', 0.8, 3),
('address', 0.8, 3),
('addressState', 0.999999, 2),
('addressZipcode', 0.999999, 3),
('hasVideo', 0.999999, 3)],
fields_2b_matched_exactly=None,
hierarchical=True,
max_n_matches=10000,
similarity_dtype=np.float32)
Finally, the following call with grouping on field 'addressZipcode' (which has 6446 unique values) takes ~9 minutes to produce the same result:
record_matches_some_exact = record_linkage(df.iloc[:, :],
fields_2b_matched_fuzzily=[('statusText', 0.8, 3),
('address', 0.8, 3),
('addressState', 0.999999, 2),
('hasVideo', 0.999999, 3)],
fields_2b_matched_exactly=['addressZipcode'],
hierarchical=True,
max_n_matches=10000,
similarity_dtype=np.float32)
Hi @berndnoll
Thanks for the compliment! However your code is essentially not that different from mine.
It is true that we are both using different numbers of fields, but the main reason for the runtime difference between us is the value of the optional parameter max_n_matches. You used max_n_matches = 20; I used max_n_matches = 10000. By so doing you miss a lot of matches, while I capture all possible matches.
Applying my code to the same fields you used (but with max_n_matches=10000) produced the following runtimes:
T1: 1.74s
T2: 3m 23.55s
@berndnoll @iibarant
Looks like the magic number is ~375. (See plot below.)
So, at least for a 10 000-row dataset, 'grouping' is faster than 'not grouping' if the data can be grouped into fewer than 375 groups by the exact fields; otherwise 'not grouping' is faster.
[Plot: record_linkage() calls with and without grouping on a test field having a varying number of unique values in a 10 000-row DataFrame]
@ParticularMiner,
That's a great experiment. I ran your exact code on the data from @berndnoll and received the following runtime figures: T1 = 3 min 8.07 sec, T2 = 2 min 38.43 sec.
But setting up only the addressState field for the exact match reduced the runtime to just 5.07 seconds!!! So my intuitive approach was proven right.
Thank you very much for the professional code that is very convenient for any user.
Hi @iibarant
Indeed your intuition was right! And it was right because we have only 51 states to group 10 000 records by. Such a small number of unique values is well below the crossover number of 375.
Hope you had an enjoyable Labor Day!
Thank you! The Labor Day was great.
I tried the following code on my 3.5 million data frame:
record_matches = record_linkage(df,
fields_2b_matched_fuzzily=[('full address', 0.8, 3)],
fields_2b_matched_exactly=['foad'],
hierarchical=True,
max_n_matches=10000,
similarity_dtype=np.float32)
interesting_record_matches_some_exact = \
record_matches[
record_matches.index.get_level_values(0) < record_matches.index.get_level_values(1)].sort_values(('', '', 'Total Similarity Score'), ascending=False)
t = time.time()-t1
print("Runtime (sec): ", t)
And I got the following error:
Traceback (most recent call last):
File "
File "/Users/iibarant/Desktop/Python scripts/Explorations", line 174, in record_linkage vertical_merger_list += [horizontal_linkage(group_df,
File "/Users/iibarant/Desktop/Python scripts/Explorations", line 74, in horizontal_linkage matches = match_strings(df[field], sg)
File "/Users/iibarant/Desktop/Python scripts/Explorations", line 55, in match_strings sg = sg.fit()
File "/opt/anaconda3/lib/python3.8/site-packages/string_grouper/string_grouper.py", line 264, in fit matches, self._true_max_n_matches = self._build_matches(master_matrix, duplicate_matrix)
File "/opt/anaconda3/lib/python3.8/site-packages/string_grouper/string_grouper.py", line 467, in _build_matches return awesome_cossim_topn(
File "/opt/anaconda3/lib/python3.8/site-packages/sparse_dot_topn/awesome_cossim_topn.py", line 119, in awesome_cossim_topn alt_indices, alt_data = ct_thread.sparse_dot_topn_extd_threaded(
File "sparse_dot_topn/sparse_dot_topn_threaded.pyx", line 133, in sparse_dot_topn.sparse_dot_topn_threaded.__pyx_fuse_0sparse_dot_topn_extd_threaded
File "sparse_dot_topn/sparse_dot_topn_threaded.pyx", line 168, in sparse_dot_topn.sparse_dot_topn_threaded.sparse_dot_topn_extd_threaded
OverflowError: value too large to convert to int
@iibarant
You may have to reduce the value of max_n_matches. 3.5 million records is testing the limits of your computer.
@iibarant
Though this is an ambitious suggestion, you could also try to figure out how to use string_grouper to work on smaller, manageable pieces of your dataset, and afterwards piece together all the different results to form the whole, thus avoiding the OverflowError.
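For what it's worth, here is one rough way such a piecewise run could look (only a sketch, not a feature of string_grouper; note also the corpus caveat discussed earlier: each call fits its own corpus, so scores can shift slightly compared with one full run unless a fixed-corpus StringGrouper variant like the one posted above is used):

import numpy as np
import pandas as pd
from string_grouper import match_strings

full = df['full address']  # assumed column name, as in the earlier snippets
pieces = []
for chunk in np.array_split(full, 10):  # chunk count is an arbitrary choice
    # match each chunk against the full Series, then stack the partial results
    pieces.append(match_strings(chunk, full, min_similarity=0.8, max_n_matches=5000))
all_matches = pd.concat(pieces, ignore_index=True)
# drop trivial matches of a record with itself
all_matches = all_matches[all_matches['left_index'] != all_matches['right_index']]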
@ParticularMiner Reducing max_n_matches to 5000 allowed the whole dataset to run in 1 hour 53 minutes, which is not bad for the size of the dataframe (over 3.5 million records).
@iibarant That's great. Did you ascertain that that captured all possible matches?
Hi @iibarant @berndnoll
Sorry to bother you with this: I have just updated my code again (please follow this link to obtain the update), and I need someone else to test it, as I do not have the appropriate data.
The idea is to test the code with very large datasets (>3 million records) without having to worry about the value of max_n_matches. Since both of you have access to such data, you are best suited to test this code. You may simply try doing again what didn't work before.
The code is supposed to work regardless of the size of the DataFrame and regardless of how large the value of max_n_matches is. In other words, the OverflowError, which you have hitherto both experienced, should no longer appear. This is quite important, since if max_n_matches is too small, some matches will be missed by the routine.
Your help is very much appreciated, and will also go a long way to improve the package string_grouper itself.
Thanks in advance.
@ParticularMiner, I will do my best. Should it be a different call of the procedure than the previous one, which required max_n_matches?
This is the one I used:
t1 = time.time()
record_matches = record_linkage(df,
fields_2b_matched_fuzzily=[('full address', 0.8, 3)],
fields_2b_matched_exactly=['foad'],
hierarchical=True,
max_n_matches=5000,
similarity_dtype=np.float32)
best_matches = \
record_matches[
record_matches.index.get_level_values(0) < record_matches.index.get_level_values(1)].sort_values(('', '', 'Total Similarity Score'), ascending=False)
t = time.time()-t1
print("Runtime (sec): ", t)
By the way, I ran your code for California only (around 380 000 records) and it took 26 minutes while the code below took 19 minutes:
# Creating a sample data frame (California as example) = CA
df = df_cleaned[df_cleaned['foad']=='CA'].copy()
t0 = time.time()
matches = match_strings(df['full address'],max_n_matches = 5000, min_similarity = 0.8)
t = time.time()-t0
print(" Ran for:", t, " seconds")
What could be a reason for slower performance?
Another fact is that the actual maximum number of matches for California was around 2000.
Thank you!
Many thanks @iibarant
I've just updated the code once more because of a bug. So kindly try the latest. Yes, try it with exactly the code you suggested.
As a first guess, perhaps the 19 minutes vs. the 26 minutes is due to the additional annotations that accompany the output data, like the 'total similarity score' column, which also take some time to compute. But I'll give it a check.
With regards to your last statement:
Another fact is that the actual maximum number of matches for California was around 2000.
Is that what you expected? Did both methods yield the same number of matches?
@iibarant
Regarding the California example, can you post me the exact call of my code that you used? I'm not sure if you used grouping or not.
@ParticularMiner, one by one:
# Creating a sample data frame (California as example) = CA
df = df_cleaned[df_cleaned['foad']=='CA'].copy()
t1 = time.time()
record_matches = record_linkage(df,
fields_2b_matched_fuzzily=[('full address', 0.8, 3)],
fields_2b_matched_exactly=['foad'],
hierarchical=True,
max_n_matches=5000,
similarity_dtype=np.float32)
best_matches = \
record_matches[
record_matches.index.get_level_values(0) < record_matches.index.get_level_values(1)].sort_values(('', '', 'Total Similarity Score'), ascending=False)
t = time.time()-t1
print("Runtime (sec): ", t)
Runtime (sec): 1571.4641342163086
I believe the grouping sits here: fields_2b_matched_exactly=['foad']
t0 = time.time()
States = df_cleaned['foad'].value_counts().reset_index().rename({'index': 'State','foad': 'N'},axis = 1)
states = list(States['State'])
match_list = []
for state in states:
    df = df_cleaned[df_cleaned['foad'] == state].copy()
    matches = match_strings(df['full address'], max_n_matches=5000, min_similarity=0.8)
    match_list.append(matches)
final_match = pd.concat(match_list, ignore_index=True)
t = time.time()-t0
print("Ran for:", t, " seconds")
Ran for: 35194.9951941967 seconds
@iibarant
Does the following call make it much faster?
# Creating a sample data frame (California as example) = CA
df = df_cleaned[df_cleaned['foad']=='CA'].copy()
t1 = time.time()
record_matches = record_linkage(df,
fields_2b_matched_fuzzily=[('full address', 0.8, 3)],
fields_2b_matched_exactly=None,
hierarchical=True,
max_n_matches=5000,
similarity_dtype=np.float32)
best_matches = \
record_matches[
record_matches.index.get_level_values(0) < record_matches.index.get_level_values(1)].sort_values(('', '', 'Total Similarity Score'), ascending=False)
t = time.time()-t1
print("Runtime (sec): ", t)
If it does, then the grouping is taking a significant amount of time.
@iibarant
With regard to your posting #2: the result you get is going to be incorrect for the reasons I outlined before; that is, similarity values are inconsistent across groups. If you want to do it that way, you would need to use a modified version of StringGrouper that fixes the corpus across groups.
But if it's only the 'California' group, then that's OK.
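For concreteness, a minimal sketch of that fixed-corpus approach, reusing the VariableStringSetStringGrouper class from the code posted earlier (assumed to be available at module level; df_cleaned, 'foad' and 'full address' are your names):

import pandas as pd

def match_within(series, grouper):
    # same idea as the two-argument match_strings helper in the earlier code:
    # keep the grouper's fixed corpus, only swap in the strings to be matched
    grouper._master = series
    grouper = grouper.fit()
    return grouper.get_matches()

# Build ONE grouper on the full column so every state is scored against the same corpus.
sg = VariableStringSetStringGrouper(df_cleaned['full address'],
                                    min_similarity=0.8, max_n_matches=5000)
per_state = []
for state, group in df_cleaned.groupby('foad'):
    if len(group) > 1:
        per_state.append(match_within(group['full address'], sg))
final_matches = pd.concat(per_state, ignore_index=True)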
@ParticularMiner,
Running the code from your last message on the very latest update of the class VariableStringSetStringGrouper(StringGrouper) returns the following error:
Traceback (most recent call last):
File "
File "/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py", line 5294, in sort_values k = self._get_label_or_level_values(by, axis=axis)
File "/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py", line 1563, in _get_label_or_level_values raise KeyError(key)
KeyError: ('', '', 'Total Similarity Score')
Sorry @iibarant
That error is coming from the line after the call to recordLinkage().
I carelessly changed one line, forgetting about the next, since I am not running the code myself. You may remove that next line, since we just want to see the timing.
@ParticularMiner, there is no big difference for California data:
t1 = time.time()
record_matches = record_linkage(df,
fields_2b_matched_fuzzily=[('full address', 0.8, 3)],
fields_2b_matched_exactly=None,
hierarchical=True,
max_n_matches=5000,
similarity_dtype=np.float32)
best_matches = \
record_matches[
record_matches.index.get_level_values(0) < record_matches.index.get_level_values(1)]#.sort_values(('', '', 'Total Similarity Score'), ascending=False)
t = time.time()-t1
print("Runtime (sec): ", t)
Runtime (sec): 1490.1938490867615
record_matches = record_linkage(df,
fields_2b_matched_fuzzily=[('full address', 0.8, 3)],
fields_2b_matched_exactly=['foad'],
hierarchical=True,
max_n_matches=5000,
similarity_dtype=np.float32)
best_matches = \
record_matches[
record_matches.index.get_level_values(0) < record_matches.index.get_level_values(1)]#.sort_values(('', '', 'Total Similarity Score'), ascending=False)
t = time.time()-t1
print("Runtime (sec): ", t)
Runtime (sec): 1550.0441522598267
Thanks @iibarant !
I'll look further into the slowdown in performance.
Hi @iibarant
I'm quite puzzled that you obtain those timings of the order of 20 minutes for only 380 000 records.
Your machine specs are better than mine (if I recall correctly). But my machine is able to process 380 000 rows and obtain more than 400 000 matches in less than 12 mins! Very odd.
(I found some sample company name data (sec__edgar_company_info.csv) on the internet to use for testing at https://www.kaggle.com/dattapiy/sec-edgar-companies-list/version/1)
recordLinkage() took 11.47 minutes
match_strings (from string_grouper) took 11.42 minutes
These are the calls (using latest code):
import random
N = 380000
companies = pd.read_csv('data/sec__edgar_company_info.csv')
companies = companies.loc[random.sample(range(len(companies)), k = N)]
record_matches_some_exact = record_linkage(companies,
fields_2b_matched_fuzzily=[('Company Name', 0.8, 3)],
fields_2b_matched_exactly=None,
hierarchical=True,
max_n_matches=10000,
similarity_dtype=np.float32,
force_corrections=True)
record_matches_some_exact = match_strings(companies['Company Name'], max_n_matches=10000)
Perhaps, you could try that sample data too on your machine to see if you obtain similar times, since I do not have access to your data anyway.
😃
@iibarant
I've included a few log messages in the latest code update to help investigate what is happening on your machine. 😃
@ParticularMiner The company name data set is nice, but it only has a few columns.
Volume testing with multiple columns will be better with this one: https://www.kaggle.com/peopledatalabssf/free-7-million-company-dataset
Cheers Bernd
What a gigantic file @berndnoll ! Thanks.
Hi @ParticularMiner,
I ran the very latest code update on the California data and got Runtime (sec): 1357.2680039405823
Unfortunately I do not see any log messages.
I believe I need to add force_corrections=True
@ParticularMiner,
I ran your latest code on your data (sec__edgar_company_info.csv). The run time is actually faster than you posted:
t1 = time.time()
N = 380000
companies = pd.read_csv(path + 'sec__edgar_company_info.csv')
companies = companies.loc[random.sample(range(len(companies)), k = N)]
record_matches_some_exact = record_linkage(companies,
fields_2b_matched_fuzzily=[('Company Name', 0.8, 3)],
fields_2b_matched_exactly=None,
hierarchical=True,
max_n_matches=10000,
similarity_dtype=np.float32,
force_corrections=True)
t = time.time()-t1
print("Runtime (sec): ", t) Warning: Failure matching data partitions: (0, 380000) vs (0, 380000) Dataset may be too large. Now attempting to split data into two smaller parts ... Success matching data partitions: (0, 380000) vs (0, 380000) Runtime (sec): 272.6912159919739
@iibarant
CALIFORNIA DATA:
Ok. At least it's gone down from 26 mins to 23 mins. The log messages are supposed to appear during the handling of an OverflowError (if it occurs). This means that the slowdown in performance is not due to the splitting of the dataset; that is, the dataset was not split in this call.
Never mind the parameter force_corrections; it's simply there to check whether forcing symmetry in the matches takes a chunk of the time. But it doesn't seem to take much time.
I still wonder why the runtime is not closer to the 19 minutes which you got with string_grouper's match_strings().
Did you notice my runs on a similar dataset? I'd be interested to know what your timings are for the company dataset I pointed out to you, especially if your runtimes for record_linkage() and match_strings() are very different. My runtimes for the two functions were hardly different: ~11 minutes each.
@iibarant
COMPANY DATA
4.5 minutes? That's ridiculously fast!
So it seems the runtimes are very much data-dependent. How would you describe the California data then? Are they very long strings? Or perhaps many of them are similar to each other?
Excellent work here. Great performance. Can you please give some advice on how to achieve deduplication for multiple fields, some fuzzy, some 'hard' matches?
Like this one does: https://recordlinkage.readthedocs.io/en/latest/notebooks/data_deduplication.html
compare_cl = recordlinkage.Compare()
compare_cl.exact('given_name', 'given_name', label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('suburb', 'suburb', label='suburb')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
features = compare_cl.compute(pairs, dfA)
Thank you!