berndnoll opened this issue 3 years ago
@iibarant
Another thing I find puzzling is that your machine DOES perform the split. Mine DOES NOT for the same call! I'm beginning to wonder if Apple Mac (your OS if I remember correctly) has a different size specification for the int datatype (which is the cause of the OverflowError).
@ParticularMiner,
The record_linkage call took 4 minutes 32 seconds. The match_strings call threw "OverflowError: value too large to convert to int" with the current setting; after I reduced max_n_matches to 5000, the runtime was 11 minutes and 51 seconds.
Regarding the data, there are a lot of duplicates and the strings are street addresses (without state) with all common abbreviations, like
Avenue --> AVE, Boulevard --> BLVD, Center --> CTR, Circle --> CIR, Court --> CT, Drive --> DR, Expressway --> EXPY, Heights --> HTS, Highway --> HWY, etc.
The indexes are 40-character strings of digits and letters. Could that be a reason for the slow performance?
@iibarant
Wow. Well your machine is performing as expected, I guess. Better than expected in fact.
And it looks like I've solved the OverflowError issue, right? Now I'm also beginning to wonder if splitting is sometimes actually faster than not splitting ...
Are you satisfied with the performance? If so, I'll submit record_linkage() to @Bergvca (the maintainer) for inclusion into string_grouper.
I'm still not sure about what happened with the California data. I'll keep it in mind though. But if you have any ideas about it please let me know.
@ParticularMiner,
your solution is great, thanks a lot for your contribution. I am satisfied with the performance and you can certainly submit record_linkage() into string_grouper.
What I see as a possible update is to add weights to the similarity scores when combining them into the 'Total Similarity Score' whenever there is more than one field in 'fields_2b_matched_fuzzily'. So, 'Total Similarity Score' = weight1 × similarity1 + weight2 × similarity2 + ...
I believe @berndnoll wanted to use that approach.
Thank you very much!
@iibarant
Ok. Good. And many thanks to you and @berndnoll for collaborating and sharing data with me despite your professional constraints. Without your input I likely wouldn't have been able to come this far. By the way, do you and @berndnoll work together?
Yes indeed I already saw that @berndnoll had included weights in his code. These can be added relatively easily. Would you like to add them? Typically the sum of all the weights should be 1, right?
@iibarant Yes, I had implemented the weight in my solution. Default weight is 1. That is especially helpful when you have identifiers in the data that indicate clear dupes - e.g. DUNS number.
@ParticularMiner @berndnoll
using weights would be really helpful for fuzzily matching records in data sets with more than one string column, where, for example, we consider matching phone numbers less important than zip codes (or vice versa). When sum(weights) = 1, it would be easier to interpret the results.
Thank you!
@iibarant
FYI, out of curiosity, I just re-ran the call on the company data while forcing the program to split the data into two parts like your machine did. And I obtained a total runtime of 8 min 7.22 s (down from 11 min 53.76 s without splitting). So indeed, it seems that if the dataset is large enough, splitting can prove to be a faster and more efficient alternative. Still, my machine couldn't beat your machine's time of 4.5 minutes, though. 😃
BUT ... Force-splitting down even further into four parts brought the runtime down to 5 min 12.54 s !
Afterwards, further splitting into eight parts increased the runtime to 7 min 6.21 s. So there's a limit.
I feel like there might be a relationship to our earlier case of splitting by grouping by another field. But that is research for another time.
@all Remarkable collaboration, I love this thread. Re: weights, I am not sure the sum of the weights has to be 1. Having a default weight of 1 per rule gives more opportunity to put emphasis on identifying attributes.
@berndnoll @iibarant
How about this: the user may specify any number > 0 for each weight. The sum of the weights need not be 1. However, internally, the weights are scaled (or divided) by their sum so that we have:
<weighted mean similarity score> = (Σ_field weight_field × similarity_field) / (Σ_field weight_field)

where Σ_field means "sum over fields".
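For instance, a quick sanity check of that formula in plain numpy (the weights and similarity values below are made up purely for illustration):

import numpy as np

# hypothetical weights (any positive numbers) and per-field similarity scores for one record pair
weights = np.array([1.0, 1.0, 2.0])
similarities = np.array([1.0, 0.94, 1.0])

# the weights are scaled internally by their sum, so they need not add up to 1
weighted_mean = (weights * similarities).sum() / weights.sum()
print(round(weighted_mean, 4))  # 0.985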
Is this satisfactory?
The code has now been updated with the above feature.
I've also cleaned up the clutter to improve the speed a bit. @iibarant, could you please try this new code on your California data again? I want to know whether these improvements made any difference to the runtimes on your machine.
See updated code below:
import pandas as pd
import numpy as np
import functools
from string_grouper import StringGrouper
from scipy.sparse.csr import csr_matrix
from topn import awesome_topn
def record_linkage(data_frame,
fields_2b_matched_fuzzily,
fields_2b_matched_exactly=None,
hierarchical=True,
max_n_matches=None,
similarity_dtype=np.float32,
force_corrections=True,
n_blocks=None
):
'''
Function that combines similarity-matching results of several fields of a DataFrame and returns them in another
DataFrame
:param data_frame: pandas.DataFrame of strings.
:param fields_2b_matched_fuzzily: List of tuples. Each tuple is a quadruple
(<field name>, <threshold>, <ngram_size>, <weight>).
<field name> is the name of a field in data_frame which is to be matched using a threshold similarity score of
<threshold> and an ngram size of <ngram_size>. <weight> is a number that defines the **relative** importance
of the field to other fields -- the field's contribution to the total similarity will be weighted by this
number.
:param fields_2b_matched_exactly: List of tuples. Each tuple is a pair (<field name>, <weight>). <field name> is
the name of a field in data_frame which is to be matched exactly. <weight> has the same meaning as in
parameter fields_2b_matched_fuzzily. Defaults to None.
:param hierarchical: bool. Determines if the output DataFrame will have a hierarchical column-structure (True) or
not (False). Defaults to True.
:param max_n_matches: int. Maximum number of matches allowed per string.
:param similarity_dtype: numpy type. Either np.float32 (the default) or np.float64. A value of np.float32 allows
for less memory overhead during computation but less numerical precision, while np.float64 allows for greater
numerical precision but a larger memory overhead.
:param force_corrections: bool. Specifies whether corrections should be made to the results to account for
symmetry, thus removing certain errors due to lack of numerical precision. Defaults to True.
:param n_blocks: Tuple[(int, int)]. This parameter is provided to boost performance, if possible, by splitting the
dataset into n_blocks[0] blocks for the left operand (of the "comparison operator") and into n_blocks[1] blocks
for the right operand before performing the string-comparisons blockwise.
:return: pandas.DataFrame containing matching results.
'''
def get_field_names(fields_tuples):
return list(list(zip(*fields_tuples))[0])
def get_exact_weights(exact_field_tuples):
return list(list(zip(*exact_field_tuples))[1])
def get_fuzzy_weights(fuzzy_field_tuples):
return list(list(zip(*fuzzy_field_tuples))[3])
def get_field_value_pairs(field_names, values):
return list(zip(field_names, values))
def get_field_stringGrouper_pairs(fuzzy_field_tuples, string_groupers):
return [(tupl[0], ) + (x, ) for tupl, x in list(zip(fuzzy_field_tuples, string_groupers))]
def get_index_names(df):
empty_df = df.iloc[0:0]
return [field for field in empty_df.reset_index().columns \
if field not in empty_df.columns]
def prepend(strings, prefix):
return [f'{prefix}{i}' for i in strings]
def horizontal_linkage(df,
match_indexes,
fuzzy_field_grouper_pairs,
fuzzy_field_names,
fuzzy_field_weights,
exact_field_value_pairs=None,
exact_field_weights=None,
hierarchical=True,
force_corrections=False,
n_blocks=None):
horizontal_merger_list = []
for field, sg in fuzzy_field_grouper_pairs:
matches = match_strings(df[field], sg, force_corrections=force_corrections, n_blocks=n_blocks)
matches.set_index(match_indexes, inplace=True)
matches = weed_out_trivial_matches(matches)
if hierarchical:
merger = matches[[f'left_{field}', 'similarity', f'right_{field}']]
merger.rename(columns={f'left_{field}': 'left', f'right_{field}': 'right'}, inplace=True)
else:
merger = matches[['similarity']]
merger.rename(columns={'similarity': field}, inplace=True)
horizontal_merger_list += [merger]
key_list = None if not hierarchical else fuzzy_field_names
merged_df = pd.concat(horizontal_merger_list, axis=1, keys=key_list, join='inner')
title_exact = 'Exactly Matched Fields'
title_fuzzy = 'Fuzzily Matched Fields'
if exact_field_value_pairs:
exact_df = build_column_precursor_to(merged_df, exact_field_value_pairs)
merged_df = pd.concat([exact_df, merged_df],
axis=1,
keys=[title_exact, title_fuzzy],
join='inner')
totals = compute_totals(merged_df,
fuzzy_field_names,
fuzzy_field_weights,
exact_field_weights,
hierarchical,
exact_field_value_pairs,
title_fuzzy)
return pd.concat([totals, merged_df], axis=1)
def weed_out_trivial_matches(matches):
num_indexes = matches.index.nlevels//2
return matches[
functools.reduce(
lambda a, b: a | b,
[
(matches.index.get_level_values(i) != matches.index.get_level_values(i + num_indexes)) \
for i in range(num_indexes)
]
)
]
def build_column_precursor_to(df, exact_field_value_pairs):
exact_df = df.iloc[:, 0:0]
for field_name, field_value in exact_field_value_pairs:
exact_df[field_name] = field_value
return exact_df
def compute_totals(merged_df,
fuzzy_field_names,
fuzzy_field_weights,
exact_field_weights,
hierarchical,
exact_field_value_pairs,
title_fuzzy):
title_total = 'Weighted Mean Similarity Score'
fuzzy_weight_array = np.array(fuzzy_field_weights, dtype=float)
if exact_field_value_pairs:
exact_weight_array = np.array(exact_field_weights, dtype=float)
total = fuzzy_weight_array.sum() + exact_weight_array.sum()
fuzzy_weight_array /= total
exact_field_contribution = (exact_weight_array/total).sum()
if hierarchical:
totals = merged_df[
[(title_fuzzy, field, 'similarity') for field in fuzzy_field_names]
].dot(fuzzy_weight_array) + exact_field_contribution
totals = pd.concat([totals], axis=1, keys=[('', '', title_total)])
else:
totals = merged_df[
[(title_fuzzy, field) for field in fuzzy_field_names]
].dot(fuzzy_weight_array) + exact_field_contribution
totals = pd.concat([totals], axis=1, keys=[('', title_total)])
else:
fuzzy_weight_array /= fuzzy_weight_array.sum()
if hierarchical:
totals = merged_df[
[(field, 'similarity') for field in fuzzy_field_names]
].dot(fuzzy_weight_array)
totals = pd.concat([totals], axis=1, keys=[('', title_total)])
else:
totals = merged_df[fuzzy_field_names].dot(fuzzy_weight_array)
totals.rename(title_total, inplace=True)
return totals
index_name_list = get_index_names(data_frame)
match_indexes = prepend(index_name_list, prefix='left_') + prepend(index_name_list, prefix='right_')
# set the corpus for each fuzzy field
stringGroupers = []
for field, threshold, ngram_sz, _ in fields_2b_matched_fuzzily:
stringGroupers += [FixedCorpusStringGrouper(data_frame[field],
min_similarity=threshold,
ngram_size=ngram_sz,
max_n_matches=max_n_matches,
tfidf_matrix_dtype=similarity_dtype)]
fuzzy_field_grouper_pairs = get_field_stringGrouper_pairs(fields_2b_matched_fuzzily, stringGroupers)
fuzzy_field_names = get_field_names(fields_2b_matched_fuzzily)
fuzzy_field_weights = get_fuzzy_weights(fields_2b_matched_fuzzily)
if not fields_2b_matched_exactly:
return horizontal_linkage(data_frame,
match_indexes,
fuzzy_field_grouper_pairs,
fuzzy_field_names,
fuzzy_field_weights,
hierarchical=hierarchical,
force_corrections=force_corrections,
n_blocks=n_blocks)
else:
exact_field_names = get_field_names(fields_2b_matched_exactly)
exact_field_weights = get_exact_weights(fields_2b_matched_exactly)
groups = data_frame.groupby(exact_field_names)
vertical_merger_list = []
for group_value, group_df in groups:
values_matched_exactly = [group_value] if not isinstance(group_value, tuple) else list(group_value)
exact_field_value_pairs = get_field_value_pairs(exact_field_names, values_matched_exactly)
vertical_merger_list += [horizontal_linkage(group_df,
match_indexes,
fuzzy_field_grouper_pairs,
fuzzy_field_names,
fuzzy_field_weights,
exact_field_value_pairs,
exact_field_weights,
hierarchical=hierarchical,
force_corrections=force_corrections,
n_blocks=n_blocks)]
return pd.concat(vertical_merger_list)
def match_strings(master, sg, force_corrections=True, n_blocks=None):
sg._master = master
sg = sg.fit(force_corrections=force_corrections, n_blocks=n_blocks)
out = sg.get_matches()
sg._matches_list = None
sg._master = None
return out
class FixedCorpusStringGrouper(StringGrouper):
# This class enables StringGrouper to apply matching on different successive master datasets without
# resetting the underlying corpus, as long as each master dataset is contained in the corpus.
# It inherits from StringGrouper, overriding __init__(), _get_tf_idf_matrices() and fit(),
# and adding the blockwise-fitting methods fit_blockwise_manual() and fit_blockwise_auto().
def __init__(self, corpus, **kwargs):
# initializer is the same as before except that it now also sets the corpus
super().__init__(corpus, **kwargs)
self._vectorizer = self._fit_vectorizer()
self._master = None
def _get_tf_idf_matrices(self, left_partition, right_partition):
# _get_tf_idf_matrices() now no longer sets the corpus but rather builds the matrices from
# the existing corpus
# Build the two matrices
left_matrix = self._vectorizer.transform(self._master.iloc[slice(*left_partition)])
right_matrix = self._vectorizer.transform(self._master.iloc[slice(*right_partition)])
return left_matrix, right_matrix
def fit_blockwise_manual(self, n_blocks=(1, 1)):
def divide_by(n):
# mark blocks
equal_block_sz = len(self._master)//n
block_rem = len(self._master)%n
block_ranges = []
start = 0
for block_id in range(n):
block_ranges += [(start, start + equal_block_sz + (1 if block_id < block_rem else 0))]
start = block_ranges[-1][1]
return block_ranges
block_ranges_left = divide_by(n_blocks[0])
block_ranges_right = divide_by(n_blocks[1])
max_n_matches = self._max_n_matches
for left_block in block_ranges_left:
for right_block in block_ranges_right:
self._max_n_matches = min(right_block[1] - right_block[0], max_n_matches)
master_matrix, duplicate_matrix = self._get_tf_idf_matrices(left_block, right_block)
# Calculate the matches using the cosine similarity
matches, self._true_max_n_matches = self._build_matches(master_matrix, duplicate_matrix)
# build match-lists from matrix
r, c = matches.nonzero()
d = matches.data
(self._r, self._c, self._d) = (
np.append(self._r, r + left_block[0]),
np.append(self._c, c + right_block[0]),
np.append(self._d, d)
)
self._max_n_matches = max_n_matches
return True
def fit_blockwise_auto(self, left_partition=(None, None), right_partition=(None, None)):
"""Builds the _matches list which contains string matches indices and similarity"""
# fit() has been extended here to enable StringGrouper to handle large datasets which otherwise
# would lead to an OverflowError
def begin(partition):
return partition[0] if partition[0] else 0
def end(partition):
return partition[1] if partition[1] else len(self._master)
def explicit(partition):
return begin(partition), end(partition)
master_matrix, duplicate_matrix = self._get_tf_idf_matrices(left_partition, right_partition)
try:
# Calculate the matches using the cosine similarity
matches, self._true_max_n_matches = self._build_matches(master_matrix, duplicate_matrix)
except OverflowError:
master_matrix = None
duplicate_matrix = None
max_n_matches = self._max_n_matches
def split_partition(partition):
data_begin = begin(partition)
data_end = end(partition)
data_mid = data_begin + (data_end - data_begin)//2
return [(data_begin, data_mid), (data_mid, data_end)]
left_halves = split_partition(left_partition)
right_halves = split_partition(right_partition)
for lhalf in left_halves:
for rhalf in right_halves:
self._max_n_matches = min(rhalf[1] - rhalf[0], max_n_matches)
self.fit_blockwise_auto(left_partition=lhalf, right_partition=rhalf)
self._max_n_matches = max_n_matches
return True
# build match-lists from matrix
r, c = matches.nonzero()
d = matches.data
(self._r, self._c, self._d) = (
np.append(self._r, r + begin(left_partition)),
np.append(self._c, c + begin(right_partition)),
np.append(self._d, d)
)
return False
def fit(self, n_blocks=None, force_corrections=True):
# initialize match-lists
self._r = np.array([], dtype=np.int64)
self._c = np.array([], dtype=np.int64)
self._d = np.array([], dtype=self._config.tfidf_matrix_dtype)
self._matches_list = pd.DataFrame()
# do the matching
if n_blocks:
split_occurred = self.fit_blockwise_manual(n_blocks=n_blocks)
else:
split_occurred = self.fit_blockwise_auto()
# self._matches_list = pd.DataFrame({'master_side': self._r, 'dupe_side': self._c, 'similarity': self._d})
if split_occurred:
# trim the matches to max_n_matches
self._r, self._c, self._d = awesome_topn(self._r,
self._c,
self._d,
self._max_n_matches,
self._config.number_of_processes)
# force symmetries to be respected?
if force_corrections:
matrix_sz = len(self._master)
matches = csr_matrix(
(
self._d, (self._r, self._c)
),
shape=(matrix_sz, matrix_sz)
)
# release memory
self._r = self._c = self._d = np.array([])
# convert to lil format for best efficiency when setting matrix-elements
matches = matches.tolil()
# matrix diagonal elements must be exactly 1 (numerical precision errors introduced by
# floating-point computations in awesome_cossim_topn sometimes lead to unexpected results)
matches = StringGrouper._fix_diagonal(matches)
# the list of matches must be symmetric! (i.e., if A != B and A matches B; then B matches A)
matches = StringGrouper._symmetrize_matrix(matches)
matches = matches.tocsr()
self._matches_list = self._get_matches_list(matches)
else:
self._matches_list = pd.DataFrame(
{
key: value for key, value in zip(
('master_side', 'dupes_side', 'similarity'), (self._r, self._c, self._d)
)
}
)
# release memory
self._r = self._c = self._d = np.array([])
self.is_build = True
return self
Use @berndnoll 's sample data:
inputfilename = 'data/us-cities-real-estate-sample-zenrows.csv'
df = pd.read_csv(inputfilename, dtype=str)
Examine data to determine which columns to use for test (that is, use only columns without null data):
for field in df.columns:
if len(df[field].unique()) > 1 and not df[field].isna().values.any():
print(f'{field} : {len(df[field].unique())}')
zpid : 10000
id : 10000
imgSrc : 9940
detailUrl : 10000
statusText : 24
address : 10000
addressState : 51
addressZipcode : 6446
isUndisclosedAddress : 2
isZillowOwned : 2
has3DModel : 2
hasVideo : 2
isFeaturedListing : 2
list : 2
Set the index:
df = df[['zpid', 'statusText', 'addressZipcode', 'addressState', 'address', 'imgSrc', 'hasVideo', 'addressCity', 'addressStreet']]
df.set_index('zpid', inplace=True)
Note that grouping by those fields that have very few unique values, such as 'addressState' (51 unique values) and 'hasVideo' (2 unique values), can lead to a significant performance boost. On the other hand, grouping by 'addressZipcode' (6446 unique values) degrades performance.
The following call took 8.11 seconds to run:
record_matches_some_exact = record_linkage(df.iloc[:, :],
fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
('address', 0.8, 3, 1),
('addressZipcode', 0.999999, 3, 2)],
fields_2b_matched_exactly=[('addressState', 4),
('hasVideo', 1)],
hierarchical=True,
max_n_matches=10000,
similarity_dtype=np.float32,
force_corrections=False)
interesting_record_matches_some_exact = \
record_matches_some_exact[
record_matches_some_exact.index.get_level_values(0) < record_matches_some_exact.index.get_level_values(1)
].sort_values(('', '', 'Weighted Mean Similarity Score'), ascending=False)
interesting_record_matches_some_exact
Exactly Matched Fields | Fuzzily Matched Fields | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
addressState | hasVideo | statusText | address | addressZipcode | |||||||||
Weighted Mean Similarity Score | left | similarity | right | left | similarity | right | left | similarity | right | ||||
left_zpid | right_zpid | ||||||||||||
2077667803 | 2077679643 | 1.000000 | WY | false | Lot / Land for sale | 1.0 | Lot / Land for sale | Jlda Minor Sub Division LOT C, Buffalo, WY 82834 | 1.000000 | Jlda Minor Subdivision LOT C, Buffalo, WY 82834 | 82834 | 1.0 | 82834 |
2075244057 | 2075358943 | 0.997100 | OH | false | Lot / Land for sale | 1.0 | Lot / Land for sale | 0 Township Road 118, Kimbolton, OH 43749 | 0.973904 | Township Road 118, Kimbolton, OH 43749 | 43749 | 1.0 | 43749 |
2077676622 | 2077676809 | 0.993867 | ND | false | Lot / Land for sale | 1.0 | Lot / Land for sale | 4 55th St SE, Christine, ND 58015 | 0.944802 | 2 55th St SE, Christine, ND 58015 | 58015 | 1.0 | 58015 |
2077093064 | 2078843498 | 0.993328 | SD | false | Lot / Land for sale | 1.0 | Lot / Land for sale | 17 Sidney Park Rd, Custer, SD 57730 | 0.939948 | Sidney Park Rd, Custer, SD 57730 | 57730 | 1.0 | 57730 |
150690392 | 2076123604 | 0.992909 | NJ | false | Lot / Land for sale | 1.0 | Lot / Land for sale | 5 Windsor Ln, Gladstone, NJ 07934 | 0.936180 | 0 Windsor Ln, Gladstone, NJ 07934 | 7934 | 1.0 | 7934 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2070837516 | 2072047318 | 0.978032 | HI | false | New construction | 1.0 | New construction | D12C Plan, Kaikoi at Hoopili | 0.802290 | D12B Plan, Kaikoi at Hoopili | 96706 | 1.0 | 96706 |
305578084 | 90035758 | 0.977991 | MO | false | Condo for sale | 1.0 | Condo for sale | 210 N 17th St UNIT 203, Saint Louis, MO 63103 | 0.801920 | 210 N 17th St UNIT 1202, Saint Louis, MO 63103 | 63103 | 1.0 | 63103 |
2071195670 | 88086529 | 0.977983 | MI | false | Condo for sale | 1.0 | Condo for sale | 6533 E Jefferson Ave APT 426, Detroit, MI 48207 | 0.801844 | 6533 E Jefferson Ave APT 102E, Detroit, MI 48207 | 48207 | 1.0 | 48207 |
247263033 | 247263136 | 0.977941 | IA | false | New construction | 1.0 | New construction | 1 University Way #511, Iowa City, IA 52246 | 0.801474 | 1 University Way #503, Iowa City, IA 52246 | 52246 | 1.0 | 52246 |
2083656138 | 2083656146 | 0.977873 | IN | false | Condo for sale | 1.0 | Condo for sale | 3789 S Anderson Dr, Terre Haute, IN 47803 | 0.800855 | 3776 S Anderson Dr, Terre Haute, IN 47803 | 47803 | 1.0 | 47803 |
94 rows × 12 columns
The results above can also be obtained in another way. However, it can take much longer to compute in cases where some fuzzily matched fields have very few unique values.
The following call took 3 minutes 33.52 seconds to run:
record_matches_none_exact = record_linkage(df.iloc[:, :],
fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
('address', 0.8, 3, 1),
('addressZipcode', 0.999999, 3, 2),
('hasVideo', 0.999999, 3, 1),
('addressState', 0.999999, 2, 4)],
hierarchical=True,
max_n_matches=10000,
similarity_dtype=np.float32,
force_corrections=False)
interesting_record_matches_none_exact = \
record_matches_none_exact[
record_matches_none_exact.index.get_level_values(0) < record_matches_none_exact.index.get_level_values(1)
].sort_values(('', 'Weighted Mean Similarity Score'), ascending=False)
interesting_record_matches_none_exact
statusText | address | addressZipcode | hasVideo | addressState | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Weighted Mean Similarity Score | left | similarity | right | left | similarity | right | left | similarity | right | left | similarity | right | left | similarity | right | ||
left_zpid | right_zpid | ||||||||||||||||
2077667803 | 2077679643 | 1.000000 | Lot / Land for sale | 1.0 | Lot / Land for sale | Jlda Minor Sub Division LOT C, Buffalo, WY 82834 | 1.000000 | Jlda Minor Subdivision LOT C, Buffalo, WY 82834 | 82834 | 1.0 | 82834 | false | 1.0 | false | WY | 1.0 | WY |
2075244057 | 2075358943 | 0.997100 | Lot / Land for sale | 1.0 | Lot / Land for sale | 0 Township Road 118, Kimbolton, OH 43749 | 0.973904 | Township Road 118, Kimbolton, OH 43749 | 43749 | 1.0 | 43749 | false | 1.0 | false | OH | 1.0 | OH |
2077676622 | 2077676809 | 0.993867 | Lot / Land for sale | 1.0 | Lot / Land for sale | 4 55th St SE, Christine, ND 58015 | 0.944802 | 2 55th St SE, Christine, ND 58015 | 58015 | 1.0 | 58015 | false | 1.0 | false | ND | 1.0 | ND |
2077093064 | 2078843498 | 0.993328 | Lot / Land for sale | 1.0 | Lot / Land for sale | 17 Sidney Park Rd, Custer, SD 57730 | 0.939948 | Sidney Park Rd, Custer, SD 57730 | 57730 | 1.0 | 57730 | false | 1.0 | false | SD | 1.0 | SD |
150690392 | 2076123604 | 0.992909 | Lot / Land for sale | 1.0 | Lot / Land for sale | 5 Windsor Ln, Gladstone, NJ 07934 | 0.936180 | 0 Windsor Ln, Gladstone, NJ 07934 | 7934 | 1.0 | 7934 | false | 1.0 | false | NJ | 1.0 | NJ |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2070837516 | 2072047318 | 0.978032 | New construction | 1.0 | New construction | D12C Plan, Kaikoi at Hoopili | 0.802290 | D12B Plan, Kaikoi at Hoopili | 96706 | 1.0 | 96706 | false | 1.0 | false | HI | 1.0 | HI |
305578084 | 90035758 | 0.977991 | Condo for sale | 1.0 | Condo for sale | 210 N 17th St UNIT 203, Saint Louis, MO 63103 | 0.801920 | 210 N 17th St UNIT 1202, Saint Louis, MO 63103 | 63103 | 1.0 | 63103 | false | 1.0 | false | MO | 1.0 | MO |
2071195670 | 88086529 | 0.977983 | Condo for sale | 1.0 | Condo for sale | 6533 E Jefferson Ave APT 426, Detroit, MI 48207 | 0.801844 | 6533 E Jefferson Ave APT 102E, Detroit, MI 48207 | 48207 | 1.0 | 48207 | false | 1.0 | false | MI | 1.0 | MI |
247263033 | 247263136 | 0.977941 | New construction | 1.0 | New construction | 1 University Way #511, Iowa City, IA 52246 | 0.801474 | 1 University Way #503, Iowa City, IA 52246 | 52246 | 1.0 | 52246 | false | 1.0 | false | IA | 1.0 | IA |
2083656138 | 2083656146 | 0.977873 | Condo for sale | 1.0 | Condo for sale | 3789 S Anderson Dr, Terre Haute, IN 47803 | 0.800855 | 3776 S Anderson Dr, Terre Haute, IN 47803 | 47803 | 1.0 | 47803 | false | 1.0 | false | IN | 1.0 | IN |
94 rows × 16 columns
One may choose to remove the field-values and output single-level column-headings by setting hierarchical to False:
record_matches_none_exact = record_linkage(df.iloc[:, :],
fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
('address', 0.8, 3, 1),
('addressZipcode', 0.999999, 3, 2),
('hasVideo', 0.999999, 3, 1),
('addressState', 0.999999, 2, 4)],
hierarchical=False,
max_n_matches=10000,
similarity_dtype=np.float32,
force_corrections=False)
c:\users\heamu\.virtualenvs\distribution-test-bed\lib\site-packages\pandas\core\frame.py:4441: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return super().rename(
interesting_record_matches_none_exact = \
record_matches_none_exact[
record_matches_none_exact.index.get_level_values(0) < record_matches_none_exact.index.get_level_values(1)
].sort_values(('Weighted Mean Similarity Score'), ascending=False)
interesting_record_matches_none_exact
Weighted Mean Similarity Score | statusText | address | addressZipcode | hasVideo | addressState | ||
---|---|---|---|---|---|---|---|
left_zpid | right_zpid | ||||||
2077667803 | 2077679643 | 1.000000 | 1.0 | 1.000000 | 1.0 | 1.0 | 1.0 |
2075244057 | 2075358943 | 0.997100 | 1.0 | 0.973904 | 1.0 | 1.0 | 1.0 |
2077676622 | 2077676809 | 0.993867 | 1.0 | 0.944802 | 1.0 | 1.0 | 1.0 |
2077093064 | 2078843498 | 0.993328 | 1.0 | 0.939948 | 1.0 | 1.0 | 1.0 |
150690392 | 2076123604 | 0.992909 | 1.0 | 0.936180 | 1.0 | 1.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... |
2070837516 | 2072047318 | 0.978032 | 1.0 | 0.802290 | 1.0 | 1.0 | 1.0 |
305578084 | 90035758 | 0.977991 | 1.0 | 0.801920 | 1.0 | 1.0 | 1.0 |
2071195670 | 88086529 | 0.977983 | 1.0 | 0.801844 | 1.0 | 1.0 | 1.0 |
247263033 | 247263136 | 0.977941 | 1.0 | 0.801474 | 1.0 | 1.0 | 1.0 |
2083656138 | 2083656146 | 0.977873 | 1.0 | 0.800855 | 1.0 | 1.0 | 1.0 |
94 rows × 6 columns
@berndnoll @iibarant
record_linkage() now has another optional parameter, n_blocks, for you to play with and attempt to improve performance (see latest update). It denotes the number of 'blocks' into which record_linkage() will divide your dataset before performing the string comparison.
The discovery yesterday was that there exists an optimum number of blocks which gives the fastest speed of comparison. See plot below:

[Plot: record_linkage() run-time vs the number of blocks (#blocks) into which the dataset (380 000 strings from sec__edgar_company_info.csv) was split before performing the string comparison.]

Hi @berndnoll @iibarant
Recall that last week I was able, on my computer, to shave the run-time from ≈12 minutes to ≈4 minutes for a call of record_linkage() on 380 000 records of the sample company name data (which is essentially the same as using match_strings() from string_grouper)?
Well, I have just extended last week's discovery and brought down that runtime further to ≈3 minutes!
The trick here is the manner in which the splitting is done, which I will try to explain below.
String comparison, as implemented by string_grouper, is essentially matrix multiplication. Your DataFrame of strings is converted (tokenized) into a matrix. Then that matrix is multiplied by itself.
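In code, the idea looks roughly like this (a minimal sketch using scikit-learn's character-n-gram TF-IDF rather than string_grouper's actual internals):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

strings = pd.Series(['123 Main St', '123 Main Street', '45 Oak Ave'])

# tokenize each string into character 3-grams and build a sparse TF-IDF matrix
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
M = vectorizer.fit_transform(strings)   # shape: (number of strings, number of distinct 3-grams)

# TF-IDF rows are L2-normalised, so M times its own transpose gives all pairwise cosine similarities
similarity_matrix = M * M.T             # sparse (number of strings, number of strings)
print(similarity_matrix.toarray().round(2))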
[Illustration: multiplication of two matrices M and D.]
The discovery we made last week is that when the matrix (or DataFrame) is very large, the computer proceeds quite slowly with the multiplication (apparently due to the RAM being too full). Even @iibarant's powerful computer spat out an OverflowError.
So we decided to divide the DataFrame into smaller chunks (or blocks) and multiply the chunks one pair at a time instead to get the same result. [Illustration: the same matrix multiplication performed blockwise.]
But surprise ... the run-time of the process was drastically reduced as a result (from ≈12 minutes to ≈4 minutes when the DataFrame was divided into 4 blocks, that is, 4 blocks on the left × 4 on the right).
Encouraged by this discovery, I explored the block number space a bit further by deciding not to split both right and left matrix operands into the same number of blocks. I consequently discovered that if the left operand is not split but the right operand is, then even more gains in speed can be made.
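For illustration, here is a minimal sketch of that blockwise idea on plain scipy sparse matrices (the sizes, density and block counts are arbitrary stand-ins; the real implementation is fit_blockwise_manual() in the code above):

import numpy as np
from scipy.sparse import random as sparse_random

# stand-in for the tokenized string matrix (rows = strings, columns = n-grams)
M = sparse_random(1000, 500, density=0.01, format='csr', random_state=0)

def block_ranges(n_rows, n_blocks):
    # split [0, n_rows) into n_blocks roughly equal row-ranges
    bounds = np.linspace(0, n_rows, n_blocks + 1, dtype=int)
    return list(zip(bounds[:-1], bounds[1:]))

# the full product computed in one go ...
full = (M @ M.T).toarray()

# ... versus the same product assembled block-pair by block-pair (1 left block x 6 right blocks)
result = np.zeros((M.shape[0], M.shape[0]))
for l_begin, l_end in block_ranges(M.shape[0], 1):
    for r_begin, r_end in block_ranges(M.shape[0], 6):
        result[l_begin:l_end, r_begin:r_end] = (M[l_begin:l_end] @ M[r_begin:r_end].T).toarray()

assert np.allclose(full, result)   # identical result, just computed piecewise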
Here are some plots of the results of experiments that I performed:
[Plot: record_linkage() run-time vs the number of blocks (#blocks) into which the left matrix-operand of the dataset (380 000 strings from sec__edgar_company_info.csv) was split before performing the string comparison. As shown in the legend, each curve corresponds to the number of blocks (#nblocks) into which the right matrix-operand was split.]

From the plot above, it can be seen that the optimum split-configuration (run-time ≈3 minutes) is when the left operand is not split (#blocks = 1) and the right operand is split into six blocks (#nblocks = 6).
Here is an example of how the latest update of record_linkage() may be used with the just-described "right-left splitting":
record_matches_some_exact = record_linkage(companies,
fields_2b_matched_fuzzily=[('Company Name', 0.8, 3, 1)],
fields_2b_matched_exactly=None,
hierarchical=True,
max_n_matches=10000,
similarity_dtype=np.float32,
force_corrections=True,
n_blocks=(1, 6))
@ParticularMiner, this is great. I didn't have a chance to test the previous update, but I will try this one.
@iibarant @berndnoll
By the way, to run the latest code you first need to install topn from PyPI. The installation command is as usual:
pip install topn
Sorry about that, @iibarant.
Then use the previous version for now, which you can get from the same comment where my code is. But click on the "edited" link at the top of the comment and choose the previous edit.
@iibarant
You posted two screenshots of the log of your topn installation error: one from the beginning and one from the end. The actual error that caused all this, somewhere in between, is missing. I'd appreciate it if you would send me what's missing.
I wrote topn and posted it to PyPI; it installs fine on my Windows laptop. So I'll need to see the missing errors to determine why your computer is unable to install it as is.
Thanks @iibarant
Looks like your computer does not use a C++14 compiler, which is the reason for the error. Never mind. I'll rewrite the code to something more backward compatible.
@iibarant
Please try installing topn again. I think it should go through now.
@ParticularMiner, I was able to install now, thank you. I deleted the log messages as they are not relevant anymore.
Sure. No problem.
@ParticularMiner, I started to run 3.5 million records with the following code and got a caveat warning:
@iibarant
I get that too. So far I've been ignoring the warning.
I don't condone it, but you could just suppress all warnings with this:
import warnings
warnings.filterwarnings("ignore")
placed at the beginning of your program.
@ParticularMiner, 1 hour 19 minutes is really good for the size of the dataframe.
@iibarant
Which value of n_blocks did you use? Or did you leave it up to the machine to choose?
Hi, I used the code shown in the previous comment, so 1 and 6.
Thank you!
@iibarant
Ok. Thanks. I agree that it doesn't look bad at all. It could have easily been >6 hours especially if you consider that theoretically, matching-time scales as the square of the DataFrame size.
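As a rough back-of-the-envelope check of that quadratic-scaling claim (using the ≈12-minute unsplit run-time for 380 000 records quoted earlier in this thread; purely illustrative):

baseline_minutes, baseline_rows = 12, 380_000   # unsplit run on the company-name sample
rows = 3_500_000                                # size of the new dataset

estimated_minutes = baseline_minutes * (rows / baseline_rows) ** 2
print(f'{estimated_minutes / 60:.0f} hours')    # ~17 hours, so 1 h 19 min is a big improvement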
All the best with your work!
@berndnoll I hope your code is also going smoothly?
@iibarant @berndnoll @Bergvca @justasojourner @ALL
Please see https://github.com/ParticularMiner/red_string_grouper
@ParticularMiner, did you make any changes to the latest version above?
@iibarant
No significant changes. Just changed the name of one of the parameters.
@ParticularMiner, Did you expect such error messages?
 File "
Thank you.
Amazing work! Love red_string_grouper. Is it possible to add the block splitting to string_grouper as well? Or do you think it doesn't make sense?
This is also the way to do it in a distributed environment (e.g. using Spark) - you send the different blocks to different machines and then gather and merge the results. I never thought it would improve performance on a single machine though.
Excellent work here. Great performance. Can you please give some advice on how to achieve deduplication for multiple fields, some fuzzy, some 'hard' matches?
Like this one does: https://recordlinkage.readthedocs.io/en/latest/notebooks/data_deduplication.html
compare_cl = recordlinkage.Compare()
compare_cl.exact('given_name', 'given_name', label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('suburb', 'suburb', label='suburb')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
features = compare_cl.compute(pairs, dfA)
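Could it look something like this with the record_linkage() function from earlier in this thread? (The thresholds, n-gram sizes and weights below are just guesses on my part.)

matches = record_linkage(dfA,
                         fields_2b_matched_fuzzily=[('surname', 0.85, 3, 1),
                                                    ('address_1', 0.85, 3, 1)],
                         fields_2b_matched_exactly=[('given_name', 1),
                                                    ('date_of_birth', 1),
                                                    ('suburb', 1),
                                                    ('state', 1)],
                         hierarchical=True,
                         max_n_matches=10000,
                         similarity_dtype=np.float32,
                         force_corrections=False)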
Thank you!