utterances-bot opened 4 years ago
Hi Max,
This is excellent, but when I edited and ran it I received the following error.
TypeError Traceback (most recent call last)
<ipython-input-40-a232eede1230> in <module>
2 df=Companies,
3 match_func=same_restaurant,
----> 4 max_size=4
5 )
<ipython-input-37-528588a746b8> in find_partitions(df, match_func, max_size, block_by)
51
52 while len(records) >= 1:
---> 53 partition, indexes = find_partition()
54 partitions.append(partition)
55 records = np.delete(records, indexes)
<ipython-input-37-528588a746b8> in find_partition(at, partition, indexes)
43 continue
44
---> 45 if match_func(r1, r2):
46 partition.add(get_record_index(r2))
47 indexes.append(i)
<ipython-input-38-30b02a8d6342> in same_restaurant(r1, r2)
24 (
25 same_name(r1, r2) and
---> 26 similar_address(r1, r2) and
27 same_city(r1, r2) and
28 same_State(r1, r2)
<ipython-input-38-30b02a8d6342> in similar_address(r1, r2)
5 def similar_address(r1, r2):
6 return (
----> 7 fuzz.ratio(r1['Address'], r2['Address']) > 55 or
8 fuzz.partial_ratio(r1['Address'], r2['Address']) > 75
9 )
~\AppData\Local\Continuum\anaconda3\lib\site-packages\fuzzywuzzy\utils.py in decorator(*args, **kwargs)
36 if args[0] is None or args[1] is None:
37 return 0
---> 38 return func(*args, **kwargs)
39 return decorator
40
~\AppData\Local\Continuum\anaconda3\lib\site-packages\fuzzywuzzy\utils.py in decorator(*args, **kwargs)
27 if args[0] == args[1]:
28 return 100
---> 29 return func(*args, **kwargs)
30 return decorator
31
~\AppData\Local\Continuum\anaconda3\lib\site-packages\fuzzywuzzy\utils.py in decorator(*args, **kwargs)
43 @functools.wraps(func)
44 def decorator(*args, **kwargs):
---> 45 if len(args[0]) == 0 or len(args[1]) == 0:
46 return 0
47 return func(*args, **kwargs)
TypeError: object of type 'float' has no len()
I am still a novice at Python and cannot see why.
Hey @sdragoni. I'm not 100% sure but I think your error is coming from the fact that one of your addresses is a NaN.
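If that's the cause, one possible workaround (just a sketch, not from the original post) is to coerce the address column to strings before matching, since fuzzywuzzy's length check fails on a float NaN:

```python
import pandas as pd

# Hypothetical toy frame standing in for the Companies dataset;
# a missing Address is read in as NaN (a float) by pandas.
companies = pd.DataFrame({
    "Address": ["435 s la cienega blvd", None],
    "City": ["los angeles", "los angeles"],
})

# Replace NaN with empty strings so fuzz.ratio sees strings, not floats.
companies["Address"] = companies["Address"].fillna("").astype(str)

assert companies["Address"].map(lambda a: isinstance(a, str)).all()
```

An empty string will simply score 0 against everything, which is usually the behaviour you want for a missing address.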
Hey Max, can we make this any faster? I find the recursion takes a lot of time on my data.
I love your code, thanks! What do you want this algorithm to be called? I'd like to informally cite it.
@mihirpatel7 I'm sorry to hear that! Do you have a dataset to share with me so that I can try to improve the algorithm?
@BitnaKeum I call it the "cross-chain algorithm" :)
Note for people handling large amounts of data like @mihirpatel7: I just stumbled on this article, which uses trigrams + TF-IDF + cosine similarity. Once you set it up, it outputs a score for each pair of entities in record time. You can threshold that score to determine whether or not two entities are duplicates of each other. Then you can feed the result into what I described in this article.
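A minimal sketch of that idea with scikit-learn (the names below are illustrative, not from the linked article): vectorize each string into character trigrams, then compute pairwise cosine similarities and threshold them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = [
    "arnie morton's of chicago",
    "arnie morton's",
    "art's delicatessen",
]

# Character trigrams tolerate typos and partial matches better than whole words.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
tfidf = vectorizer.fit_transform(names)

# Pairwise cosine similarity matrix; threshold it to flag likely duplicates.
sim = cosine_similarity(tfidf)
duplicates = sim > 0.5  # 0.5 is an arbitrary illustrative threshold
```

The pairs flagged by the threshold can then be passed to the partitioning step described in the post.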
Thank you for your code, it's very helpful. But `max_size` doesn't really work for me; do you have any advice on how I can fix it?
Hey @aalkuatova, you probably have to change `max_size is not None and len(partition) == max_size` to `max_size is not None and len(partition) >= max_size`.
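To see why `>=` is safer than `==` here, consider this hypothetical reduction of the guard (not Max's actual code): if the partition's size ever jumps past `max_size` between checks, an equality test never fires again, whereas `>=` still stops the loop.

```python
def fill_partition(items, max_size=None):
    # Mimics the stopping guard in find_partition: stop growing the
    # partition once it has reached max_size. Using >= rather than ==
    # means the guard still fires even if the size overshoots.
    partition = set()
    for item in items:
        if max_size is not None and len(partition) >= max_size:
            break
        partition.add(item)
    return partition

assert len(fill_partition(range(10), max_size=4)) == 4
assert len(fill_partition(range(3))) == 3  # no cap when max_size is None
```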
Hey Max, I have an issue with the recursive function call... the fuzzy similarity check in your code logic is fine...
find_partition(at=0, partition=None, indexes=None)
I am getting a stack-overflow error due to the recursive calls in my C# port... Is there any way to avoid these tail-recursive calls while doing the same thing? Thanks
Hey @mihirpatel7. I'm sorry but I don't have any time to revisit this. What you're asking is possible, but I simply don't have the time to do it. If I were you I would investigate more robust solutions, such as splink.
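For what it's worth, the recursion can usually be replaced with plain loops. Here's a rough sketch of the same idea (grow each partition with an explicit frontier instead of recursive calls); it assumes `records` is a list and `match_func` a pairwise predicate, and is not Max's original implementation:

```python
def find_partitions_iterative(records, match_func, max_size=None):
    # Same idea as the recursive find_partition, but iterative: seed a
    # partition with one record, then keep absorbing matching records
    # by scanning against each member of the partition in turn.
    records = list(records)
    partitions = []
    while records:
        partition = [records.pop(0)]
        i = 0
        while i < len(partition):
            pivot = partition[i]
            i += 1
            kept = []
            for r in records:
                full = max_size is not None and len(partition) >= max_size
                if not full and match_func(pivot, r):
                    partition.append(r)
                else:
                    kept.append(r)
            records = kept
        partitions.append(partition)
    return partitions

# Toy predicate: two words "match" if they share a first letter.
parts = find_partitions_iterative(
    ["apple", "avocado", "banana", "apricot"],
    lambda a, b: a[0] == b[0],
)
```

The same loop structure translates directly to C#, since it only needs a list and two nested loops rather than a call stack.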
Hi Max,
This code is really great!!! So thank you. Is there any way to show the score/confidence of the real_id fuzzy match in another column?
Thanks
> This code is really great!!! So thank you.

Cheers!

> Is there any way to show the score/confidence of the real_id fuzzy match in another column?
Actually the logic I implemented is deterministic, it's either yes or no. The concept of score/confidence doesn't apply here.
I do however recommend that you check out Robin Linacre's blog section on probabilistic linkage, as well as the Splink tool he's developing. You might find what you're looking for there.
Hello Max, do you have this code in a repo somewhere? Can you please provide the link?
Hey @shahnawazsk-cloud, no sorry this blog post is all there is.
if you're using the `block_by` parameter, you'll need to change line no. 16 in the code block that defines `find_partitions` to `blocks.loc[b, :].add(blocks.loc[a].iloc[-1] + 1)`. my sense is that the newer versions of pandas don't support the `+=` operator with the Series datatype, hope this is helpful!
thank you Max for putting this together, it is exactly what I was looking for. cleanly put together, and very elegantly explained. much appreciated!
another caveat when using the `block_by` parameter: the output `real_id` is not at the dataframe level, but at the level of your `block_by` parameter, in this case `area_code`.
so for the final example in the blog, the output looks like this:
id | address | city | name | phone | type | area_code | real_id |
---|---|---|---|---|---|---|---|
1 | 435 s la... | los angeles | arnie mo... | 310 246 ... | american | 310 | 0 |
2 | 435 s la... | los angeles | arnie mo... | 310 246 ... | steakhouses | 310 | 0 |
3 | 12224 ve... | studio city | art s de... | 818 762 ... | american | 818 | 0 |
4 | 12224 ve... | studio city | art s deli | 818 762 ... | delis | 818 | 0 |
i would suggest creating a separate id combining your `block_by` column and the `real_id` column when using the output.
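That combination could look something like this (the column names follow the example above; `unique_id` is just an illustrative name, not part of the original code):

```python
import pandas as pd

# Toy output mimicking the blocked result above: real_id restarts
# within each area_code block, so it isn't globally unique.
df = pd.DataFrame({
    "name": ["arnie mo...", "arnie mo...", "art s de...", "art s deli"],
    "area_code": [310, 310, 818, 818],
    "real_id": [0, 0, 0, 0],
})

# Prefix real_id with the block_by column to get an identifier
# that is unique across blocks.
df["unique_id"] = df["area_code"].astype(str) + "_" + df["real_id"].astype(str)
```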
Finding fuzzy duplicates with pandas - Max Halford
Duplicate detection is the task of finding two or more instances in a dataset that are in fact identical. As an example, take the following toy dataset: First name Last name Email 0 Erlich Bachman eb@piedpiper.com 1 Erlich Bachmann eb@piedpiper.com 2 Erlik Bachman eb@piedpiper.co 3 Erlich Bachmann eb@piedpiper.com Each of these instances (rows, if you prefer) corresponds to the same “thing” – note that I’m not using the word “entity” because entity resolution is a different, and yet related, concept.
https://maxhalford.github.io/blog/transitive-duplicates/