MaxHalford / maxhalford.github.io

:house_with_garden: Personal website
https://maxhalford.github.io
MIT License

blog/transitive-duplicates/ #4

Open utterances-bot opened 4 years ago

utterances-bot commented 4 years ago

Finding fuzzy duplicates with pandas - Max Halford

Duplicate detection is the task of finding two or more instances in a dataset that are in fact identical. As an example, take the following toy dataset:

|   | First name | Last name | Email |
|---|------------|-----------|-------|
| 0 | Erlich | Bachman | eb@piedpiper.com |
| 1 | Erlich | Bachmann | eb@piedpiper.com |
| 2 | Erlik | Bachman | eb@piedpiper.co |
| 3 | Erlich | Bachmann | eb@piedpiper.com |

Each of these instances (rows, if you prefer) corresponds to the same “thing” – note that I’m not using the word “entity” because entity resolution is a different, and yet related, concept.

https://maxhalford.github.io/blog/transitive-duplicates/

sdragoni commented 4 years ago

Hi Max,

This is excellent, but when I edited and ran it I received the following error.

TypeError                                 Traceback (most recent call last)
<ipython-input-40-a232eede1230> in <module>
      2     df=Companies,
      3     match_func=same_restaurant,
----> 4     max_size=4
      5 )

<ipython-input-37-528588a746b8> in find_partitions(df, match_func, max_size, block_by)
     51 
     52     while len(records) >= 1:
---> 53         partition, indexes = find_partition()
     54         partitions.append(partition)
     55         records = np.delete(records, indexes)

<ipython-input-37-528588a746b8> in find_partition(at, partition, indexes)
     43                 continue
     44 
---> 45             if match_func(r1, r2):
     46                 partition.add(get_record_index(r2))
     47                 indexes.append(i)

<ipython-input-38-30b02a8d6342> in same_restaurant(r1, r2)
     24         (
     25             same_name(r1, r2) and
---> 26             similar_address(r1, r2) and
     27             same_city(r1, r2) and
     28             same_State(r1, r2)

<ipython-input-38-30b02a8d6342> in similar_address(r1, r2)
      5 def similar_address(r1, r2):
      6     return (
----> 7         fuzz.ratio(r1['Address'], r2['Address']) > 55 or
      8         fuzz.partial_ratio(r1['Address'], r2['Address']) > 75
      9     )

~\AppData\Local\Continuum\anaconda3\lib\site-packages\fuzzywuzzy\utils.py in decorator(*args, **kwargs)
     36         if args[0] is None or args[1] is None:
     37             return 0
---> 38         return func(*args, **kwargs)
     39     return decorator
     40 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\fuzzywuzzy\utils.py in decorator(*args, **kwargs)
     27         if args[0] == args[1]:
     28             return 100
---> 29         return func(*args, **kwargs)
     30     return decorator
     31 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\fuzzywuzzy\utils.py in decorator(*args, **kwargs)
     43     @functools.wraps(func)
     44     def decorator(*args, **kwargs):
---> 45         if len(args[0]) == 0 or len(args[1]) == 0:
     46             return 0
     47         return func(*args, **kwargs)

TypeError: object of type 'float' has no len()

I am still a novice at Python and cannot see why.

MaxHalford commented 4 years ago

Hey @sdragoni. I'm not 100% sure but I think your error is coming from the fact that one of your addresses is a NaN.
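For anyone hitting the same TypeError: a NaN in pandas is a float, so len() fails on it inside fuzzywuzzy. A minimal guard is sketched below (the helper names are made up, and a stand-in scorer is used so the sketch runs without fuzzywuzzy); running df['Address'] = df['Address'].fillna('') before matching is another way to achieve the same thing.

```python
import math

def is_missing(value):
    """True for None or NaN, the usual pandas placeholders for missing text."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def safe_ratio(a, b, ratio_func=None):
    """Wrap a fuzzy ratio so missing values score 0 instead of raising a TypeError."""
    if is_missing(a) or is_missing(b):
        return 0
    if ratio_func is None:
        # Stand-in scorer so this sketch runs without fuzzywuzzy installed:
        # 100 for identical strings, 0 otherwise. Pass fuzz.ratio here in practice.
        return 100 if a == b else 0
    return ratio_func(a, b)
```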

mihirpatel7 commented 3 years ago

Hey Max, can we make this faster? I find the recursive version takes a long time on my data.

BitnaKeum commented 2 years ago

I love your codes, thanks! What do you want this algorithm to be called? I'd like to informally cite it.

MaxHalford commented 2 years ago

@mihirpatel7 I'm sorry to hear that! Do you have a dataset to share with me so that I can try to improve the algorithm?

@BitnaKeum I call it the "cross-chain algorithm" :)

MaxHalford commented 2 years ago

Note for people handling large amounts of data like @mihirpatel7: I just stumbled on this article, which uses trigrams + TF-IDF + cosine similarity. Once you set it up, it will output a score for each pair of entities in record time. You can threshold that score to determine whether or not two entities are duplicates of each other. Then you can feed the result into what I described in this article.
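To give a rough idea of the trigrams + TF-IDF + cosine similarity approach, here is a self-contained sketch using only the standard library (real pipelines would use a vectorizer library; the function names here are made up):

```python
import math
from collections import Counter

def trigrams(text):
    """Character trigrams of a padded, lowercased string."""
    text = f"  {text.lower()} "
    return [text[i:i + 3] for i in range(len(text) - 2)]

def tfidf_vectors(docs):
    """One sparse TF-IDF vector (dict of trigram -> weight) per document."""
    tf = [Counter(trigrams(d)) for d in docs]
    df = Counter()
    for counts in tf:
        df.update(counts.keys())
    n = len(docs)
    return [
        {g: c * math.log((1 + n) / (1 + df[g])) for g, c in counts.items()}
        for counts in tf
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[g] * v[g] for g in u if g in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Thresholding the cosine score gives cheap candidate pairs, which can then be fed into the partitioning logic from the post.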

aalkuatova commented 2 years ago

Thank you for your code, it's very helpful! But the max_size doesn't really work for me, do you have any advice on how I can fix it?

MaxHalford commented 2 years ago

Hey @aalkuatova, you probably have to change max_size is not None and len(partition) == max_size to max_size is not None and len(partition) >= max_size.
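Without the original code block in front of us, the general reason >= is the safer guard is that an equality check only fires if the size check is evaluated at the exact moment the partition's length lands on max_size; >= still triggers if the length has moved past it. A minimal illustration:

```python
partition = set()
max_size = 4

# Suppose the partition grows past max_size between two checks.
partition |= {0, 1, 2, 3, 4}

stop_with_eq = len(partition) == max_size  # False: the size skipped over 4
stop_with_ge = len(partition) >= max_size  # True: the cap still triggers
```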

mihirpatel7 commented 2 years ago

Hey Max, I have an issue with the recursive function call... the fuzzy similarity check in your code logic is fine...

find_partition(at=0, partition=None, indexes=None)

I am getting a stack-overflow error due to the recursive calls in my C# port.... Is there any way to avoid these tail-recursive calls while doing the same thing? Thanks

MaxHalford commented 2 years ago

Hey @mihirpatel7. I'm sorry but I don't have any time to revisit this. What you're asking is possible, but I simply don't have the time to do it. If I were you I would investigate more robust solutions, such as splink.
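For anyone else hitting stack overflows: the recursion can be replaced with an explicit queue that grows each partition breadth-first, which bounds the call depth regardless of chain length. This is a sketch of the general technique, not the code from the post (find_partitions_iterative and its signature are made up):

```python
from collections import deque

def find_partitions_iterative(records, match_func):
    """Group records into partitions of transitive duplicates using an
    explicit queue instead of recursion, so long duplicate chains
    cannot overflow the call stack."""
    unassigned = set(range(len(records)))
    partitions = []
    while unassigned:
        seed = unassigned.pop()
        partition = {seed}
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            # Pull every still-unassigned record that matches record i.
            matches = {j for j in unassigned if match_func(records[i], records[j])}
            unassigned -= matches
            partition |= matches
            queue.extend(matches)
        partitions.append(sorted(partition))
    return partitions
```

The same restructuring (an explicit queue or stack in place of recursion) translates directly to C#.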

msiemion commented 2 years ago

Hi Max,

This code is really great!!! So thank you. Is there any way to show the score/confidence of the real_id fuzzy match in another column?

Thanks

MaxHalford commented 2 years ago

This code is really great!!! So thank you.

Cheers!

Is there any way to show the score/confidence of the real_id fuzzy match in another column?

Actually the logic I implemented is deterministic, it's either yes or no. The concept of score/confidence doesn't apply here.

I do however recommend that you check out Robin Linacre's blog section on probabilistic linkage, as well as the Splink tool he's developing. You might find what you're looking for there.
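That said, if an approximate confidence is still wanted, one workaround (not part of the original post) is to have the match function return the underlying ratio alongside its boolean verdict, and store that ratio in a column. Sketched here with the standard library's difflib standing in for fuzzywuzzy, and a made-up Address-only matcher:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """0-100 similarity; a stdlib stand-in for fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def match_with_score(r1, r2, threshold=75):
    """Return (is_match, score) so the score can be kept in its own column."""
    score = ratio(r1["Address"], r2["Address"])
    return score > threshold, score
```

Note this only gives the score of one pairwise comparison; a partition built from several chained matches has no single natural score, which is why the original logic is a plain yes/no.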

shahnawazsk-cloud commented 1 year ago

Hello Max, do you have this code in some repo? Can you please provide the link?

MaxHalford commented 1 year ago

Hey @shahnawazsk-cloud, no sorry this blog post is all there is.

bktaha commented 8 months ago

If you're using the block_by parameter, you'll need to change line no. 16 in the code block that defines find_partitions to blocks.loc[b, :].add(blocks.loc[a].iloc[-1] + 1)

My sense is that newer versions of pandas don't support the += operator with the Series datatype; hope this is helpful!

Thank you Max for putting this together, it is exactly what I was looking for. Cleanly put together, and very elegantly explained. Much appreciated!

bktaha commented 8 months ago

Another caveat when using the block_by parameter: the output real_id is not at the dataframe level, but at the level of your block_by column, in this case area_code.

So for the final example in the blog, the output looks like this:

| id | address | city | name | phone | type | area_code | real_id |
|----|---------|------|------|-------|------|-----------|---------|
| 1 | 435 s la... | los angeles | arnie mo... | 310 246 ... | american | 310 | 0 |
| 2 | 435 s la... | los angeles | arnie mo... | 310 246 ... | steakhouses | 310 | 0 |
| 3 | 12224 ve... | studio city | art s de... | 818 762 ... | american | 818 | 0 |
| 4 | 12224 ve... | studio city | art s deli | 818 762 ... | delis | 818 | 0 |

I would suggest creating a separate id combining your block_by column and the real_id column when using the output.
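That combination can be done in one line; a sketch with pandas, using the column names from the example above ("global_id" is a made-up name):

```python
import pandas as pd

# Toy frame mirroring the blocked output: real_id restarts inside each area_code.
df = pd.DataFrame({
    "area_code": ["310", "310", "818", "818"],
    "real_id": [0, 0, 0, 0],
})

# Combine the block key with real_id so the id is unique across the whole frame.
df["global_id"] = df["area_code"].astype(str) + "_" + df["real_id"].astype(str)
```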