Bergvca / string_grouper

Super Fast String Matching in Python
MIT License
364 stars 76 forks source link

Pairwise Similarity Computation (User Request) #40

Closed ParticularMiner closed 3 years ago

ParticularMiner commented 3 years ago

Hi @Bergvca ,

Please see the message by @KiraJYQiu in Issue #39.

I'm not sure if @KiraJYQiu's request is functionality you want added to string_grouper. Nevertheless it is brief and simple enough to code that I've easily added it to the code in this branch "dot".

You may merge this PR if you think it's useful. Here's a brief demo:

import pandas as pd
from string_grouper import compute_pairwise_similarities
pair_s = pd.DataFrame(
    [
        ('Mega Enterprises Corporation', 'Mega Enterprises Corporation'),
        ('Hyper Startup Inc.', 'Hyper Startup Incorporated'),
        ('Hyper Startup Inc.', 'Hyper Startup Inc.'),
        ('Hyper Startup Inc.', 'Hyper-Startup Inc.'),
        ('Hyper Hyper Inc.', 'Hyper Hyper Inc.'),
        ('Mega Enterprises Corporation', 'Mega Enterprises Corp.')
   ],
   columns=('left', 'right')
)

pair_s
left right
0 Mega Enterprises Corporation Mega Enterprises Corporation
1 Hyper Startup Inc. Hyper Startup Incorporated
2 Hyper Startup Inc. Hyper Startup Inc.
3 Hyper Startup Inc. Hyper-Startup Inc.
4 Hyper Hyper Inc. Hyper Hyper Inc.
5 Mega Enterprises Corporation Mega Enterprises Corp.
pair_s['similarity'] = compute_pairwise_similarities(pair_s['left'], pair_s['right'])
pair_s
left right similarity
0 Mega Enterprises Corporation Mega Enterprises Corporation 1.000000
1 Hyper Startup Inc. Hyper Startup Incorporated 0.633620
2 Hyper Startup Inc. Hyper Startup Inc. 1.000000
3 Hyper Startup Inc. Hyper-Startup Inc. 1.000000
4 Hyper Hyper Inc. Hyper Hyper Inc. 1.000000
5 Mega Enterprises Corporation Mega Enterprises Corp. 0.826463
probablyfine commented 3 years ago

This is great, thanks for sharing! I've been doing something similar for my use case for a while now, but my solution is not as fast or as elegant as yours. I'd love to see this make its way into the main branch since I use this functionality multiple times per week.

Bergvca commented 3 years ago

This looks good to me, and @probablyfine already validated that this is useful so good for me too merge. Am I correct that all 3 of your PR's are basically the same branch? If that is the case let me know if you are ready with all 3. Since they update the documentation as well I think it is best to update pypi shortly afterwards (so pypi and the master branch are synced).

ParticularMiner commented 3 years ago

Hi @Bergvca ,

I’m glad you approve of all the PR’s. Currently they are each on separate branches. But to make things easiest for you, let me merge all 3 of them into a single branch so you can merge that single branch (containing everything) into yours.

Regarding backward compatibility, if you feel, between now and then, that we should change the default for ignore_index to True, do let me know. You most likely know better how your code is being used out there and by whom.

Wishing you a good week after Easter! Cheers!

ParticularMiner commented 3 years ago

Hi @probablyfine,

Thanks for validating this PR and the compliment! These will soon be added to the main branch.

Bergvca commented 3 years ago

Hi @ParticularMiner,

No need for a single branch, I think it's nicer to have separate.

Cheers to you as well!

On Mon, Apr 5, 2021, 23:13 ParticularMiner @.***> wrote:

Hi @Bergvca https://github.com/Bergvca ,

I’m glad you approve of all the PR’s. Currently they are each on separate branches. But to make things easiest for you, let me merge all 3 of them into a single branch so you can merge that single branch (containing everything) into yours.

Regarding backward compatibility, if you feel, between now and then, that we should change the default for ignore_index to True, do let me know. You most likely know better how your code is being used out there and by whom.

Wishing you a good week after Easter! Cheers!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Bergvca/string_grouper/pull/40#issuecomment-813653405, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC3ROBK7TSNAUWT7KJITAW3THIRW5ANCNFSM4Z5BNHJA .

ParticularMiner commented 3 years ago

Alright. Then let's merge the PR's one by one.

I'd recommend the following merge order:

  1. index (PR #41) Full index-functionality
  2. dot (PR #40) Pairwise similarity computation
  3. sim_zero (PR #43) Zero-similarity bug

I wasn't very careful with my commits, so I expect there to be conflicts between the three branches which only I can resolve. But no worries, I can resolve these conflicts by syncing my local master branch with yours after each PR-merge you make, and then carefully merging my synced master branch into the next PR branch to be merged. I hope that's OK?

Hi @ParticularMiner, No need for a single branch, I think it's nicer to have separate. Cheers to you as well! On Mon, Apr 5, 2021, 23:13 ParticularMiner @.***> wrote: Hi @Bergvca https://github.com/Bergvca , I’m glad you approve of all the PR’s. Currently they are each on separate branches. But to make things easiest for you, let me merge all 3 of them into a single branch so you can merge that single branch (containing everything) into yours. Regarding backward compatibility, if you feel, between now and then, that we should change the default for ignore_index to True, do let me know. You most likely know better how your code is being used out there and by whom. Wishing you a good week after Easter! Cheers! — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#40 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC3ROBK7TSNAUWT7KJITAW3THIRW5ANCNFSM4Z5BNHJA .

ParticularMiner commented 3 years ago

I may have fixed the conflicts now. You may go ahead and merge each branch.

If it helps, here's how the branches look like currently:

2021-04-07

Bergvca commented 3 years ago

Hi @ParticularMiner - as you mentioned there seem to be some conflicts :)

ParticularMiner commented 3 years ago

Hi @Bergvca,

I've fixed this one. So you can go ahead and merge. Sorry for the inconvenience.

Hi @ParticularMiner - as you mentioned there seem to be some conflicts :)