Bergvca / string_grouper

Super Fast String Matching in Python
MIT License

added blocking capabilities #72

Closed · ParticularMiner closed this 3 years ago

ParticularMiner commented 3 years ago

Hi @Bergvca

I'm glad you like red_string_grouper! So do I. 😃

  1. As requested, here is a branch with all the matrix-blocking capabilities included.

    The following call on the sec__edgar_companies.csv data file took 4 minutes (down from 25 minutes on my computer with the last version of string_grouper!)

    matches = match_strings(companies['Company Name'], n_blocks=(1, 200))

    It's almost hard to believe.

  2. Included is also the option to supply an already initialized StringGrouper object to the high-level functions match_strings, match_most_similar, group_similar_strings and compute_pairwise_similarities.

    This enables a StringGrouper object to be reused, so that the corpus can persist between calls, as mentioned by @justasojourner in Issue #69 (see the sketch just below this list).

  3. The README.md currently contains links to files on my branch in order to display the images, and CHANGELOG.md has similar links to README.md. After merging (or rebasing) this branch, you will need to do a "search and replace" on the following strings before uploading to pypi.org:

    "ParticularMiner/string_grouper/block/" → "Bergvca/string_grouper/master/"
    "ParticularMiner/string_grouper/tree/block/" → "Bergvca/string_grouper/tree/master/"
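
To illustrate item 2, here is a minimal sketch of what reusing an already initialized StringGrouper in a high-level call could look like. The keyword name string_grouper= is an assumption for illustration only; the exact signature is in the branch.

import pandas as pd
from string_grouper import StringGrouper, match_strings

# The SEC EDGAR company file mentioned in item 1 above.
companies = pd.read_csv('sec__edgar_companies.csv')

# Build the corpus once on the full dataset; no matching happens here, so this is fast.
sg = StringGrouper(companies['Company Name'])

# Reuse the same StringGrouper (and hence the same corpus) in a high-level call.
# The keyword name string_grouper= is assumed purely for illustration.
matches = match_strings(companies['Company Name'], string_grouper=sg)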

ParticularMiner commented 3 years ago

@Bergvca

Unit tests are now complete. Code coverage is also at 100%.

Bergvca commented 3 years ago

Hi @ParticularMiner ,

What would be the best order to review the different PRs? Is it best to start with this one? Thanks!

ParticularMiner commented 3 years ago

Hi @Bergvca ,

Yes. This one.

ParticularMiner commented 3 years ago

@Bergvca

Please don't hesitate to ask if you have any questions.

Cheers.

ParticularMiner commented 3 years ago

@Bergvca

Thanks for your comments.

I've updated the code by restoring the high-level functions to what they once were.

Also, I've restored the typing specifications Optional[pd.Series]. Sorry, I shouldn't have removed those.

All getter and setter methods have been removed, except those for master and duplicates, since those two have their own validation routines.

Following my suggestion of StringGrouper "shadow member functions" (member functions that correspond to the high-level functions but do not rebuild the existing corpus), I've also added the following:

# Each call below reuses the corpus built when sg was constructed,
# instead of rebuilding it from the new series:
sg = StringGrouper(master, duplicates)
sg.match_strings(new_master, new_duplicates, min_similarity=0.7)
sg.match_most_similar(new_master, new_duplicates, min_similarity=0.7)
sg.group_similar_strings(new_master, min_similarity=0.7)
sg.compute_pairwise_similarities(new_master, new_duplicates, min_similarity=0.7)

What do you think?

I'm also awaiting your decision on red_string_grouper's wish-list for string_grouper, so as to decide whether or not to finally remove the code reorganization in StringGrouper's __init__() function.

Bergvca commented 3 years ago

Hi @ParticularMiner,

I still don't see the use case for these "shadow member functions". I understand, for example, that you want to dedupe a large dataset first and later use the same StringGrouper with smaller datasets. In that case, however, wouldn't you just want to do something like:

sg = StringGrouper(master, duplicates)
sg.fit()
result = sg.get_groups()
# now later with a new set of dupes:
sg.set_duplicates(new_duplicates)
result = sg.get_groups()

Does that make sense? Maybe I'm just missing something?

ParticularMiner commented 3 years ago

Hi @Bergvca

In a red_string_grouper use case for example, I wouldn't want to execute the second and third lines of your code:

sg = StringGrouper(master, duplicates)
sg.fit()
result = sg.get_groups()

because they would take just too much time for large master and duplicates. From empirical observation, matching takes far more time than just building the corpus. Building the corpus alone takes a very short time (seconds) even for large datasets.

Besides, I wouldn't necessarily need all the matches between master and duplicates. For example, in red_string_grouper I do not match completely new series but rather subsets of the original master. That way, I can choose which strings get matched and which do not:

sg = StringGrouper(master)  # this would build the corpus; no matching, so very fast
sg.set_master(subset_1_of_master)
sg.fit()  # matching here is only between strings in this subset: subset_1_of_master
result_1 = sg.get_matches()
sg.set_master(subset_2_of_master)
sg.fit()  # matching here is only between strings in this subset: subset_2_of_master
result_2 = sg.get_matches()

You see, in this way I avoid unwanted and time-consuming matching between strings of subset_1_of_master and strings of subset_2_of_master. At the same time, if I had not maintained the corpus across all the calls, the similarity results would have been inconsistent with each other, because they would have been based on different corpora.

As I mentioned in a previous comment (I don't know if you received it), from a predictive modeling point of view, the new series need not even have all its n-grams in the existing corpus. Sure, it may not look ideal but it is still a predictive model.

See 2nd paragraph on page 50 of "Deep Learning for Natural Language Processing: Develop Deep Learning Models" By Jason Brownlee for an example of a new dataset with some n-grams not in the corpus.

ParticularMiner commented 3 years ago

@Bergvca

So my 7-line code snippet above would be condensed into the following 3 lines if a shadow member function were available:

sg = StringGrouper(master)  # this would build the corpus; no matching, so very fast
result_1 = sg.match_strings(subset_1_of_master)
result_2 = sg.match_strings(subset_2_of_master)

ParticularMiner commented 3 years ago

@Bergvca

  • Ah I understand now. To make it more clear you could also do del left_matrix

I agree. I was ahead of you there. 😄 Did that already.

ParticularMiner commented 3 years ago

@Bergvca

So one concrete example, which I came across while working with two users in Issue #64 and which closely follows my suggested 3-line code snippet above, went like this:

Dataset: a large DataFrame with columns 'Address' and 'State' containing data from the USA.
Goal: to match the addresses.

The first user reasoned that it was pointless to match any two addresses that were in different states. So he would group his data by state first and perform the matching on each group. That made sense to me and also seemed efficient.

On the other hand, the second user didn't bother grouping at all and decided to filter his matching results afterwards. And it seemed OK to me too. Indeed, it was the natural way to go!

But as I studied both users' results, it turned out that the similarity scores for the same pair of addresses differed between the two users. Not only that: for the same min_similarity, some matches found in the second user's data were missing from the first user's results!

That's when I remembered that the corpora had been different for each user: the second user had used a single corpus for all his matches, while the first user had used 51 different corpora (corresponding to 51 US states). Furthermore, it seemed that the smaller corpora had led to smaller similarity scores, some of which fell below the similarity threshold.

So, to mend the discrepancy between the two results, while hoping also to take advantage of the first user's tremendous gains in efficiency, I thought of a way to recode StringGrouper to fix the corpus while preserving its user interface. Hence also red_string_grouper.
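
To make that concrete, here is a sketch of per-state matching against a single fixed corpus, using the proposed shadow member function match_strings from the 3-line snippet above; the DataFrame and its contents are hypothetical stand-ins for the users' data.

import pandas as pd
from string_grouper import StringGrouper

# Hypothetical stand-in for the users' large DataFrame of US addresses.
addresses = pd.DataFrame({
    'Address': ['123 Main St', '123 Main Street', '55 Oak Ave', '55 Oak Avenue'],
    'State':   ['NY',          'NY',              'CA',         'CA'],
})

# Build the corpus ONCE over all addresses (no matching yet, so this is fast).
sg = StringGrouper(addresses['Address'])

# Match within each state separately, but always against the same shared corpus,
# so that the similarity scores remain comparable across states.
per_state_matches = [
    sg.match_strings(group['Address'])  # proposed shadow member function
    for _, group in addresses.groupby('State')
]
all_matches = pd.concat(per_state_matches, ignore_index=True)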

Bergvca commented 3 years ago

Ah, clear now, thanks! A question about the variables outside the __init__() function: I think you made a comment that explained it, but I no longer see it. Could you explain it again? (Maybe it was deleted by accident? Or am I just overlooking it?)

I ran the test and it really is crazy fast, almost impossibly fast, but the results are correct as far as I can see.

ParticularMiner commented 3 years ago

Hi @Bergvca

Thanks for your comments.

For some reason, I'm unable to reply to your in-code comments. So I'll put my comments here:

You were right about _symmetrize_matrix and lil_matrix. I've fixed them now.

I've also fixed block_true_max_n_matches.

I've also "declared" all class members in __init__(). Let me know if that fixes the issues with your IDE. As for the missing comments you referred to, I'm not sure what happened to them; please check whether they are still missing after this latest commit.

Cheers.

Bergvca commented 3 years ago

Hi @ParticularMiner, thanks for all the updates. From my side I can merge if you are ready. However, I do have a few questions / ideas to add:

Let me know what you think. I'm also happy to merge the new version as is, and we can choose to add the above ideas in future versions.

ParticularMiner commented 3 years ago

Hi @Bergvca

  • When we hit the overflow error, we are potentially already running calculations for a while, are we not? Should we not output a warning notifying the user of the overflow error, so that in the future it can be solved by setting n_blocks?

I think that’s a good idea. I’ll add that.

  • Since you found that a blocksize of 80k strings seems to be the optimal blocksize, should we not just:
    • Add an option to guesstimate the optimal number of blocks (n_strings / 80 000)
    • If the overflow error is hit OR the user inputs a parameter requesting the optimal number of blocks (for example n_blocks = 'estimate_optimal'), this estimate is used. This could even be the default.

I had been thinking along similar lines. But one thing that kept me from following through with it was the fact that the guesstimate is both data- and machine-dependent (probably even dependent on the available memory size). So estimating the actual optimal number of blocks could be a more difficult problem.

But if you have any ideas on how to compute that on the fly, please do share. For instance, I believe there are Python packages that can promptly report the available memory size, which in turn could be used to calculate the guesstimate.
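
For what it's worth, here is a rough sketch of the kind of on-the-fly guesstimate being discussed, using psutil to read the available memory. The 80 000-strings-per-block figure comes from the discussion above, while the memory scaling and the 16 GiB reference point are pure assumptions for illustration.

import math
import psutil

def guess_n_blocks(n_strings, strings_per_block=80_000, reference_mem_gib=16):
    """Guesstimate how many blocks to split the right operand into.

    strings_per_block is the empirical ~80k figure; scaling it by the
    currently available memory (relative to an assumed 16 GiB reference
    machine) is only one guess at how to make the estimate machine-dependent.
    """
    available_gib = psutil.virtual_memory().available / 2**30
    adjusted_block_size = strings_per_block * available_gib / reference_mem_gib
    return max(1, math.ceil(n_strings / adjusted_block_size))

# e.g. match_strings(companies['Company Name'],
#                    n_blocks=(1, guess_n_blocks(len(companies))))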

  • Since setting the number of blocks seems to have such a big impact, should we not make it more prominent in the readme (I don't have a good idea yet how though, maybe add it to the examples?)

Sure. No problem. We can think of ways to do that.

  • Let me know what you think. I'm also happy to merge the new version as is, and we can choose to add the above ideas in future versions.

Yes. Let’s merge now and add more later.

Thanks.

ParticularMiner commented 3 years ago

Hi @Bergvca

  • When we hit the overflow error, we are potentially already running calculations for a while, are we not? Should we not output a warning notifying the user of the overflow error, so that in the future it can be solved by setting n_blocks?

I've just added the warnings. I'm still thinking about the rest ...
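
For reference, the pattern being described could look roughly like the sketch below (warn when the overflow is hit, then fall back to block-wise matching). This is only an illustration of the idea, not the code actually added in the PR, and both callables are hypothetical stand-ins for string_grouper internals.

import warnings

def multiply_with_fallback(multiply_full, multiply_in_blocks, left, right):
    """Try the un-blocked multiplication first; on OverflowError, warn the
    user and fall back to the block-wise path."""
    try:
        return multiply_full(left, right)
    except OverflowError:
        warnings.warn(
            "Data too large to match in one block (OverflowError). "
            "Falling back to block-wise matching; set n_blocks explicitly "
            "to avoid the wasted first attempt."
        )
        return multiply_in_blocks(left, right)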

Bergvca commented 3 years ago

Thanks! I'll merge this now and we can think about further enhancements later.

Bergvca commented 3 years ago

Merged and updated on PyPI.