dedupeio / dedupe

:id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
https://docs.dedupe.io
MIT License
4.15k stars, 551 forks

Improve cpredicates.pyx #1145

Closed lmores closed 1 year ago

lmores commented 1 year ago

Changes:

I am not sure how to check the runtime improvement: when I run `python benchmarks/benchmarks/canonical.py`, the execution time is 11,xxx seconds both before and after the change.

@fgregg: I am likely to open many more PRs like this. Please tell me if you are fine with them. Of course, if I plan to submit bigger changes, I will open a thread to discuss them before actually implementing them.
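On the runtime question above: an end-to-end benchmark may be too coarse to show a change in one small function, so a micro-benchmark with the standard-library `timeit` module is one possible way to isolate it. The sketch below is only an illustration of the measurement method. Both functions are hypothetical pure-Python stand-ins, not the Cython code from `cpredicates.pyx`, and the sample string and repeat counts are arbitrary.

```python
import timeit
from typing import List


def ngrams_loop(field: str, n: int) -> List[str]:
    """Hypothetical 'before' version: build the list with explicit appends."""
    grams = []
    for i in range(len(field) - n + 1):
        grams.append(field[i:i + n])
    return grams


def ngrams_comprehension(field: str, n: int) -> List[str]:
    """Hypothetical 'after' version: same result via a list comprehension."""
    return [field[i:i + n] for i in range(len(field) - n + 1)]


# Arbitrary sample input, long enough that each call does measurable work.
sample = "deduplication and entity resolution " * 20

for func in (ngrams_loop, ngrams_comprehension):
    # Take the best of several repeats to reduce noise from other processes.
    best = min(timeit.repeat(lambda: func(sample, 3), number=1000, repeat=5))
    print(f"{func.__name__}: {best:.4f} s for 1000 calls")
```

For the compiled extension itself, the same pattern would apply after importing the built function, timing it once on each branch.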

codecov[bot] commented 1 year ago

Codecov Report

Base: 73.84% // Head: 73.84% // No change to project coverage :thumbsup:

Coverage data is based on head (e88470b) compared to base (baa6071). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main    #1145   +/-   ##
=======================================
  Coverage   73.84%   73.84%
=======================================
  Files          28       28
  Lines        2294     2294
=======================================
  Hits         1694     1694
  Misses        600      600
```

| [Impacted Files](https://codecov.io/gh/dedupeio/dedupe/pull/1145?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dedupeio) | Coverage Δ | |
|---|---|---|
| [dedupe/predicates.py](https://codecov.io/gh/dedupeio/dedupe/pull/1145?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dedupeio#diff-ZGVkdXBlL3ByZWRpY2F0ZXMucHk=) | `83.95% <100.00%> (ø)` | |


fgregg commented 1 year ago

the uniqueness guarantee is required

lmores commented 1 year ago

> the uniqueness guarantee is required

But was not enforced, right?

fgregg commented 1 year ago

it was enforced by the set object

fgregg commented 1 year ago

hmm that’s true

lmores commented 1 year ago

Sorry, but I don't understand. At the moment I see no `set()` object inside cpredicates.pyx; in particular, `ngrams` is a list. I can make it a set if necessary.

fgregg commented 1 year ago

you are right that the current code doesn’t enforce uniqueness, i’ll have to check where that is enforced

fgregg commented 1 year ago

everywhere we call this in predicates.py, we call set on it.

it's a bit silly to do this. let's have cpredicates fill out a set and then not have those set calls in predicates.py.
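A minimal sketch of the idea in the comment above, written in plain Python rather than Cython. The function names and bodies are assumptions for illustration and are not the code in `cpredicates.pyx`. It shows that a list-returning n-gram function keeps duplicates, so uniqueness only appears when the caller wraps the result in `set()`, and that building the set directly gives the same result without the extra call.

```python
from typing import List, Set


def ngrams(field: str, n: int) -> List[str]:
    """Stand-in for a list-returning n-gram function: duplicates are kept."""
    return [field[i:i + n] for i in range(len(field) - n + 1)]


def ngrams_as_set(field: str, n: int) -> Set[str]:
    """Hypothetical variant that fills a set directly, enforcing uniqueness."""
    return {field[i:i + n] for i in range(len(field) - n + 1)}


print(ngrams("banana", 2))                 # ['ba', 'an', 'na', 'an', 'na']
print(sorted(ngrams_as_set("banana", 2)))  # ['an', 'ba', 'na']

# The caller-side set() becomes redundant once the function builds a set.
assert set(ngrams("banana", 2)) == ngrams_as_set("banana", 2)
```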

lmores commented 1 year ago

Actually there is one exception:

```python
class TfidfNGramPredicate(IndexPredicate):
    def preprocess(self, doc: str) -> Sequence[str]:
        return tuple(sorted(ngrams(" ".join(strip_punc(doc).split()), 2)))
```

But we probably want the ngrams to be unique here too?

fgregg commented 1 year ago

ah.. we actually don't want ngrams to be unique there.
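The thread does not spell out the reason, but one plausible reading is that TF-IDF weights depend on how often each term occurs in a document, so collapsing repeated n-grams would discard the term-frequency signal. The sketch below illustrates that under this assumption. `ngrams` here is a pure-Python stand-in, not the library's Cython function.

```python
from collections import Counter
from typing import List


def ngrams(field: str, n: int) -> List[str]:
    """Pure-Python stand-in for a list-returning n-gram function."""
    return [field[i:i + n] for i in range(len(field) - n + 1)]


doc = "banana"

# With duplicates kept, the counts reflect how often each bigram occurs.
with_duplicates = Counter(ngrams(doc, 2))
# After deduplication, every bigram appears exactly once.
deduplicated = Counter(set(ngrams(doc, 2)))

print(with_duplicates["an"])  # 2
print(deduplicated["an"])     # 1
```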

lmores commented 1 year ago

How about the unique_ngrams function I added in the last commit?

fgregg commented 1 year ago

looks good!