dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication, and entity resolution.
https://docs.dedupe.io
MIT License
4.15k stars · 551 forks

Change predicate function signatures #1147

Closed lmores closed 1 year ago

lmores commented 1 year ago

This PR implements the changes discussed in #1146 and some consequences thereof.

Changes contained in this PR

TO DO?

@fgregg: as you see I took the liberty to apply some more changes other than those we discussed in #1146, let my know if you agree on them or if I have to revert something.
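For readers outside the thread, the kind of signature change discussed in #1146 can be sketched as follows. This is an illustrative example, not dedupe's actual code: the function name and field handling are hypothetical; the point is the shift from returning an ordered tuple of block keys to returning a frozenset, which deduplicates keys and is hashable.

```python
# Illustrative sketch (not dedupe's actual API): a predicate function's
# return type changes from a tuple of block keys to a frozenset.

def word_predicate_old(field: str) -> tuple[str, ...]:
    # Old-style signature: ordered, may contain duplicate keys.
    return tuple(field.split())

def word_predicate_new(field: str) -> frozenset[str]:
    # New-style signature: unordered, deduplicated, hashable.
    return frozenset(field.split())

print(word_predicate_old("main st main st"))  # ('main', 'st', 'main', 'st')
print(word_predicate_new("main st main st"))  # frozenset({'main', 'st'})
```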

codecov[bot] commented 1 year ago

Codecov Report

Base: 73.83% // Head: 73.72% // Decreases project coverage by 0.11% :warning:

Coverage data is based on head (d97e48c) compared to base (4e44a5f). Patch coverage: 91.30% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1147      +/-   ##
==========================================
- Coverage   73.83%   73.72%   -0.11%
==========================================
  Files          28       29       +1
  Lines        2308     2322      +14
==========================================
+ Hits         1704     1712       +8
- Misses        604      610       +6
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| dedupe/training.py | `63.03% <50.00%> (-0.46%)` | :arrow_down: |
| dedupe/labeler.py | `75.56% <66.66%> (-0.69%)` | :arrow_down: |
| dedupe/predicates.py | `76.47% <70.37%> (-7.49%)` | :arrow_down: |
| dedupe/_typing.py | `92.95% <100.00%> (ø)` | |
| dedupe/predicate_functions.py | `100.00% <100.00%> (ø)` | |
| dedupe/variables/string.py | `86.84% <100.00%> (ø)` | |

:umbrella: View full report at Codecov.

lmores commented 1 year ago

Is there a way to run all CI tests locally (including all the different Python versions) to avoid pushing just to discover problems on GitHub? (sorry for the noob question)

lmores commented 1 year ago

Sorry for the mess of force pushes; it seems I fixed all the problems except those raised by mypy and black: on my machine they run without errors, so I cannot reproduce (and fix) them.

I'm done force-pushing patches -.-'

fgregg commented 1 year ago

just rebase this onto main so we can make sure the linting passes and we’ll be good to go

fgregg commented 1 year ago

@benchmark

github-actions[bot] commented 1 year ago

All benchmarks (diff):

before after ratio benchmark
536M 536M 1.00 canonical.Canonical.peakmem_run
13.7±0s 14.9±0.4s 1.08 canonical.Canonical.time_run
0.962 0.935 0.97 canonical.Canonical.track_precision
0.902 0.902 1.00 canonical.Canonical.track_recall
236M 240M 1.02 canonical_gazetteer.Gazetteer.peakmem_run(None)
12.8±0.2s 13.7±0.08s 1.07 canonical_gazetteer.Gazetteer.time_run(None)
0.982 0.982 1.00 canonical_gazetteer.Gazetteer.track_precision(None)
0.982 0.982 1.00 canonical_gazetteer.Gazetteer.track_recall(None)
236M 240M 1.02 canonical_matching.Matching.peakmem_run({'threshold': 0.5, 'constraint': 'many-to-one'})
235M 240M 1.02 canonical_matching.Matching.peakmem_run({'threshold': 0.5})
11.7±0.01s 11.8±0.02s 1.00 canonical_matching.Matching.time_run({'threshold': 0.5, 'constraint': 'many-to-one'})
11.8±0.02s 11.9±0.04s 1.01 canonical_matching.Matching.time_run({'threshold': 0.5})
0.99 0.99 1.00 canonical_matching.Matching.track_precision({'threshold': 0.5, 'constraint': 'many-to-one'})
0.99 0.99 1.00 canonical_matching.Matching.track_precision({'threshold': 0.5})
0.911 0.911 1.00 canonical_matching.Matching.track_recall({'threshold': 0.5, 'constraint': 'many-to-one'})
0.911 0.911 1.00 canonical_matching.Matching.track_recall({'threshold': 0.5})

(logs)

fgregg commented 1 year ago

i'm a bit concerned by the slowdown in what is supposed to be a performance-improving PR. let's run @benchmark again

lmores commented 1 year ago

Hmm, with frozensets in place I no longer expect any performance improvement, as nothing in the long comment I placed at the beginning of predicate_functions.py applies anymore.

fgregg commented 1 year ago

i thought your argument was that not casting to sets in the labeler routines was where we would see performance gains?
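The argument referenced here can be sketched like this. Names and signatures are hypothetical, not dedupe's labeler API; the sketch only shows why returning a frozenset from a predicate can save work downstream: a consumer that needs set operations no longer has to cast the result on every call.

```python
# Hypothetical sketch (illustrative names, not dedupe's labeler code):
# with tuple-returning predicates, every set operation needs a cast;
# with frozenset-returning predicates, the value is used directly.

def shared_keys_old(keys_a: tuple[str, ...], keys_b: tuple[str, ...]) -> bool:
    # Old signatures returned tuples: cast to set on every call.
    return bool(set(keys_a) & set(keys_b))

def shared_keys_new(keys_a: frozenset[str], keys_b: frozenset[str]) -> bool:
    # New signatures return frozensets: intersection works directly.
    return bool(keys_a & keys_b)

print(shared_keys_new(frozenset({"main", "st"}), frozenset({"st", "apt"})))
```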

fgregg commented 1 year ago

@benchmark

lmores commented 1 year ago

Yes, you are right, there was also that argument... could it be that frozensets are slightly less performant than sets in CPython? (don't know, just wondering)
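One way to check the question above is a quick micro-benchmark of set versus frozenset construction in CPython. This is a sketch with made-up sample data; on most machines the two are within noise of each other, so no expected numbers are claimed here.

```python
# Micro-benchmark sketch: construction cost of set vs frozenset in
# CPython. Sample data is arbitrary; timings vary by machine.
import timeit

words = "123 main st apt 4 springfield".split()

t_set = timeit.timeit(lambda: set(words), number=100_000)
t_frozen = timeit.timeit(lambda: frozenset(words), number=100_000)

print(f"set:       {t_set:.3f}s")
print(f"frozenset: {t_frozen:.3f}s")
```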

github-actions[bot] commented 1 year ago

All benchmarks (diff):

before after ratio benchmark
536M 536M 1.00 canonical.Canonical.peakmem_run
14.6±0.02s 14.6±0.03s 1.00 canonical.Canonical.time_run
0.904 0.87 0.96 canonical.Canonical.track_precision
0.902 0.902 1.00 canonical.Canonical.track_recall
237M 240M 1.01 canonical_gazetteer.Gazetteer.peakmem_run(None)
13.5±0.02s 13.5±0.02s 1.00 canonical_gazetteer.Gazetteer.time_run(None)
0.982 0.982 1.00 canonical_gazetteer.Gazetteer.track_precision(None)
0.973 0.982 1.01 canonical_gazetteer.Gazetteer.track_recall(None)
237M 240M 1.01 canonical_matching.Matching.peakmem_run({'threshold': 0.5, 'constraint': 'many-to-one'})
236M 240M 1.01 canonical_matching.Matching.peakmem_run({'threshold': 0.5})
11.9±0s 12.2±0.02s 1.02 canonical_matching.Matching.time_run({'threshold': 0.5, 'constraint': 'many-to-one'})
12.0±0s 12.0±0.02s 1.00 canonical_matching.Matching.time_run({'threshold': 0.5})
0.99 0.981 0.99 canonical_matching.Matching.track_precision({'threshold': 0.5, 'constraint': 'many-to-one'})
0.99 0.99 1.00 canonical_matching.Matching.track_precision({'threshold': 0.5})
0.911 0.911 1.00 canonical_matching.Matching.track_recall({'threshold': 0.5, 'constraint': 'many-to-one'})
0.911 0.911 1.00 canonical_matching.Matching.track_recall({'threshold': 0.5})

(logs)

fgregg commented 1 year ago

okay, that slowdown was just line noise. i think this is a nice improvement and standardization even if it's not a clear performance win.