dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication, and entity resolution.
https://docs.dedupe.io
MIT License
4.15k stars · 551 forks

Change predicate function signatures #1147

Closed lmores closed 1 year ago

lmores commented 1 year ago

This PR implements the changes discussed in #1146 and some consequences thereof.

Changes contained in this PR

TO DO?

@fgregg: as you see I took the liberty to apply some more changes other than those we discussed in #1146, let my know if you agree on them or if I have to revert something.
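For readers outside the thread, the kind of signature change discussed in #1146 can be sketched as follows. This is an illustrative example, not dedupe's actual code: the function name and field handling are hypothetical; the point is the shift from returning an ordered tuple of block keys to returning a frozenset, which deduplicates keys and is hashable.

```python
# Illustrative sketch (not dedupe's actual API): a predicate function's
# return type changes from a tuple of block keys to a frozenset.

def word_predicate_old(field: str) -> tuple[str, ...]:
    # Old-style signature: ordered, may contain duplicate keys.
    return tuple(field.split())

def word_predicate_new(field: str) -> frozenset[str]:
    # New-style signature: unordered, deduplicated, hashable.
    return frozenset(field.split())

print(word_predicate_old("main st main st"))  # ('main', 'st', 'main', 'st')
print(word_predicate_new("main st main st"))  # frozenset({'main', 'st'})
```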

codecov[bot] commented 1 year ago

Codecov Report

Base: 73.83% // Head: 73.72% // Decreases project coverage by 0.11% :warning:

Coverage data is based on head (d97e48c) compared to base (4e44a5f). Patch coverage: 91.30% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1147      +/-   ##
==========================================
- Coverage   73.83%   73.72%   -0.11%
==========================================
  Files          28       29       +1
  Lines        2308     2322      +14
==========================================
+ Hits         1704     1712       +8
- Misses        604      610       +6
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| dedupe/training.py | `63.03% <50.00%> (-0.46%)` | :arrow_down: |
| dedupe/labeler.py | `75.56% <66.66%> (-0.69%)` | :arrow_down: |
| dedupe/predicates.py | `76.47% <70.37%> (-7.49%)` | :arrow_down: |
| dedupe/_typing.py | `92.95% <100.00%> (ø)` | |
| dedupe/predicate_functions.py | `100.00% <100.00%> (ø)` | |
| dedupe/variables/string.py | `86.84% <100.00%> (ø)` | |

:umbrella: View full report at Codecov.

lmores commented 1 year ago

Is there a way to run all CI tests locally (including all the different Python versions) to avoid pushing just to discover problems on GitHub? (sorry for the noob question)

lmores commented 1 year ago

Sorry for the mess of force pushes; it seems I fixed all the problems except those raised by mypy and black: on my machine they run without errors, so I cannot reproduce (and fix) them.

I'm done force-pushing patches -.-'

fgregg commented 1 year ago

just rebase this onto main so we can make sure the linting passes and we’ll be good to go

fgregg commented 1 year ago

@benchmark

github-actions[bot] commented 1 year ago

All benchmarks (diff):

before after ratio benchmark
536M 536M 1.00 canonical.Canonical.peakmem_run
13.7±0s 14.9±0.4s 1.08 canonical.Canonical.time_run
0.962 0.935 0.97 canonical.Canonical.track_precision
0.902 0.902 1.00 canonical.Canonical.track_recall
236M 240M 1.02 canonical_gazetteer.Gazetteer.peakmem_run(None)
12.8±0.2s 13.7±0.08s 1.07 canonical_gazetteer.Gazetteer.time_run(None)
0.982 0.982 1.00 canonical_gazetteer.Gazetteer.track_precision(None)
0.982 0.982 1.00 canonical_gazetteer.Gazetteer.track_recall(None)
236M 240M 1.02 canonical_matching.Matching.peakmem_run({'threshold': 0.5, 'constraint': 'many-to-one'})
235M 240M 1.02 canonical_matching.Matching.peakmem_run({'threshold': 0.5})
11.7±0.01s 11.8±0.02s 1.00 canonical_matching.Matching.time_run({'threshold': 0.5, 'constraint': 'many-to-one'})
11.8±0.02s 11.9±0.04s 1.01 canonical_matching.Matching.time_run({'threshold': 0.5})
0.99 0.99 1.00 canonical_matching.Matching.track_precision({'threshold': 0.5, 'constraint': 'many-to-one'})
0.99 0.99 1.00 canonical_matching.Matching.track_precision({'threshold': 0.5})
0.911 0.911 1.00 canonical_matching.Matching.track_recall({'threshold': 0.5, 'constraint': 'many-to-one'})
0.911 0.911 1.00 canonical_matching.Matching.track_recall({'threshold': 0.5})

(logs)

fgregg commented 1 year ago

i'm a bit concerned by the slowdown in what is supposed to be a performance-improving PR. let's run @benchmark again

lmores commented 1 year ago

Hmm, with frozensets in place I no longer expect any performance improvement, as nothing in the long comment I placed at the beginning of predicate_functions.py applies anymore.

fgregg commented 1 year ago

i thought your argument was that not casting to sets in the labeler routines was where we would see performance gains?
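The argument referenced here can be sketched like this. Names and signatures are hypothetical, not dedupe's labeler API; the sketch only shows why returning a frozenset from a predicate can save work downstream: a consumer that needs set operations no longer has to cast the result on every call.

```python
# Hypothetical sketch (illustrative names, not dedupe's labeler code):
# with tuple-returning predicates, every set operation needs a cast;
# with frozenset-returning predicates, the value is used directly.

def shared_keys_old(keys_a: tuple[str, ...], keys_b: tuple[str, ...]) -> bool:
    # Old signatures returned tuples: cast to set on every call.
    return bool(set(keys_a) & set(keys_b))

def shared_keys_new(keys_a: frozenset[str], keys_b: frozenset[str]) -> bool:
    # New signatures return frozensets: intersection works directly.
    return bool(keys_a & keys_b)

print(shared_keys_new(frozenset({"main", "st"}), frozenset({"st", "apt"})))
```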

fgregg commented 1 year ago

@benchmark

lmores commented 1 year ago

Yes, you are right, there was also that argument... could it be that frozensets are slightly less performant than sets in CPython? (don't know, just wondering)
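One way to check the question above is a quick micro-benchmark of set versus frozenset construction in CPython. This is a sketch with made-up sample data; on most machines the two are within noise of each other, so no expected numbers are claimed here.

```python
# Micro-benchmark sketch: construction cost of set vs frozenset in
# CPython. Sample data is arbitrary; timings vary by machine.
import timeit

words = "123 main st apt 4 springfield".split()

t_set = timeit.timeit(lambda: set(words), number=100_000)
t_frozen = timeit.timeit(lambda: frozenset(words), number=100_000)

print(f"set:       {t_set:.3f}s")
print(f"frozenset: {t_frozen:.3f}s")
```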

github-actions[bot] commented 1 year ago

All benchmarks (diff):

before after ratio benchmark
536M 536M 1.00 canonical.Canonical.peakmem_run
14.6±0.02s 14.6±0.03s 1.00 canonical.Canonical.time_run
0.904 0.87 0.96 canonical.Canonical.track_precision
0.902 0.902 1.00 canonical.Canonical.track_recall
237M 240M 1.01 canonical_gazetteer.Gazetteer.peakmem_run(None)
13.5±0.02s 13.5±0.02s 1.00 canonical_gazetteer.Gazetteer.time_run(None)
0.982 0.982 1.00 canonical_gazetteer.Gazetteer.track_precision(None)
0.973 0.982 1.01 canonical_gazetteer.Gazetteer.track_recall(None)
237M 240M 1.01 canonical_matching.Matching.peakmem_run({'threshold': 0.5, 'constraint': 'many-to-one'})
236M 240M 1.01 canonical_matching.Matching.peakmem_run({'threshold': 0.5})
11.9±0s 12.2±0.02s 1.02 canonical_matching.Matching.time_run({'threshold': 0.5, 'constraint': 'many-to-one'})
12.0±0s 12.0±0.02s 1.00 canonical_matching.Matching.time_run({'threshold': 0.5})
0.99 0.981 0.99 canonical_matching.Matching.track_precision({'threshold': 0.5, 'constraint': 'many-to-one'})
0.99 0.99 1.00 canonical_matching.Matching.track_precision({'threshold': 0.5})
0.911 0.911 1.00 canonical_matching.Matching.track_recall({'threshold': 0.5, 'constraint': 'many-to-one'})
0.911 0.911 1.00 canonical_matching.Matching.track_recall({'threshold': 0.5})

(logs)

fgregg commented 1 year ago

okay, that slowdown was just line noise. i think this is a nice improvement and standardization even if it's not a clear performance win.