lmores closed this pull request 1 year ago
Base: 73.83% // Head: 73.72% // Decreases project coverage by -0.11% :warning:

Coverage data is based on head (d97e48c) compared to base (4e44a5f). Patch coverage: 91.30% of modified lines in pull request are covered.

:umbrella: View full report at Codecov.
Is there a way to run all the CI tests locally (including all the different Python versions) to avoid pushing and finding out about problems on GitHub? (sorry for the noob question)
Sorry for the mess of force pushes; it seems I fixed all problems except those raised by mypy and black: on my machine they run fine without errors, so I cannot reproduce (and fix) them.
I'm done force-pushing patches -.-'
just rebase this onto main so we can make sure the linting passes and we’ll be good to go
@benchmark
| before | after | ratio | benchmark |
|---|---|---|---|
| 536M | 536M | 1.00 | canonical.Canonical.peakmem_run |
| 13.7±0s | 14.9±0.4s | 1.08 | canonical.Canonical.time_run |
| 0.962 | 0.935 | 0.97 | canonical.Canonical.track_precision |
| 0.902 | 0.902 | 1.00 | canonical.Canonical.track_recall |
| 236M | 240M | 1.02 | canonical_gazetteer.Gazetteer.peakmem_run(None) |
| 12.8±0.2s | 13.7±0.08s | 1.07 | canonical_gazetteer.Gazetteer.time_run(None) |
| 0.982 | 0.982 | 1.00 | canonical_gazetteer.Gazetteer.track_precision(None) |
| 0.982 | 0.982 | 1.00 | canonical_gazetteer.Gazetteer.track_recall(None) |
| 236M | 240M | 1.02 | canonical_matching.Matching.peakmem_run({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| 235M | 240M | 1.02 | canonical_matching.Matching.peakmem_run({'threshold': 0.5}) |
| 11.7±0.01s | 11.8±0.02s | 1.00 | canonical_matching.Matching.time_run({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| 11.8±0.02s | 11.9±0.04s | 1.01 | canonical_matching.Matching.time_run({'threshold': 0.5}) |
| 0.99 | 0.99 | 1.00 | canonical_matching.Matching.track_precision({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| 0.99 | 0.99 | 1.00 | canonical_matching.Matching.track_precision({'threshold': 0.5}) |
| 0.911 | 0.911 | 1.00 | canonical_matching.Matching.track_recall({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| 0.911 | 0.911 | 1.00 | canonical_matching.Matching.track_recall({'threshold': 0.5}) |
i'm a bit concerned by the slowdown in what is supposed to be a performance-improving PR. let's run @benchmark again
Mmm, with frozensets in place I no longer expect any performance improvement, as nothing in the long comment that I placed at the beginning of predicate_functions.py applies anymore.
i thought your argument was that not casting to sets in the labeler routines was where we would see performance gains?
@benchmark
Yes, you are right, there was also that argument... could it be that frozensets are slightly less performant than sets in CPython? (don't know, just wondering)
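For anyone who wants to sanity-check that hunch locally, a minimal micro-benchmark sketch (times vary by machine and CPython version, so this is not evidence either way):

```python
import timeit

# Compare construction and membership cost for set vs. frozenset
# on a small input, similar in size to a predicate's output.
setup = "items = tuple(range(8)); s = set(items); fs = frozenset(items)"
for stmt in ("set(items)", "frozenset(items)", "3 in s", "3 in fs"):
    seconds = timeit.timeit(stmt, setup=setup, number=1_000_000)
    print(f"{stmt:18} {seconds:.3f}s per 1M runs")
```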
| before | after | ratio | benchmark |
|---|---|---|---|
| 536M | 536M | 1.00 | canonical.Canonical.peakmem_run |
| 14.6±0.02s | 14.6±0.03s | 1.00 | canonical.Canonical.time_run |
| 0.904 | 0.87 | 0.96 | canonical.Canonical.track_precision |
| 0.902 | 0.902 | 1.00 | canonical.Canonical.track_recall |
| 237M | 240M | 1.01 | canonical_gazetteer.Gazetteer.peakmem_run(None) |
| 13.5±0.02s | 13.5±0.02s | 1.00 | canonical_gazetteer.Gazetteer.time_run(None) |
| 0.982 | 0.982 | 1.00 | canonical_gazetteer.Gazetteer.track_precision(None) |
| 0.973 | 0.982 | 1.01 | canonical_gazetteer.Gazetteer.track_recall(None) |
| 237M | 240M | 1.01 | canonical_matching.Matching.peakmem_run({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| 236M | 240M | 1.01 | canonical_matching.Matching.peakmem_run({'threshold': 0.5}) |
| 11.9±0s | 12.2±0.02s | 1.02 | canonical_matching.Matching.time_run({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| 12.0±0s | 12.0±0.02s | 1.00 | canonical_matching.Matching.time_run({'threshold': 0.5}) |
| 0.99 | 0.981 | 0.99 | canonical_matching.Matching.track_precision({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| 0.99 | 0.99 | 1.00 | canonical_matching.Matching.track_precision({'threshold': 0.5}) |
| 0.911 | 0.911 | 1.00 | canonical_matching.Matching.track_recall({'threshold': 0.5, 'constraint': 'many-to-one'}) |
| 0.911 | 0.911 | 1.00 | canonical_matching.Matching.track_recall({'threshold': 0.5}) |
okay, that slowdown was just line noise. i think this is a nice improvement and standardization even if it's not a clear performance win.
This PR implements the changes discussed in #1146 and some consequences thereof.
Changes contained in this PR

Predicate functions now live in `predicate_functions.py`, and the tests for them, previously contained in `test_predicates.py`, have been moved to `test_predicate_functions.py`.

The implementation of `oneGramFingerprint()` and `twoGramFingerprint()` has changed (should be better now), without affecting the output. Inside `test_predicate_functions.py` there are a few assertions that compare the output of the new and the old implementations.
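For context, a simplified sketch of what a one-gram fingerprint computes (illustrative only; the actual implementations in `predicate_functions.py` differ in their details):

```python
def one_gram_fingerprint(field: str) -> tuple[str, ...]:
    # Unique characters of the field with spaces removed, sorted and
    # joined into a single string; at most one item, hence a tuple.
    chars = sorted(set(field.replace(" ", "")))
    return ("".join(chars),) if chars else ()

assert one_gram_fingerprint("deduplicate data") == ("acdeilptu",)
```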
We agreed that each predicate function should return a `set`, since we want unique items and we do not care about the order. However, as explained at the beginning of `predicate_functions.py`, the starting capacity of sets in the CPython implementation is 4. Since many predicate functions always return collections with at most one item, always using sets would require much more memory than actually needed. To avoid wasting memory we return tuples with just one (or zero) items whenever the output of a predicate function never contains more than one item, and we return actual sets otherwise.
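A quick way to see the memory argument in action (exact byte counts depend on the CPython version and platform):

```python
import sys

# On 64-bit CPython a one-item set is several times larger than a
# one-item tuple, because sets preallocate room for more entries.
for obj in (("abc",), {"abc"}, frozenset({"abc"})):
    print(f"{type(obj).__name__:9} {sys.getsizeof(obj)} bytes")
```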
This change affects `training.py` and `labeler.py`. In both of them we had to check whether the output of two predicates had a non-empty intersection. The old way to do so was to build `set(pred_1_output) & set(pred_2_output)` and check whether it is empty or not. The new approach is to check whether they share at least one item, without computing the whole intersection. Since `pred_1_output` and `pred_2_output` are sets, or tuples with at most one item, it is "fast" to loop through the smaller one and check whether at least one of its items is contained in the other.
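A minimal sketch of that check (`share_item` is a hypothetical name, not the PR's actual helper):

```python
def share_item(pred_1_output, pred_2_output) -> bool:
    # Iterate over the smaller collection and stop at the first shared
    # item, instead of materializing the full intersection with
    # set(pred_1_output) & set(pred_2_output).  Membership tests on a
    # tiny tuple are effectively constant time here.
    smaller, larger = sorted((pred_1_output, pred_2_output), key=len)
    return any(item in larger for item in smaller)
```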
In `predicate_functions.py`, the pattern used to split fields into tokens changed; in this way strings like `"go-away"` are considered to be made of two tokens, instead of just one (I think this is the expected behaviour).

TO DO?
Rewrite `training.py` and `labeler.py` in Cython? (in another PR)

@fgregg: as you see I took the liberty to apply some more changes other than those we discussed in #1146, let me know if you agree with them or if I have to revert something.