Closed lmores closed 1 year ago
Base: 73.84% // Head: 73.84% // No change to project coverage :thumbsup:
Coverage data is based on head (
e88470b
) compared to base (baa6071
). Patch coverage: 100.00% of modified lines in pull request are covered.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
the uniqueness guarantee is required
the uniqueness guarantee is required
But was not enforced, right?
it was enforced by the set object
hmm that’s true
Sorry, but I don't understand. At the moment I see not set() object inside cpredicates.pyx, in particular ngrams is a list. I can make it a set if necessary.
you are right that the current code doesn’t enforce uniqueness, i’ll have to check where that is enforced
everywhere we call this in predicates.py, we call set on it.
it's a bit silly to do this. let's have cpredicates fill out a set and then not have those set calls in predicates.py.
Actually there is one exception:
class TfidfNGramPredicate(IndexPredicate):
def preprocess(self, doc: str) -> Sequence[str]:
return tuple(sorted(ngrams(" ".join(strip_punc(doc).split()), 2)))
But we probably want the ngrams to be unique also here?
ah.. we actually don't want ngrams to be unique there.
How about the unique_ngrams
function I added in the last commit?
looks good!
Changes:
Not sure how to check runtime improvement, using
python benchmarks/benchmarks/canonical.py
execution time is11,xxx
seconds both before and after the change.@fgregg: I am likely to open many more PR like this. Please tell me if you are fine with them. Of course, if I plan to submit bigger changes I will open a thread to discuss them before actually implementing them.