Open bskinn opened 2 years ago
I notice that underscores don't contribute much to the matching score. Is this from the scoring heuristic or from the regex?
For instance, searching `sphobjinv suggest https://docs.python.org/3/library '__module__' -su` scores everything containing the string "module" with 90, including every occurrence of a `:py:module:` suggestion.
It's just the way `fuzzywuzzy` works, @eirrgang -- I believe it strips non-alphanumeric characters, which is why the underscores don't affect the match.
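To make that effect concrete, here's a minimal sketch. It assumes a preprocessing step that swaps every non-alphanumeric character for whitespace; `normalize` and `similarity` are stand-ins written for illustration (with `difflib` doing the comparison), not `fuzzywuzzy`'s actual code.

```python
import re
from difflib import SequenceMatcher

# Hypothetical normalizer: anything other than ASCII letters/digits becomes a space.
_NON_ALNUM = re.compile(r"[^0-9A-Za-z]+")


def normalize(text):
    """Lowercase the text and drop non-alphanumeric characters."""
    return _NON_ALNUM.sub(" ", text).strip().lower()


def similarity(a, b):
    """Return a 0-100 similarity between two strings."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


query = "__module__"
candidate = "module"

print(normalize(query))                                    # module
print(similarity(normalize(query), normalize(candidate)))  # 100
print(similarity(query, candidate))                        # 75
```

Under that assumption the underscores are gone before any comparison happens, so `__module__` and `module` become indistinguishable to the scorer.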
In terms of the broader question of scoring quality, `fuzzywuzzy` does a Levenshtein-style string-diff calculation, and then transforms that to a 0-100 scale in some fashion. I haven't ever taken a close look at what it's doing... it's definitely not an optimal scoring function, but it worked well enough when I was looking for something lightweight and easy to integrate.
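For a rough picture of what "edit distance mapped onto 0-100" can look like, here's a small sketch. The normalization (dividing by the longer string's length) is one illustrative choice and is not claimed to be the formula `fuzzywuzzy` uses; `levenshtein` and `scaled_score` are names made up for this example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep a match
                )
            )
        previous = current
    return previous[-1]


def scaled_score(a, b):
    """Map edit distance onto 0-100 (illustrative normalization only)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


print(levenshtein("kitten", "sitting"))    # 3
print(scaled_score("module", "module"))    # 100
```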
I want to work toward #207, so that users can develop and use higher-quality and/or customized scoring functions. I'm picturing a full plugin system, so that rather than housing a bunch of different scoring functions in `sphobjinv` itself, they can be maintained as separate `sphobjinv-scoring-foo` packages. Not sure how long that will take, though. I suspect the work I'm going to need to do to implement #178 will also move part of the way toward #207... will see.
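One way such a plugin system could be wired up is through packaging entry points. The sketch below is only that: the group name `sphobjinv.scoring`, the `(query, candidate) -> int` scorer signature, and both function names are assumptions for illustration, and nothing like this exists in `sphobjinv` today.

```python
from difflib import SequenceMatcher
from importlib.metadata import entry_points

# Hypothetical entry-point group name; sphobjinv does not define this.
SCORER_GROUP = "sphobjinv.scoring"


def default_scorer(query, candidate):
    """Built-in fallback scorer returning an integer in 0-100."""
    return round(100 * SequenceMatcher(None, query, candidate).ratio())


def discover_scorers(group=SCORER_GROUP):
    """Collect scorers advertised by installed packages, plus the default."""
    scorers = {"default": default_scorer}
    eps = entry_points()
    # Python 3.10+ exposes .select(); older versions return a plain dict.
    if hasattr(eps, "select"):
        selected = eps.select(group=group)
    else:
        selected = eps.get(group, [])
    for ep in selected:
        scorers[ep.name] = ep.load()
    return scorers


scorers = discover_scorers()
print(sorted(scorers))
```

A hypothetical `sphobjinv-scoring-foo` package would then advertise its callable under that group in its own packaging metadata, and nothing in the core package would need to change to pick it up.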
The current stringify-then-regex-extract approach is kind of horrifying. ~I must have been on a regex kick when I wrote it.~ But, it was the best way I could think of at the time to retain each object's `index` value when passed into `fuzzywuzzy.process`.
Should be possible to just keep everything as tuples throughout? Catch might be on trying to implement #213, since a multiprocess implementation likely won't retain the ordering of the items, and so a simple `enumerate(...)` on the (e.g.) `fwp.process()` call might end up with meaningless index values.
On the other hand, this sort of ugly stringified approach will probably be horrid for trying to integrate with pluggable scoring-callables (#207), and a re-implementation will be needed anyway.
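A sketch of the tuples-throughout idea: attach the index to each item before dispatch and have the scorer hand it back, so the order in which results arrive stops mattering. A thread pool stands in for the multiprocess case here just to keep the example self-contained; `score_item` and `suggest` are hypothetical names, and `difflib` is a placeholder scorer.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from difflib import SequenceMatcher


def score_item(query, index, name):
    """Score one candidate and return its index alongside the result."""
    score = round(100 * SequenceMatcher(None, query, name).ratio())
    return index, name, score


def suggest(query, names, threshold=50):
    """Return (name, score, index) tuples at or above threshold, best first."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The index is bound to each item here, before any work is dispatched.
        futures = [
            pool.submit(score_item, query, index, name)
            for index, name in enumerate(names)
        ]
        # as_completed yields in completion order, not submission order.
        results = [future.result() for future in as_completed(futures)]
    hits = [(name, score, index) for index, name, score in results if score >= threshold]
    return sorted(hits, key=lambda hit: (-hit[1], hit[2]))


names = ["module", "modulefinder", "os.path", "__module__"]
print(suggest("__module__", names))
# [('__module__', 100, 3), ('module', 75, 0), ('modulefinder', 55, 1)]
```

The trade-off is that every scoring callable has to accept and return the index (or be wrapped by something that does), which is a constraint any third-party scorer would inherit.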
Might be better to implement my own scoring based on `difflib`, and move away from `fuzzywuzzy`? Or, perhaps switch to a single-string scoring function within `fuzzywuzzy`? (That switch doesn't help with the possible loss of ordering by multiprocessed and/or third-party scoring functions....)
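For reference, a `difflib`-based single-string scorer could be as small as the sketch below. The function names, the case-folding, and the `cutoff` default are illustrative assumptions; the cheap pre-checks rely on `real_quick_ratio()` and `quick_ratio()` being upper bounds on `ratio()`.

```python
from difflib import SequenceMatcher


def difflib_score(query, name, cutoff=0.5):
    """Return a 0-100 score, or 0 when the pair cannot reach the cutoff."""
    matcher = SequenceMatcher(None, query.lower(), name.lower())
    # Both quick ratios are cheap upper bounds, so failing either one
    # means the full ratio() cannot reach the cutoff.
    if matcher.real_quick_ratio() < cutoff or matcher.quick_ratio() < cutoff:
        return 0
    ratio = matcher.ratio()
    return round(100 * ratio) if ratio >= cutoff else 0


def rank(query, names, cutoff=0.5):
    """Return (name, score, index) tuples for nonzero scores, best first."""
    scored = (
        (name, difflib_score(query, name, cutoff), index)
        for index, name in enumerate(names)
    )
    hits = [hit for hit in scored if hit[1] > 0]
    return sorted(hits, key=lambda hit: (-hit[1], hit[2]))


print(rank("module", ["os.path", "module", "modules"]))
# [('module', 100, 1), ('modules', 92, 2)]
```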