bskinn / sphobjinv

Toolkit for manipulation and inspection of Sphinx objects.inv files
https://sphobjinv.readthedocs.io
MIT License

Revise handling of index and score in `Inventory.suggest` #220

Open bskinn opened 2 years ago

bskinn commented 2 years ago

The current stringify-then-regex-extract approach is kind of horrifying. ~I must have been on a regex kick when I wrote it.~ But, it was the best way I could think of at the time to retain each object's index value when passed into fuzzywuzzy.process.

Should be possible to just keep everything as tuples throughout? Catch might be on trying to implement #213, since a multiprocess implementation likely won't retain the ordering of the items, and so a simple enumerate(...) on the (e.g.) fwp.process() call might end up with meaningless index values.
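A minimal sketch of the tuples-throughout idea, assuming the index is attached to each name before scoring rather than applied with `enumerate(...)` to the scored output. `difflib.SequenceMatcher` stands in for the fuzzywuzzy scorer here, the shuffle only simulates a pool returning results out of order, and `score_with_index` is a hypothetical helper, not part of sphobjinv.

```python
import random
from difflib import SequenceMatcher


def score_with_index(query, names):
    """Return (name, score, index) tuples, best score first.

    Hypothetical helper: difflib is a stand-in for the real scorer.
    """
    # Pair each name with its original position *before* any scoring,
    # so the index travels with the item instead of being inferred later.
    indexed = list(enumerate(names))

    # Simulate workers handing results back in arbitrary order.
    random.shuffle(indexed)

    results = [
        (name, round(100 * SequenceMatcher(None, query, name).ratio()), idx)
        for idx, name in indexed
    ]

    # Sort by descending score; break ties by original index.
    return sorted(results, key=lambda t: (-t[1], t[2]))


names = ["os.path.join", "str.join", "shlex.join"]
results = score_with_index("str.join", names)
```

Because the index rides along inside each tuple, `names[idx]` still recovers the right object no matter what order the scoring step returns its results in.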

On the other hand, this sort of ugly stringified approach will probably be horrid for trying to integrate with pluggable scoring-callables (#207), and a re-implementation will be needed anyways.

Might be better to implement my own scoring based on difflib and move away from fuzzywuzzy? Or, perhaps switch to a single-string scoring function within fuzzywuzzy? (That switch doesn't help dealing with possible loss of ordering by multiprocessed and/or third-party scoring functions....)
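For a sense of what a difflib-based, single-string scoring function could look like, here is a rough sketch. The name `difflib_score`, the 0-100 rounding, and the lowercasing are all assumptions for illustration; none of this reproduces fuzzywuzzy's actual scoring.

```python
from difflib import SequenceMatcher


def difflib_score(query, candidate):
    """Score one candidate against the query on a 0-100 integer scale.

    Hypothetical function: SequenceMatcher.ratio() returns a float in
    [0, 1], which is scaled and rounded here.
    """
    # Lowercasing is an illustrative choice, not a known requirement.
    matcher = SequenceMatcher(None, query.lower(), candidate.lower())
    return round(100 * matcher.ratio())


identical = difflib_score("module", "MODULE")
disjoint = difflib_score("abc", "xyz")
partial = difflib_score("abcd", "bcde")
```

Since the function scores a single pair of strings, the caller decides how to iterate over the inventory and what to carry alongside each name.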

eirrgang commented 2 years ago

I notice that underscores don't contribute much to the matching score. Is this from the scoring heuristic or from the regex?

For instance, running `sphobjinv suggest https://docs.python.org/3/library '__module__' -su` scores everything containing the string "module" with 90, including every occurrence of a `:py:module:` suggestion.

bskinn commented 2 years ago

It's just the way fuzzywuzzy works, @eirrgang -- I believe it strips non-alphanumeric characters, which is why the underscores don't affect the match.
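To illustrate the effect described above, the sketch below compares scores with and without a preprocessing step that drops non-alphanumeric characters. This is only an illustration under that assumption: `strip_non_alnum` is a hypothetical helper, and `difflib` is used in place of fuzzywuzzy, so neither the preprocessing nor the numbers should be read as fuzzywuzzy's real behavior.

```python
import re
from difflib import SequenceMatcher


def strip_non_alnum(text):
    """Hypothetical preprocessing: non-alphanumerics become spaces,
    then the result is trimmed and lowercased."""
    return re.sub(r"[^0-9a-zA-Z]+", " ", text).strip().lower()


def ratio(a, b):
    """Similarity of two strings on a 0-100 scale, via difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Scored as typed, the underscores count against the match.
raw = ratio("__module__", "module")

# After preprocessing, both strings reduce to "module".
cleaned = ratio(strip_non_alnum("__module__"), strip_non_alnum("module"))
```

If the scorer preprocesses this way, `__module__` and `module` become indistinguishable before any comparison happens, which would be consistent with the underscores having no effect on the result.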

In terms of the broader question of scoring quality, fuzzywuzzy does a Levenshtein-style string-diff calculation, and then transforms that to a 0-100 scale in some fashion. I haven't ever taken a close look at what it's doing... it's definitely not an optimal scoring function, but it worked well enough when I was looking for something lightweight and easy to integrate.

I want to work toward #207, so that users can develop and use higher-quality and/or customized scoring functions. I'm picturing a full plugin system, so that rather than housing a bunch of different scoring functions in sphobjinv itself, they can be maintained as separate sphobjinv-scoring-foo packages. Not sure how long that will take, though. I suspect the work I'm going to need to do to implement #178 will also move part of the way toward #207... will see.
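As a rough sketch of the pluggable-scorer idea, the code below treats a scorer as any callable taking `(query, candidate)` and returning a 0-100 integer. The `Scorer` alias, both scorer functions, and this standalone `suggest` are hypothetical, and plugin discovery from separate packages is not shown.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# Assumed interface: (query, candidate) -> score in 0..100.
Scorer = Callable[[str, str], int]


def difflib_scorer(query: str, candidate: str) -> int:
    """Default stand-in scorer based on difflib."""
    return round(100 * SequenceMatcher(None, query, candidate).ratio())


def prefix_scorer(query: str, candidate: str) -> int:
    """Toy alternative scorer: all-or-nothing prefix match."""
    return 100 if candidate.startswith(query) else 0


def suggest(
    query: str,
    names: List[str],
    scorer: Scorer = difflib_scorer,
    thresh: int = 50,
) -> List[Tuple[str, int, int]]:
    """Return (name, score, index) tuples at or above thresh, best first."""
    # The index is attached here, so the scorer never has to know about it.
    scored = [(name, scorer(query, name), idx) for idx, name in enumerate(names)]
    kept = [item for item in scored if item[1] >= thresh]
    return sorted(kept, key=lambda t: (-t[1], t[2]))


names = ["str.join", "os.path.join", "str.split"]
by_prefix = suggest("str.", names, scorer=prefix_scorer)
```

Keeping the scorer ignorant of indices means a third-party scoring function only ever sees two strings, and the index bookkeeping stays in one place.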