Closed: vidma closed this issue 6 years ago
I have implemented an early prototype of IR-based dataset search (see 989249147c92e4b22b05497377293ab9d0cfbdd8).
The index is built by tokenizing (splitting) the dataset name into separate tokens on -_/. Only full-token matching is currently in use; it might return fewer results, but the matches should be more accurate.
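To illustrate the tokenization step, here is a minimal sketch in plain Python (the actual index uses whoosh; the dataset path below is a made-up example):

```python
import re

def tokenize(name):
    """Split a dataset name into tokens on '-', '_' and '/', dropping empties."""
    return [t for t in re.split(r"[-_/]", name) if t]

# hypothetical dataset path, for illustration only
print(tokenize("/SingleMu/Run2012A-PromptReco-v1/AOD"))
# ['SingleMu', 'Run2012A', 'PromptReco', 'v1', 'AOD']
```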
With all the DBS3 instances, the IR index is around 80MB, and building it takes a couple of minutes.
Querying the index is fast (well sub-second), even though the whoosh library is pure Python.
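Whoosh handles the indexing internally, but the reason full-token lookups are fast can be sketched with a plain-stdlib inverted index (dataset names below are made up for illustration):

```python
from collections import defaultdict
import re

def tokenize(name):
    """Split a dataset name into tokens on '-', '_' and '/', dropping empties."""
    return [t for t in re.split(r"[-_/]", name) if t]

def build_index(datasets):
    """Inverted index: lowercased token -> set of dataset names containing it."""
    index = defaultdict(set)
    for name in datasets:
        for tok in tokenize(name):
            index[tok.lower()].add(name)
    return index

# hypothetical dataset names, for illustration only
datasets = [
    "/SingleMu/Run2012A-PromptReco-v1/AOD",
    "/DoubleMu/Run2012A-PromptReco-v1/AOD",
]
idx = build_index(datasets)
print(sorted(idx["promptreco"]))  # full-token lookup is a single dict access
print("prompt" in idx)            # False: partial tokens do not match
```

The index is built once (the slow part) and each query is then a handful of dictionary lookups, which is why query time stays well under a second.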
As building the index takes a considerable amount of time, it should be rebuilt only from time to time via a cron job, or built incrementally (a bit more complex).
Now we have to think about how to integrate it with DAS. Shall it be shown before KWS (though I would not show any KWS suggestions until the user provides a regular pattern with spaces), sort of as a separate query-processing step like we have with wildcards, or, probably better, just within autocompletion?
If the user types: dataset="abc def gef"
the autocompletion would provide some of the matching suggestions based on the IR search. This would be the easiest way to integrate this feature, so that is what I propose.
Here is the first try to integrate it into autocompletion. I experimented with two versions:

- all tokens are optional to match (pure IR): slow for longer token lists
- all tokens must be matched, with ordering defined by token "uniqueness" and same-case matches scored higher: really fast

So far there are no partial token matches, e.g. patch would not match anything, as the full token is patch1. I guess requiring all tokens to match makes more sense in autocompletion; partial token matches would be nice, though.

P.S. https://github.com/dmwm/DAS/compare/ir_dataset_searcher?expand=1
I think requiring all tokens to match as a first approximation should be sufficient. And as far as I can tell, it looks like tokens are case-insensitive, right?
Yes, it matches both, but same-case matches are scored higher than matches of a different case.
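That scoring rule can be sketched as follows (an illustrative stand-in, not the actual whoosh scorer; the dataset path and the 1.5/1.0 weights are assumptions for the example): match tokens case-insensitively, but add a bonus when the query token's case agrees with the indexed token exactly.

```python
import re

def tokenize(name):
    """Split a dataset name into tokens on '-', '_' and '/', dropping empties."""
    return [t for t in re.split(r"[-_/]", name) if t]

def case_score(query_token, name):
    """1.0 for a case-insensitive full-token match, 1.5 if the case also agrees."""
    tokens = tokenize(name)
    if query_token in tokens:
        return 1.5                      # exact-case match scores higher
    if query_token.lower() in (t.lower() for t in tokens):
        return 1.0                      # matches, but with different case
    return 0.0

name = "/SingleMu/Run2012A-PromptReco-v1/AOD"  # made-up example
print(case_score("PromptReco", name))  # 1.5
print(case_score("promptreco", name))  # 1.0
print(case_score("prompt", name))      # 0.0 (only full tokens match)
```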
In pull request #4161, one quite good idea came up: sometimes it can be quite hard to find a dataset with the specific features a user is interested in (e.g. the user does not know the order of these features in the dataset name, and thus the current autocompletion might not be sufficient). For example [we did not discuss this very deeply], the user could type:
dataset="multijet zmm whaever"
and DAS could propose the matching dataset names in an IR way. This should be quite easy to implement, as we already have an IR engine (whoosh), and the number of datasets is not that high.
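The order-independent lookup behind this idea can be sketched with the stdlib (whoosh would do this in practice; dataset names below are made up): require every query token to appear somewhere in the dataset name, regardless of position.

```python
from collections import defaultdict
import re

def tokenize(name):
    """Split a dataset name into tokens on '-', '_' and '/', dropping empties."""
    return [t for t in re.split(r"[-_/]", name) if t]

def build_index(datasets):
    """Inverted index: lowercased token -> set of dataset names containing it."""
    index = defaultdict(set)
    for name in datasets:
        for tok in tokenize(name):
            index[tok.lower()].add(name)
    return index

def search_all_tokens(index, query):
    """Return datasets containing every query token, in any order."""
    sets = [index.get(t.lower(), set()) for t in query.split()]
    return set.intersection(*sets) if sets else set()

# hypothetical dataset names, for illustration only
datasets = [
    "/MultiJet_Zmm/Run2012-v1/AOD",
    "/Zmm_Something/Run2012-v1/AOD",
]
idx = build_index(datasets)
# token order in the query does not need to match the dataset name
print(search_all_tokens(idx, "zmm multijet"))
```

Because each token's posting set comes straight from the index, the query cost is just one set intersection per token, so this stays fast even for multi-word queries.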