Closed: vidma closed this issue 6 years ago
I have implemented an early prototype of IR-based dataset search (see 989249147c92e4b22b05497377293ab9d0cfbdd8).
The index is built by tokenizing (splitting) the dataset name into separate tokens on -_/. Only full-token matching is currently in use; it might return fewer results, but the matches should be more accurate.
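To illustrate the tokenization step, here is a minimal sketch in plain Python (the actual index uses whoosh; the dataset path below is a made-up example):

```python
import re

def tokenize(name):
    """Split a dataset name into tokens on '-', '_' and '/', dropping empties."""
    return [t for t in re.split(r"[-_/]", name) if t]

# hypothetical dataset path, for illustration only
print(tokenize("/SingleMu/Run2012A-PromptReco-v1/AOD"))
# ['SingleMu', 'Run2012A', 'PromptReco', 'v1', 'AOD']
```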
With all the DBS3 instances, the IR index is around 80MB, and building it takes a couple of minutes.
Querying the index is fast (well sub-second), even though the whoosh library is pure Python.
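Whoosh handles the indexing internally, but the reason full-token lookups are fast can be sketched with a plain-stdlib inverted index (dataset names below are made up for illustration):

```python
from collections import defaultdict
import re

def tokenize(name):
    """Split a dataset name into tokens on '-', '_' and '/', dropping empties."""
    return [t for t in re.split(r"[-_/]", name) if t]

def build_index(datasets):
    """Inverted index: lowercased token -> set of dataset names containing it."""
    index = defaultdict(set)
    for name in datasets:
        for tok in tokenize(name):
            index[tok.lower()].add(name)
    return index

# hypothetical dataset names, for illustration only
datasets = [
    "/SingleMu/Run2012A-PromptReco-v1/AOD",
    "/DoubleMu/Run2012A-PromptReco-v1/AOD",
]
idx = build_index(datasets)
print(sorted(idx["promptreco"]))  # full-token lookup is a single dict access
print("prompt" in idx)            # False: partial tokens do not match
```

The index is built once (the slow part) and each query is then a handful of dictionary lookups, which is why query time stays well under a second.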
As building the index takes a considerable amount of time, it should be rebuilt only from time to time via a cron job, or built incrementally (a bit more complex).
Now we have to think about how to integrate it with DAS. Shall it be shown before KWS (though I would not show any KWS suggestions until the user provides a regular pattern with spaces), sort of as a separate query-processing step like we have with wildcards, or, probably better, just within autocompletion?
If the user types: dataset="abc def gef"
the autocompletion would provide some of the matching suggestions based on the IR search. This would be the easiest way to integrate this feature, so that is what I propose.
Here is the first try to integrate it into autocompletion. I experimented with two versions:

- all tokens are optional to match (pure IR): slow for longer token lists
- all tokens must be matched, with ordering defined by token "uniqueness" and same-case matches scored higher: really fast

So far there are no partial token matches, e.g. patch would not match anything, as the full token is patch1. I guess requiring all tokens to match makes more sense in autocompletion; partial token matches would be nice, though.

P.S. https://github.com/dmwm/DAS/compare/ir_dataset_searcher?expand=1
I think requiring all tokens to match as a first approximation should be sufficient. And as far as I can tell, it looks like tokens are case-insensitive, right?
Yes, it matches both, but same-case matches are scored higher than matches of a different case.
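That scoring rule can be sketched as follows (an illustrative stand-in, not the actual whoosh scorer; the dataset path and the 1.5/1.0 weights are assumptions for the example): match tokens case-insensitively, but add a bonus when the query token's case agrees with the indexed token exactly.

```python
import re

def tokenize(name):
    """Split a dataset name into tokens on '-', '_' and '/', dropping empties."""
    return [t for t in re.split(r"[-_/]", name) if t]

def case_score(query_token, name):
    """1.0 for a case-insensitive full-token match, 1.5 if the case also agrees."""
    tokens = tokenize(name)
    if query_token in tokens:
        return 1.5                      # exact-case match scores higher
    if query_token.lower() in (t.lower() for t in tokens):
        return 1.0                      # matches, but with different case
    return 0.0

name = "/SingleMu/Run2012A-PromptReco-v1/AOD"  # made-up example
print(case_score("PromptReco", name))  # 1.5
print(case_score("promptreco", name))  # 1.0
print(case_score("prompt", name))      # 0.0 (only full tokens match)
```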
In pull request #4161, one quite good idea came up: sometimes it can be quite hard to find a dataset with the specific features a user is interested in (e.g. the user does not know the order of these features in the dataset name, and thus the current autocompletion might not be sufficient). For example [we did not discuss this very deeply], the user could type:
dataset="multijet zmm whaever"
and DAS could propose the matching dataset names in an IR way. This should be quite easy to implement, as we already have an IR engine (whoosh), and the number of datasets is not that high.
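The order-independent lookup behind this idea can be sketched with the stdlib (whoosh would do this in practice; dataset names below are made up): require every query token to appear somewhere in the dataset name, regardless of position.

```python
from collections import defaultdict
import re

def tokenize(name):
    """Split a dataset name into tokens on '-', '_' and '/', dropping empties."""
    return [t for t in re.split(r"[-_/]", name) if t]

def build_index(datasets):
    """Inverted index: lowercased token -> set of dataset names containing it."""
    index = defaultdict(set)
    for name in datasets:
        for tok in tokenize(name):
            index[tok.lower()].add(name)
    return index

def search_all_tokens(index, query):
    """Return datasets containing every query token, in any order."""
    sets = [index.get(t.lower(), set()) for t in query.split()]
    return set.intersection(*sets) if sets else set()

# hypothetical dataset names, for illustration only
datasets = [
    "/MultiJet_Zmm/Run2012-v1/AOD",
    "/Zmm_Something/Run2012-v1/AOD",
]
idx = build_index(datasets)
# token order in the query does not need to match the dataset name
print(search_all_tokens(idx, "zmm multijet"))
```

Because each token's posting set comes straight from the index, the query cost is just one set intersection per token, so this stays fast even for multi-word queries.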