dmwm / DAS

Data Aggregation System

IR-based dataset search #4157

Closed vidma closed 6 years ago

vidma commented 10 years ago

in pull request #4161

One quite good idea came up there: it can sometimes be quite hard to find a dataset with the specific features a user is interested in (e.g. the user does not know the order of these features in the dataset name, so the current autocompletion might not be sufficient). [my ideas; we didn't discuss this in much depth]:

vidma commented 10 years ago

I have implemented an early prototype of IR-based dataset search (see 989249147c92e4b22b05497377293ab9d0cfbdd8).

The index is built by tokenizing (splitting) the dataset name into separate tokens on the characters `-`, `_` and `/`. Only full-token matches are currently used: this might return fewer results, but the matches can be expected to be more accurate.
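To make the tokenization concrete, here is a minimal sketch in plain Python. It is not the prototype's whoosh-based code; the function names and the sample dataset names are hypothetical, and only the delimiter set (`-`, `_`, `/`) and the full-token matching rule come from the comment above.

```python
import re
from collections import defaultdict

# Delimiters named in the comment above: '-', '_' and '/'.
TOKEN_SPLIT = re.compile(r"[-_/]+")


def tokenize(dataset_name):
    """Split a dataset name into its non-empty tokens."""
    return [t for t in TOKEN_SPLIT.split(dataset_name) if t]


def build_index(dataset_names):
    """Map each lowercased token to the set of dataset names containing it."""
    index = defaultdict(set)
    for name in dataset_names:
        for token in tokenize(name):
            index[token.lower()].add(name)
    return index


# Hypothetical dataset names, for illustration only.
names = [
    "/ZMM/Summer11-v1/GEN-SIM",
    "/RelValZMM/CMSSW_5_3_6-patch1/GEN-SIM-RECO",
]
index = build_index(names)

# A full token finds its dataset; a partial token finds nothing.
full_match = index.get("patch1", set())
partial_match = index.get("patch", set())
```

Because lookups are by whole token, `patch` returns nothing while `patch1` returns the second name, which is the trade-off between recall and accuracy mentioned above.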

With all the DBS3 instances, the IR index is around 80 MB, and building it takes a couple of minutes. Querying the index is fast (well under a second), even though the whoosh library is pure Python.

As building the index takes a considerable amount of time, it should be rebuilt only from time to time by a cron job, or built incrementally (a bit more complex).

vidma commented 10 years ago

Now we have to think about how to integrate it with DAS. Should it be shown before KWS (but then I would not show any KWS suggestions until the user provides a regular pattern with spaces), as a sort of separate query processing step like we have with wildcards, or, probably better, just within autocompletion?

If the user types `dataset="abc def gef"`, the autocompletion would provide some matching suggestions based on the IR search. This would be the easiest way to integrate this feature.

So I propose:

  1. autocompletion integration - should be easy
  2. (optional) if the user types `dataset="abc def gef"` (and possibly something else), they would be asked to choose or provide a dataset pattern, and only afterwards would KWS show the final query suggestions

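One way the autocompletion hook could recognize such input is sketched below. Everything here is an assumption made for illustration: the regular expression, the function names, the rule that a value starting with `/` is an ordinary dataset path or pattern rather than keywords, and the `search` callable, which stands in for whatever IR lookup is used.

```python
import re

# Hypothetical pattern for a quoted dataset value in the user's input.
DATASET_VALUE = re.compile(r'dataset\s*=\s*"([^"]*)"')


def extract_keywords(query):
    """Return the keywords inside dataset="..." or [] if there are none."""
    match = DATASET_VALUE.search(query)
    if not match:
        return []
    value = match.group(1)
    # Assumption: a value starting with '/' is a regular dataset path or
    # wildcard pattern, so it is left to the existing query handling.
    if value.startswith("/"):
        return []
    return value.split()


def suggest(query, search, limit=10):
    """Turn IR search hits into autocompletion suggestions.

    `search` is any callable taking a list of keywords and returning
    an ordered list of dataset names (a stand-in for the IR index).
    """
    keywords = extract_keywords(query)
    if not keywords:
        return []
    return ["dataset=%s" % name for name in search(keywords)[:limit]]


# Usage with a stand-in search function and a made-up dataset name.
def fake_search(keywords):
    return ["/ABC/def-gef/RAW"]


suggestions = suggest('dataset="abc def gef"', fake_search)
```

Keeping the keyword detection separate from the search call means the same check could also serve the optional second step, where the user is asked to pick a dataset pattern before KWS runs.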
vidma commented 10 years ago

Here is a first attempt at integrating it into autocompletion:

[screenshot: ir_search_results]

I experimented with two versions:

P.S. https://github.com/dmwm/DAS/compare/ir_dataset_searcher?expand=1

vkuznet commented 10 years ago

vidma wrote:

I experimented with two versions:

  • all tokens are optional (pure IR) - slow for longer token lists
  • all tokens must be matched, with ordering defined by token "uniqueness" and same-case matches scored higher - really fast
    • so far there are no partial token matches, e.g. `patch` would not match anything, as the full token is `patch1`.
    • I guess requiring all tokens to match makes more sense in autocompletion; partial token matches would be nice...

I think requiring all tokens to match should be sufficient as a first approximation. And as far as I can tell, the tokens are case-insensitive, right?

vidma commented 10 years ago

Yes, it matches both, but exact-case matches are scored higher than matches that differ in case.
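The "all tokens must match" variant with case-aware scoring can be sketched as follows. This is one possible reading, not the prototype's whoosh code: the logarithmic rarity weight, the `case_bonus` value, the choice to intersect the rarest tokens first, and the sample dataset names are all assumptions.

```python
import math
import re
from collections import defaultdict

TOKEN_SPLIT = re.compile(r"[-_/]+")


def tokenize(name):
    """Split a dataset name on '-', '_' and '/' into non-empty tokens."""
    return [t for t in TOKEN_SPLIT.split(name) if t]


def build_index(names):
    """Map each lowercased token to the set of names containing it."""
    index = defaultdict(set)
    for name in names:
        for token in tokenize(name):
            index[token.lower()].add(name)
    return index


def search(index, total, keywords, case_bonus=0.5):
    """Return names containing every keyword as a full token, best first."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings or not all(postings):
        return []  # some keyword matches no full token at all
    # Intersect the rarest tokens first to keep intermediate sets small.
    postings.sort(key=len)
    matches = set.intersection(*postings)
    scored = []
    for name in matches:
        exact_tokens = set(tokenize(name))
        score = 0.0
        for k in keywords:
            # Rarer ("more unique") tokens weigh more; assumed weighting.
            score += math.log(1.0 + float(total) / len(index[k.lower()]))
            # Same-case matches get an extra, assumed, bonus.
            if k in exact_tokens:
                score += case_bonus
        scored.append((-score, name))
    return [name for _, name in sorted(scored)]


# Hypothetical dataset names, for illustration only.
names = [
    "/relvalzmm/CMSSW_5_3_6-patch1/GEN-SIM",
    "/RelValZMM/CMSSW_5_3_6-patch1/GEN-SIM-RECO",
    "/ZMM/Summer11-v1/GEN-SIM",
]
index = build_index(names)
ranked = search(index, len(names), ["RelValZMM", "patch1"])
```

Here both RelVal names match regardless of case, but the one spelled `RelValZMM` is ranked first because it matches the keyword's case exactly, while a query for `patch` alone returns nothing.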