Heliosearch / heliosearch

The next generation of open source search
http://heliosearch.org

Expand component #1

Open joel-bernstein opened 10 years ago

joel-bernstein commented 10 years ago

This issue introduces a new search component called the Expand component. The Expand component implements group expansion for a single page of results collapsed by the CollapsingQParserPlugin.

I'll be working this ticket initially in my fork of the Heliosearch project in a branch called "expand".

https://github.com/joelbernstein2013/heliosearch

krantiparisa commented 10 years ago

How costly is it in terms of memory, along with grouping? Is it scalable with 100 groups, where each group has 3 sub groups and each sub group has 100 docs?

And can this run on top of an index with 10M docs?

joel-bernstein commented 10 years ago

The expand component works with a single page of collapsed results. So if your page has 100 groups, with 3 sub groups, with 100 docs each, the component will have to work with 30,000 documents.

Not an overwhelming number but not a small number.

The 10 million document set will be collapsed by the CollapsingQParserPlugin. How many distinct top level groups are in the index? It sounds like there might be around 33,333 distinct top level groups if each top level group has 300 docs in it. The CollapsingQParserPlugin will eat that for lunch, very little memory used.
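The arithmetic behind these two figures can be checked with a short sketch. The variable names are only illustrative, and it assumes every top level group holds exactly 300 documents, as in the question.

```python
# Back-of-the-envelope numbers from the discussion above.
groups_per_page = 100
subgroups_per_group = 3
docs_per_subgroup = 100

# Documents the expand step has to handle for one page of collapsed results.
docs_per_group = subgroups_per_group * docs_per_subgroup
docs_to_expand = groups_per_page * docs_per_group

# Distinct top level groups, assuming every group holds docs_per_group documents.
index_size = 10_000_000
distinct_groups = index_size // docs_per_group

print(docs_to_expand, distinct_groups)
```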

joel-bernstein commented 10 years ago

Kranti,

I'll be putting the initial implementation up later today or over the weekend. It doesn't cover sub-grouping yet. So if you want to work on that, that would be excellent. We can collaborate on how to add this to the code.

Joel

krantiparisa commented 10 years ago

How many distinct top level groups are in the index?

Can you help me roughly estimate the memory size and response time? Does this get any cache hits that would make responses faster?

krantiparisa commented 10 years ago

Sure, I can work with you on this. You might need to answer my stupid questions at times :)

joel-bernstein commented 10 years ago

The CollapsingQParserPlugin creates arrays based on the total number of unique values in the field. A rough estimate for 300,000 unique terms in the field would be 3-5 MB of transient memory per query.
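As a rough sanity check on that range, here is a sketch that assumes the plugin allocates a handful of flat arrays with one 4-byte entry per unique value. Both the array count and the entry width are assumptions made for illustration; they are not taken from the plugin's source.

```python
def transient_bytes(unique_values: int, arrays: int, bytes_per_entry: int = 4) -> int:
    """Memory for `arrays` flat arrays holding one entry per unique field value.

    `arrays` and `bytes_per_entry` are hypothetical inputs, not measured values.
    """
    return unique_values * arrays * bytes_per_entry


# Try a couple of assumed array counts for 300,000 unique terms.
for arrays in (3, 4):
    megabytes = transient_bytes(300_000, arrays) / 1_000_000
    print(f"{arrays} arrays of 4-byte entries: {megabytes:.1f} MB")
```

Under these assumptions the memory grows linearly with the number of unique values in the collapse field, not with the number of documents in the index.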

The expanding of groups I haven't measured yet. With such a large page, part of the issue will be retrieving the stored values for all those documents. This can be very expensive.

krantiparisa commented 10 years ago

If we just need docIds at the docList level, that means:

group1=>1234567 (the value of the group field)
  subgroup1=>catalog1 (the value of the sub group field)
    docList=> list of doc ids
  subgroup2=>catalog2 (the value of the sub group field)
    docList=> list of doc ids
group2=>6764237 (the value of the group field)
  subgroup1=>catalog1 (the value of the sub group field)
    docList=> list of doc ids
  subgroup2=>catalog2 (the value of the sub group field)
    docList=> list of doc ids

If we get TopGroups like the above, then the metadata can be based on whatever fields the user wants. I am trying to find out the memory and response times for the above structure from the API call.
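The group / sub group / docList layout described above could be modeled as a nested mapping. This is only a sketch of the data shape, not of how Solr's TopGroups is implemented; `build_top_groups` and the doc ids are hypothetical.

```python
from collections import defaultdict


def build_top_groups(rows):
    """Nest (group_value, subgroup_value, doc_id) tuples as group -> subgroup -> doc ids."""
    nested = defaultdict(lambda: defaultdict(list))
    for group_value, subgroup_value, doc_id in rows:
        nested[group_value][subgroup_value].append(doc_id)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {group: dict(subgroups) for group, subgroups in nested.items()}


# Group and sub group values come from the example above; the doc ids are made up.
rows = [
    ("1234567", "catalog1", 11),
    ("1234567", "catalog1", 12),
    ("1234567", "catalog2", 13),
    ("6764237", "catalog1", 21),
    ("6764237", "catalog2", 22),
]
top_groups = build_top_groups(rows)
print(top_groups)
```

Because only doc ids are stored, any stored fields the user wants can be fetched afterwards for just the ids that are actually displayed.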

krantiparisa commented 10 years ago

Joel,

Is it possible to share the ExpandComponent on Saturday (11 Jan)? I can spend a good amount of time on Sunday and try to get the sub groups working. I also want to run a few performance tests comparing traditional grouping with the new collapsing+expanding implementation in the use cases I described above.

joel-bernstein commented 10 years ago

Just committed the initial implementation of the ExpandComponent to my heliosearch clone, in the expand branch:

https://github.com/joelbernstein2013/heliosearch/tree/expand

Initial patch compiles but has not been tested yet.

VadimKirilchuk commented 10 years ago

I think it's worth pointing to the commit itself: https://github.com/joelbernstein2013/heliosearch/commit/c6db5bc6d368381940e63aaf6e60318ceb9e9a33


krantiparisa commented 10 years ago

Joel,

I deployed your branch code and started Solr with a pre-populated index having 5M+ documents.

Sample Query:

http://localhost:8983/solr/collection1/select?q=relatedAllIds:8118784557012618112 AND showingType:linear&wt=xml&fq={!collapse field=programId min=windowStart}&fl=programId,windowStart&expand=true&expand.field=showingId&expand.limit=5&expand.rows=1&start=0&rows=2&sort=windowStart asc
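The query above contains spaces, braces and colons, which need URL encoding when the request is sent from a client rather than pasted into a browser. A minimal sketch using only the Python standard library, with the parameter names and values copied from the query as written (whether this branch of the component honors each `expand.*` parameter is not confirmed here):

```python
from urllib.parse import urlencode, parse_qs

base_url = "http://localhost:8983/solr/collection1/select"

# Parameters exactly as they appear in the sample query.
params = [
    ("q", "relatedAllIds:8118784557012618112 AND showingType:linear"),
    ("wt", "xml"),
    ("fq", "{!collapse field=programId min=windowStart}"),
    ("fl", "programId,windowStart"),
    ("expand", "true"),
    ("expand.field", "showingId"),
    ("expand.limit", "5"),
    ("expand.rows", "1"),
    ("start", "0"),
    ("rows", "2"),
    ("sort", "windowStart asc"),
]

# urlencode percent-encodes reserved characters and turns spaces into '+'.
query_string = urlencode(params)
url = f"{base_url}?{query_string}"
print(url)
```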

The idea is to get the distinct program ids (collapsing/grouping) and sort them by the windowStart field. Here is the response:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">28</int>
<lst name="params">
<str name="expand.rows">1</str>
<str name="sort">windowStart asc</str>
<str name="fl">programId,windowStart</str>
<str name="expand.limit">5</str>
<str name="start">0</str>
<str name="q">
relatedAllIds:8118784557012618112 AND showingType:linear
</str>
<str name="expand">true</str>
<str name="wt">xml</str>
<str name="fq">{!collapse field=programId min=windowStart}</str>
<str name="rows">2</str>
<str name="expand.field">showingId</str>
</lst>
</lst>
<result name="response" numFound="77" start="0">
<doc>
<long name="programId">8050846173392254112</long>
<long name="windowStart">1389375000000</long>
</doc>
<doc>
<long name="programId">8837586713084788112</long>
<long name="windowStart">1389382200000</long>
</doc>
</result>
<lst name="expanded"/>
</response>

Why is the expanded result empty? My expectation is that, for each programId in the collapsed result, I get the top 5 showings sorted by windowStart. How should I form the query?
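To make that expectation concrete, here is an in-memory sketch of the collapse, page, then expand behavior being asked for. It is a conceptual model only, not the ExpandComponent's implementation. `collapse_and_expand` and the sample documents are hypothetical, and in this sketch each expanded group includes the collapsed document itself.

```python
from collections import defaultdict


def collapse_and_expand(docs, group_field, sort_field, start, rows, expand_rows):
    """Collapse docs to one head per group, take a page, then expand the page's groups."""
    # Collapse: per group, keep the document with the smallest sort_field value.
    heads = {}
    for doc in docs:
        key = doc[group_field]
        if key not in heads or doc[sort_field] < heads[key][sort_field]:
            heads[key] = doc

    # Page: sort the group heads ascending and slice out the requested page.
    page = sorted(heads.values(), key=lambda d: d[sort_field])[start:start + rows]

    # Expand: only for groups on this page, collect their top members.
    page_keys = {doc[group_field] for doc in page}
    members = defaultdict(list)
    for doc in docs:
        if doc[group_field] in page_keys:
            members[doc[group_field]].append(doc)
    expanded = {
        key: sorted(group, key=lambda d: d[sort_field])[:expand_rows]
        for key, group in members.items()
    }
    return page, expanded


# Made-up documents using the field names from the query.
docs = [
    {"programId": "A", "showingId": 1, "windowStart": 30},
    {"programId": "A", "showingId": 2, "windowStart": 10},
    {"programId": "B", "showingId": 3, "windowStart": 20},
    {"programId": "B", "showingId": 4, "windowStart": 40},
    {"programId": "C", "showingId": 5, "windowStart": 50},
]
page, expanded = collapse_and_expand(docs, "programId", "windowStart", 0, 2, 5)
print(page)
print(expanded)
```

Only the groups that appear on the current page are expanded, which is what keeps the expand step bounded by page size rather than by index size.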

yonik commented 10 years ago

Reopening - looks like my merge-up of trunk closed this accidentally.

joel-bernstein commented 10 years ago

Added initial test case:

https://github.com/joelbernstein2013/heliosearch/commit/2fb72783e2094958b7ca7d678efa9011babf00c7

joel-bernstein commented 10 years ago

Added a few more tests to cover the basic functionality.

https://github.com/joelbernstein2013/heliosearch/commit/a4b688a20d81118a8e87e5ba2bff8cc4125ebd71

My plan now is to add the distributed test cases and test it at scale and then I think this is nearing initial release condition.

Kranti has a few more features he'd like to add (group level paging, subgroup support) and we can iterate further on these.

joel-bernstein commented 10 years ago

Added basic distributed test cases. https://github.com/joelbernstein2013/heliosearch/commit/a9e0b4e8e9aabc072177cb3f1b5b363e3619dbfa

Also a small formatting update: https://github.com/joelbernstein2013/heliosearch/commit/c7b61a971e263edcefa5ceabed0b4dd85d8a6214

Also did some performance testing at scale and the Expand component seems to perform at about the same speed as the CollapsingQParserPlugin. So performing a collapse and expand takes about twice as much time as doing only the collapse.