Closed asfimport closed 13 years ago
Martijn van Groningen (@martijnvg) (migrated from JIRA)
All the current group collectors use currently DocTermsIndex. With function queries I think this can't be used. However changing this requires changing the collectors in many places. So I think the getGroupKey() method isn't enough. Solr uses the MutableValue to abstract away the source of the values. I'm not sure what we can use in Lucene to have this abstraction.
Michael McCandless (@mikemccand) (migrated from JIRA)
I was thinking we could parameterize the type of the group key for these collectors, ie today it's BytesRef, but Solr would use MutableValue instead, and would subclass to override the getGroupKey (and presumably also setNextReader). Not sure it'll all work though until we try it... I agree there are alot of places that need to be touched.
Martijn van Groningen (@martijnvg) (migrated from JIRA)
In order to avoid double FC usage we need to let the grouping collectors instantiate the right fieldcache for the right field. The current collectors always create a StringIndex (3.x) or DocTermsIndex (4.0). When you sort on the same field as you group and this field is for example an int. Then a complete new StringIndex / DocTermsIndex is put in the FieldCache.
Martijn van Groningen (@martijnvg) (migrated from JIRA)
Attached an initial idea of abstracting away the source of the group values.
In the patch there are four concepts:
I think with this infrastructure it is quite straight forward to add the ability to group by a function. The patch contains implementations for all fields (string, int, double etc.).
Some concepts are look like with what is already in Lucene / Solr. For example Lucene's ValueSource and DocValues or Solr's DocValues and MutableValue. I just started from scratch to see what grouping really needs. The Lucene's DocValues for example didn't have all functionality grouping needs.
Furthermore I have added research group collectors. I ported the MatchAllGroupsCollector and FirstPassGroupingCollector to use this new infrastructure. Just to get a feeling of how it all fits together.
I also included a simple runner class that runs these research collectors and compares them to the variants already added to the code base. You can then easily check search times and the group results.
Martijn van Groningen (@martijnvg) (migrated from JIRA)
Also some initial performance indications. This data was created with the runner provided in the patch. A ran it on a index of 30M documents that has 7502 unique groups.
===== First pass collectors execution =====
number |collector |query |time (ms) |Current mem usage (MB)
1 |FirstPassGroupingCollector |country:es |888 |170
2 |ResearchFirstPassGroupingCollector |country:es |710 |170
3 |FirstPassGroupingCollector |country:es country:eg |1448 |170
4 |ResearchFirstPassGroupingCollector |country:es country:eg |1350 |170
5 |FirstPassGroupingCollector |country:it |2 |170
6 |ResearchFirstPassGroupingCollector |country:it |3 |170
7 |FirstPassGroupingCollector |country:gr |45 |170
8 |ResearchFirstPassGroupingCollector |country:gr |43 |170
9 |FirstPassGroupingCollector |country:it |3 |170
10 |ResearchFirstPassGroupingCollector |country:it |3 |170
11 |FirstPassGroupingCollector |country:it |2 |170
12 |ResearchFirstPassGroupingCollector |country:it |3 |170
13 |FirstPassGroupingCollector |country:eg |67 |170
14 |ResearchFirstPassGroupingCollector |country:eg |91 |171
15 |FirstPassGroupingCollector |country:es country:eg |1387 |171
16 |ResearchFirstPassGroupingCollector |country:es country:eg |1351 |171
17 |FirstPassGroupingCollector |country:e* |2264 |173
18 |ResearchFirstPassGroupingCollector |country:e* |1036 |176
19 |FirstPassGroupingCollector |country:it |2 |176
20 |ResearchFirstPassGroupingCollector |country:it |3 |176
21 |FirstPassGroupingCollector |*:* |526 |176
22 |ResearchFirstPassGroupingCollector |*:* |410 |176
23 |FirstPassGroupingCollector |country:gr |50 |177
24 |ResearchFirstPassGroupingCollector |country:gr |45 |177
25 |FirstPassGroupingCollector |country:es country:eg |1397 |177
26 |ResearchFirstPassGroupingCollector |country:es country:eg |1362 |177
27 |FirstPassGroupingCollector |*:* |544 |177
28 |ResearchFirstPassGroupingCollector |*:* |411 |177
29 |FirstPassGroupingCollector |country:tr |54 |177
30 |ResearchFirstPassGroupingCollector |country:tr |52 |177
31 |FirstPassGroupingCollector |*:* |531 |177
32 |ResearchFirstPassGroupingCollector |*:* |413 |177
33 |FirstPassGroupingCollector |*:* |534 |177
34 |ResearchFirstPassGroupingCollector |*:* |416 |177
35 |FirstPassGroupingCollector |country:eg |67 |177
36 |ResearchFirstPassGroupingCollector |country:eg |64 |177
37 |FirstPassGroupingCollector |country:gr |45 |177
38 |ResearchFirstPassGroupingCollector |country:gr |47 |177
39 |FirstPassGroupingCollector |country:e* |1109 |181
40 |ResearchFirstPassGroupingCollector |country:e* |965 |185
41 |FirstPassGroupingCollector |country:us |56 |185
42 |ResearchFirstPassGroupingCollector |country:us |62 |185
43 |FirstPassGroupingCollector |country:gr |44 |185
44 |ResearchFirstPassGroupingCollector |country:gr |42 |185
45 |FirstPassGroupingCollector |country:es |872 |185
46 |ResearchFirstPassGroupingCollector |country:es |723 |185
47 |FirstPassGroupingCollector |country:es |804 |185
48 |ResearchFirstPassGroupingCollector |country:es |717 |185
49 |FirstPassGroupingCollector |country:es |747 |185
50 |ResearchFirstPassGroupingCollector |country:es |723 |185
===== All groups collectors execution =====
number |collector |query |time (ms) |count |Current mem usage (MB)
1 |AllGroupsCollector |country:us |54 |11 |185
2 |ResearchAllGroupsCollector |country:us |53 |11 |169
3 |AllGroupsCollector |country:it |2 |9 |169
4 |ResearchAllGroupsCollector |country:it |2 |9 |169
5 |AllGroupsCollector |country:tr |64 |1700 |170
6 |ResearchAllGroupsCollector |country:tr |67 |1700 |170
7 |AllGroupsCollector |*:* |454 |7502 |170
8 |ResearchAllGroupsCollector |*:* |471 |7502 |170
9 |AllGroupsCollector |country:tr |51 |1700 |171
10 |ResearchAllGroupsCollector |country:tr |61 |1700 |171
11 |AllGroupsCollector |country:es |717 |3012 |171
12 |ResearchAllGroupsCollector |country:es |755 |3012 |171
13 |AllGroupsCollector |country:es |691 |3012 |172
14 |ResearchAllGroupsCollector |country:es |765 |3012 |172
15 |AllGroupsCollector |*:* |411 |7502 |172
16 |ResearchAllGroupsCollector |*:* |462 |7502 |173
17 |AllGroupsCollector |country:es |685 |3012 |173
18 |ResearchAllGroupsCollector |country:es |743 |3012 |173
19 |AllGroupsCollector |country:eg |57 |646 |173
20 |ResearchAllGroupsCollector |country:eg |63 |646 |173
21 |AllGroupsCollector |country:tr |51 |1700 |173
22 |ResearchAllGroupsCollector |country:tr |54 |1700 |174
23 |AllGroupsCollector |country:it |2 |9 |174
24 |ResearchAllGroupsCollector |country:it |3 |9 |174
25 |AllGroupsCollector |*:* |433 |7502 |174
26 |ResearchAllGroupsCollector |*:* |709 |7502 |175
27 |AllGroupsCollector |country:tr |50 |1700 |175
28 |ResearchAllGroupsCollector |country:tr |59 |1700 |175
29 |AllGroupsCollector |country:gr |44 |2061 |175
30 |ResearchAllGroupsCollector |country:gr |43 |2061 |175
31 |AllGroupsCollector |country:us |17 |11 |176
32 |ResearchAllGroupsCollector |country:us |27 |11 |176
33 |AllGroupsCollector |country:tr |55 |1700 |176
34 |ResearchAllGroupsCollector |country:tr |55 |1700 |176
35 |AllGroupsCollector |country:e* |941 |3658 |180
36 |ResearchAllGroupsCollector |country:e* |1087 |3658 |184
37 |AllGroupsCollector |country:es |706 |3012 |184
38 |ResearchAllGroupsCollector |country:es |747 |3012 |184
39 |AllGroupsCollector |country:eg |57 |646 |185
40 |ResearchAllGroupsCollector |country:eg |61 |646 |185
41 |AllGroupsCollector |country:es country:eg |1507 |3658 |170
42 |ResearchAllGroupsCollector |country:es country:eg |1512 |3658 |171
43 |AllGroupsCollector |country:gr |38 |2061 |171
44 |ResearchAllGroupsCollector |country:gr |41 |2061 |172
45 |AllGroupsCollector |country:us |17 |11 |172
46 |ResearchAllGroupsCollector |country:us |19 |11 |172
47 |AllGroupsCollector |country:es |708 |3012 |172
48 |ResearchAllGroupsCollector |country:es |742 |3012 |172
49 |AllGroupsCollector |country:us |17 |11 |172
50 |ResearchAllGroupsCollector |country:us |19 |11 |173
Michael McCandless (@mikemccand) (migrated from JIRA)
Attached patch, with a possible more minimal approach to enabling Solr trunk to cutover to the grouping module.
Michael McCandless (@mikemccand) (migrated from JIRA)
That patch is totally untested, and has at least 2 generics warnings! But hopefully the approach can work...
Basically the idea of the 2nd patch is to just make minimal extensions points to the current grouping collectors, so that Solr could subclass these and use its MutableValue/DocValues to manage the group keys. I think this would then mean Solr trunk and Solr 3.x could fully cutover and not lose any functionality (and gain the benefits of the grouping module).
Separately, we should merge the cool GroupValue, GroupValueSource, GroupHolder, etc., from the first patch here, with Solr's equivalents, factored out I think into a shared "common" module that the FQ module (#3957) can also use.
Martijn van Groningen (@martijnvg) (migrated from JIRA)
Attached a new patch that is based on Mike's patch.
Only the TermSecondPassGroupingCollector didn't work. The size groupDocs array was too small.
I think the following things need to be done:
Should we rename all abstract collectors to Abstract*? To make it clear that these classes are abstract.
Martijn van Groningen (@martijnvg) (migrated from JIRA)
Separately, we should merge the cool GroupValue, GroupValueSource, GroupHolder, etc., from the first patch here, with Solr's equivalents, factored out I think into a shared "common" module that the FQ module (#3957) can also use.
+1! We need to find out what fq and grouping really needs under the hood. Performance can not be harmed by moving this into common module. You think that we already should open a new issue for this?
Martijn van Groningen (@martijnvg) (migrated from JIRA)
Attached an updated patch.
I'm currently busy with integrating the grouping module in trunk Solr. I noticed that FirstPassGroupingCollector and SecondPassGroupingCollector still has groupField as field and constructor argument. So I moved this to TermsFirstPassGroupingCollector and TermSecondPassGroupingCollector. Also made a small change in GroupDocs regarding generics.
Martijn van Groningen (@martijnvg) (migrated from JIRA)
I think the SecondPassGroupingCollector class needs the following two methods instead of getDocSlot:
public void collect(int doc) throws IOException {
totalHitCount++;
if (gatherGroupedDocs(doc)) {
retrieveGroup(doc).collector.collect(doc);
}
}
protected abstract boolean gatherGroupedDocs(int doc) throws IOException;
protected abstract SearchGroupDocs<GROUP_VALUE_TYPE> retrieveGroup(int doc) throws IOException;
When working with fqs the slot is not practical, since there is no ords (like DocTermsIndex). This setup works nicely in both impls. Downside is that for the Terms impl the ord key has to be looked up twice when gatherGroupedDocs(..) returns true.
Michael McCandless (@mikemccand) (migrated from JIRA)
Should we rename all abstract collectors to Abstract*? To make it clear that these classes are abstract.
+1
You think that we already should open a new issue for this?
I think we can take this up under #3957? There's already an initial patch there, starting to factor out Mutable*...
Should we name it TermAllGroupsCollector (instead of Terms...)? Or, fix the others to be TermsFirst/SecondPassCollector?
I noticed that FirstPassGroupingCollector and SecondPassGroupingCollector still has groupField as field and constructor argument. So I moved this to TermsFirstPassGroupingCollector and TermSecondPassGroupingCollector
Ahh, excellent – these abstract base classes don't need to know it's a feild we are grouping on.
When working with fqs the slot is not practical, since there is no ords (like DocTermsIndex).
OK I agree.
Maybe, instead, we can have only retrieveDocGroup? And if it returns null that means the group for this doc isn't being collected? Then we don't double the ord lookup for TermSecondPass.
Martijn van Groningen (@martijnvg) (migrated from JIRA)
I think we can take this up under #3957?
I thought that 2883 was more about integrating fq, not a general module for value source and docvalues that both grouping and fq can share. But I might have been mistaken.
Should we name it TermAllGroupsCollector
+1 I'll rename it
Maybe, instead, we can have only retrieveDocGroup? And if it returns null that means the group for this doc isn't being collected? Then we don't double the ord lookup for TermSecondPass.
That seems reasonable to me. That also saves a key lookup for the TermSecondPassGroupingCollector.
Michael McCandless (@mikemccand) (migrated from JIRA)
I thought that 2883 was more about integrating fq, not a general module for value source and docvalues that both grouping and fq can share. But I might have been mistaken.
Right #3957 is about FQ, but I'm thinking we should simply "birth" the commons module under LUCEEN-2883, meeting only the needs of FQ, initially.
After FQ is relatively done/stabilized, then we should look into grouping offering collectors for all filed types, using the infrastructure from that common module, and extending the common module at that time if there are needs of grouping not yet met, ie merging in the GroupValue, GroupValueSource, GroupHolder from this patch.
Martijn van Groningen (@martijnvg) (migrated from JIRA)
After FQ is relatively done/stabilized, then we should look into grouping offering collectors for all filed types, using the infrastructure from that common module, and extending the common module at that time if there are needs of grouping not yet met, ie merging in the GroupValue, GroupValueSource, GroupHolder from this patch.
That makes sense to me.
Martijn van Groningen (@martijnvg) (migrated from JIRA)
Attached a new patch with the discussed changes. The AbstractMatchAllGroupCollector changed a lot. Most code is pushed to implementation classes. During development of fq impl I noticed that the abstraction was still too specific for terms impl.
Martijn van Groningen (@martijnvg) (migrated from JIRA)
Previous patch was wrong. Here a new one.
Michael McCandless (@mikemccand) (migrated from JIRA)
I think we are close here! It's great to see you're able to cutover Solr trunk to the grouping module based on this.
Random things:
I think we can in fact use Map<...> instead of HashMap<...> in 2nd pass abstract collector?
Can you add @SuppressWarnings("unchecked") for the generic array creations?
Can you fix the other generics warnings? Eg the copy-ctor in TopGroups, and TestGrouping has a few warnings. ("ant clean compile-test" should show all these warnings).
The class in AllGroupsCollectorTest needs to be renamed
OK, let's leave groupDocs as protected in the 2nd pass collector (remove the nocommit/your response).
For AbstractAllGroupsCollector, can we impl the getGroupCount in the base class as getGroups.size()?
Martijn van Groningen (@martijnvg) (migrated from JIRA)
Attached an updated version of the patch with the Mike's points. When building the module no generic warnings occur any more in the ant build. I also updated some documentation.
For AbstractAllGroupsCollector, can we impl the getGroupCount in the base class as getGroups.size()?
Certainly! Much better. Since the method is syntactical sugar anyway.
Michael McCandless (@mikemccand) (migrated from JIRA)
Patch looks great – I think it's ready to commit!
Looks like #4170 snuck into the patch (AllGroupHeadsCollector).
Martijn van Groningen (@martijnvg) (migrated from JIRA)
Looks like #4170 snuck into the patch (AllGroupHeadsCollector).
oops... Yes it did. My change lists collide. Maybe it is time use git...
Martijn van Groningen (@martijnvg) (migrated from JIRA)
Updated patch to trunk. Previous patch can't be applied on the current trunk without issues.
Michael McCandless (@mikemccand) (migrated from JIRA)
Thanks Martijn!
Robert Muir (@rmuir) (migrated from JIRA)
bulk close for 3.3
The new grouping module can only group by a single-valued indexed field.
But, if we make the 'getGroupKey' a method that a subclass could override, then I think we could refactor Solr over to the module, because it could do function queries and normal queries via subclass (I think).
This also makes the impl more extensible to apps that might have their own interesting group values per document.
Migrated from LUCENE-3099 by Michael McCandless (@mikemccand), resolved Jun 03 2011 Attachments: LUCENE-3099.patch (versions: 8) Linked issues: