apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.69k stars 1.04k forks source link

Grouping module should allow subclasses to set the group key per document [LUCENE-3099] #4172

Closed asfimport closed 13 years ago

asfimport commented 13 years ago

The new grouping module can only group by a single-valued indexed field.

But, if we make the 'getGroupKey' a method that a subclass could override, then I think we could refactor Solr over to the module, because it could do function queries and normal queries via subclass (I think).

This also makes the impl more extensible to apps that might have their own interesting group values per document.


Migrated from LUCENE-3099 by Michael McCandless (@mikemccand), resolved Jun 03 2011 Attachments: LUCENE-3099.patch (versions: 8) Linked issues:

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

All the current group collectors use currently DocTermsIndex. With function queries I think this can't be used. However changing this requires changing the collectors in many places. So I think the getGroupKey() method isn't enough. Solr uses the MutableValue to abstract away the source of the values. I'm not sure what we can use in Lucene to have this abstraction.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I was thinking we could parameterize the type of the group key for these collectors, ie today it's BytesRef, but Solr would use MutableValue instead, and would subclass to override the getGroupKey (and presumably also setNextReader). Not sure it'll all work though until we try it... I agree there are alot of places that need to be touched.

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

In order to avoid double FC usage we need to let the grouping collectors instantiate the right fieldcache for the right field. The current collectors always create a StringIndex (3.x) or DocTermsIndex (4.0). When you sort on the same field as you group and this field is for example an int. Then a complete new StringIndex / DocTermsIndex is put in the FieldCache.

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

Attached an initial idea of abstracting away the source of the group values.

In the patch there are four concepts:

I think with this infrastructure it is quite straight forward to add the ability to group by a function. The patch contains implementations for all fields (string, int, double etc.).

Some concepts are look like with what is already in Lucene / Solr. For example Lucene's ValueSource and DocValues or Solr's DocValues and MutableValue. I just started from scratch to see what grouping really needs. The Lucene's DocValues for example didn't have all functionality grouping needs.

Furthermore I have added research group collectors. I ported the MatchAllGroupsCollector and FirstPassGroupingCollector to use this new infrastructure. Just to get a feeling of how it all fits together.

I also included a simple runner class that runs these research collectors and compares them to the variants already added to the code base. You can then easily check search times and the group results.

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

Also some initial performance indications. This data was created with the runner provided in the patch. A ran it on a index of 30M documents that has 7502 unique groups.

===== First pass collectors execution =====
number   |collector                               |query                   |time (ms)   |Current mem usage (MB)  
1        |FirstPassGroupingCollector              |country:es              |888         |170                     
2        |ResearchFirstPassGroupingCollector      |country:es              |710         |170                     
3        |FirstPassGroupingCollector              |country:es country:eg   |1448        |170                      
4        |ResearchFirstPassGroupingCollector      |country:es country:eg   |1350        |170                      
5        |FirstPassGroupingCollector              |country:it              |2           |170                    
6        |ResearchFirstPassGroupingCollector      |country:it              |3           |170                     
7        |FirstPassGroupingCollector              |country:gr              |45          |170                      
8        |ResearchFirstPassGroupingCollector      |country:gr              |43          |170                    
9        |FirstPassGroupingCollector              |country:it              |3           |170                    
10       |ResearchFirstPassGroupingCollector      |country:it              |3           |170                    
11       |FirstPassGroupingCollector              |country:it              |2           |170                    
12       |ResearchFirstPassGroupingCollector      |country:it              |3           |170                    
13       |FirstPassGroupingCollector              |country:eg              |67          |170                    
14       |ResearchFirstPassGroupingCollector      |country:eg              |91          |171                     
15       |FirstPassGroupingCollector              |country:es country:eg   |1387        |171                    
16       |ResearchFirstPassGroupingCollector      |country:es country:eg   |1351        |171                     
17       |FirstPassGroupingCollector              |country:e*              |2264        |173                    
18       |ResearchFirstPassGroupingCollector      |country:e*              |1036        |176                     
19       |FirstPassGroupingCollector              |country:it              |2           |176                     
20       |ResearchFirstPassGroupingCollector      |country:it              |3           |176                      
21       |FirstPassGroupingCollector              |*:*                     |526         |176                     
22       |ResearchFirstPassGroupingCollector      |*:*                     |410         |176                     
23       |FirstPassGroupingCollector              |country:gr              |50          |177                      
24       |ResearchFirstPassGroupingCollector      |country:gr              |45          |177                      
25       |FirstPassGroupingCollector              |country:es country:eg   |1397        |177                     
26       |ResearchFirstPassGroupingCollector      |country:es country:eg   |1362        |177                     
27       |FirstPassGroupingCollector              |*:*                     |544         |177                      
28       |ResearchFirstPassGroupingCollector      |*:*                     |411         |177                    
29       |FirstPassGroupingCollector              |country:tr              |54          |177                     
30       |ResearchFirstPassGroupingCollector      |country:tr              |52          |177                      
31       |FirstPassGroupingCollector              |*:*                     |531         |177                    
32       |ResearchFirstPassGroupingCollector      |*:*                     |413         |177                     
33       |FirstPassGroupingCollector              |*:*                     |534         |177                     
34       |ResearchFirstPassGroupingCollector      |*:*                     |416         |177                      
35       |FirstPassGroupingCollector              |country:eg              |67          |177                    
36       |ResearchFirstPassGroupingCollector      |country:eg              |64          |177                    
37       |FirstPassGroupingCollector              |country:gr              |45          |177                     
38       |ResearchFirstPassGroupingCollector      |country:gr              |47          |177                      
39       |FirstPassGroupingCollector              |country:e*              |1109        |181                     
40       |ResearchFirstPassGroupingCollector      |country:e*              |965         |185                      
41       |FirstPassGroupingCollector              |country:us              |56          |185                      
42       |ResearchFirstPassGroupingCollector      |country:us              |62          |185                      
43       |FirstPassGroupingCollector              |country:gr              |44          |185                      
44       |ResearchFirstPassGroupingCollector      |country:gr              |42          |185                      
45       |FirstPassGroupingCollector              |country:es              |872         |185                      
46       |ResearchFirstPassGroupingCollector      |country:es              |723         |185                     
47       |FirstPassGroupingCollector              |country:es              |804         |185                      
48       |ResearchFirstPassGroupingCollector      |country:es              |717         |185                      
49       |FirstPassGroupingCollector              |country:es              |747         |185                      
50       |ResearchFirstPassGroupingCollector      |country:es              |723         |185                      

===== All groups collectors execution =====
number   |collector                     |query                   |time (ms)   |count       |Current mem usage (MB)
1        |AllGroupsCollector            |country:us              |54          |11          |185         
2        |ResearchAllGroupsCollector    |country:us              |53          |11          |169         
3        |AllGroupsCollector            |country:it              |2           |9           |169         
4        |ResearchAllGroupsCollector    |country:it              |2           |9           |169         
5        |AllGroupsCollector            |country:tr              |64          |1700        |170         
6        |ResearchAllGroupsCollector    |country:tr              |67          |1700        |170         
7        |AllGroupsCollector            |*:*                     |454         |7502        |170         
8        |ResearchAllGroupsCollector    |*:*                     |471         |7502        |170         
9        |AllGroupsCollector            |country:tr              |51          |1700        |171         
10       |ResearchAllGroupsCollector    |country:tr              |61          |1700        |171         
11       |AllGroupsCollector            |country:es              |717         |3012        |171         
12       |ResearchAllGroupsCollector    |country:es              |755         |3012        |171         
13       |AllGroupsCollector            |country:es              |691         |3012        |172         
14       |ResearchAllGroupsCollector    |country:es              |765         |3012        |172         
15       |AllGroupsCollector            |*:*                     |411         |7502        |172         
16       |ResearchAllGroupsCollector    |*:*                     |462         |7502        |173         
17       |AllGroupsCollector            |country:es              |685         |3012        |173         
18       |ResearchAllGroupsCollector    |country:es              |743         |3012        |173         
19       |AllGroupsCollector            |country:eg              |57          |646         |173         
20       |ResearchAllGroupsCollector    |country:eg              |63          |646         |173         
21       |AllGroupsCollector            |country:tr              |51          |1700        |173         
22       |ResearchAllGroupsCollector    |country:tr              |54          |1700        |174         
23       |AllGroupsCollector            |country:it              |2           |9           |174         
24       |ResearchAllGroupsCollector    |country:it              |3           |9           |174         
25       |AllGroupsCollector            |*:*                     |433         |7502        |174         
26       |ResearchAllGroupsCollector    |*:*                     |709         |7502        |175         
27       |AllGroupsCollector            |country:tr              |50          |1700        |175         
28       |ResearchAllGroupsCollector    |country:tr              |59          |1700        |175         
29       |AllGroupsCollector            |country:gr              |44          |2061        |175         
30       |ResearchAllGroupsCollector    |country:gr              |43          |2061        |175         
31       |AllGroupsCollector            |country:us              |17          |11          |176         
32       |ResearchAllGroupsCollector    |country:us              |27          |11          |176         
33       |AllGroupsCollector            |country:tr              |55          |1700        |176         
34       |ResearchAllGroupsCollector    |country:tr              |55          |1700        |176         
35       |AllGroupsCollector            |country:e*              |941         |3658        |180         
36       |ResearchAllGroupsCollector    |country:e*              |1087        |3658        |184         
37       |AllGroupsCollector            |country:es              |706         |3012        |184         
38       |ResearchAllGroupsCollector    |country:es              |747         |3012        |184         
39       |AllGroupsCollector            |country:eg              |57          |646         |185         
40       |ResearchAllGroupsCollector    |country:eg              |61          |646         |185         
41       |AllGroupsCollector            |country:es country:eg   |1507        |3658        |170         
42       |ResearchAllGroupsCollector    |country:es country:eg   |1512        |3658        |171         
43       |AllGroupsCollector            |country:gr              |38          |2061        |171         
44       |ResearchAllGroupsCollector    |country:gr              |41          |2061        |172         
45       |AllGroupsCollector            |country:us              |17          |11          |172         
46       |ResearchAllGroupsCollector    |country:us              |19          |11          |172         
47       |AllGroupsCollector            |country:es              |708         |3012        |172         
48       |ResearchAllGroupsCollector    |country:es              |742         |3012        |172         
49       |AllGroupsCollector            |country:us              |17          |11          |172         
50       |ResearchAllGroupsCollector    |country:us              |19          |11          |173         
asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch, with a possible more minimal approach to enabling Solr trunk to cutover to the grouping module.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

That patch is totally untested, and has at least 2 generics warnings! But hopefully the approach can work...

Basically the idea of the 2nd patch is to just make minimal extensions points to the current grouping collectors, so that Solr could subclass these and use its MutableValue/DocValues to manage the group keys. I think this would then mean Solr trunk and Solr 3.x could fully cutover and not lose any functionality (and gain the benefits of the grouping module).

Separately, we should merge the cool GroupValue, GroupValueSource, GroupHolder, etc., from the first patch here, with Solr's equivalents, factored out I think into a shared "common" module that the FQ module (#3957) can also use.

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

Attached a new patch that is based on Mike's patch.

Only the TermSecondPassGroupingCollector didn't work. The size groupDocs array was too small.

I think the following things need to be done:

Should we rename all abstract collectors to Abstract*? To make it clear that these classes are abstract.

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

Separately, we should merge the cool GroupValue, GroupValueSource, GroupHolder, etc., from the first patch here, with Solr's equivalents, factored out I think into a shared "common" module that the FQ module (#3957) can also use.

+1! We need to find out what fq and grouping really needs under the hood. Performance can not be harmed by moving this into common module. You think that we already should open a new issue for this?

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

Attached an updated patch.

I'm currently busy with integrating the grouping module in trunk Solr. I noticed that FirstPassGroupingCollector and SecondPassGroupingCollector still has groupField as field and constructor argument. So I moved this to TermsFirstPassGroupingCollector and TermSecondPassGroupingCollector. Also made a small change in GroupDocs regarding generics.

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

I think the SecondPassGroupingCollector class needs the following two methods instead of getDocSlot:

  public void collect(int doc) throws IOException {
    totalHitCount++;
    if (gatherGroupedDocs(doc)) {
      retrieveGroup(doc).collector.collect(doc);
    }
  }

  protected abstract boolean gatherGroupedDocs(int doc) throws IOException;

  protected abstract SearchGroupDocs<GROUP_VALUE_TYPE> retrieveGroup(int doc) throws IOException;

When working with fqs the slot is not practical, since there is no ords (like DocTermsIndex). This setup works nicely in both impls. Downside is that for the Terms impl the ord key has to be looked up twice when gatherGroupedDocs(..) returns true.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Should we rename all abstract collectors to Abstract*? To make it clear that these classes are abstract.

+1

You think that we already should open a new issue for this?

I think we can take this up under #3957? There's already an initial patch there, starting to factor out Mutable*...

Should we name it TermAllGroupsCollector (instead of Terms...)? Or, fix the others to be TermsFirst/SecondPassCollector?

I noticed that FirstPassGroupingCollector and SecondPassGroupingCollector still has groupField as field and constructor argument. So I moved this to TermsFirstPassGroupingCollector and TermSecondPassGroupingCollector

Ahh, excellent – these abstract base classes don't need to know it's a feild we are grouping on.

When working with fqs the slot is not practical, since there is no ords (like DocTermsIndex).

OK I agree.

Maybe, instead, we can have only retrieveDocGroup? And if it returns null that means the group for this doc isn't being collected? Then we don't double the ord lookup for TermSecondPass.

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

I think we can take this up under #3957?

I thought that 2883 was more about integrating fq, not a general module for value source and docvalues that both grouping and fq can share. But I might have been mistaken.

Should we name it TermAllGroupsCollector

+1 I'll rename it

Maybe, instead, we can have only retrieveDocGroup? And if it returns null that means the group for this doc isn't being collected? Then we don't double the ord lookup for TermSecondPass.

That seems reasonable to me. That also saves a key lookup for the TermSecondPassGroupingCollector.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I thought that 2883 was more about integrating fq, not a general module for value source and docvalues that both grouping and fq can share. But I might have been mistaken.

Right #3957 is about FQ, but I'm thinking we should simply "birth" the commons module under LUCEEN-2883, meeting only the needs of FQ, initially.

After FQ is relatively done/stabilized, then we should look into grouping offering collectors for all filed types, using the infrastructure from that common module, and extending the common module at that time if there are needs of grouping not yet met, ie merging in the GroupValue, GroupValueSource, GroupHolder from this patch.

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

After FQ is relatively done/stabilized, then we should look into grouping offering collectors for all filed types, using the infrastructure from that common module, and extending the common module at that time if there are needs of grouping not yet met, ie merging in the GroupValue, GroupValueSource, GroupHolder from this patch.

That makes sense to me.

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

Attached a new patch with the discussed changes. The AbstractMatchAllGroupCollector changed a lot. Most code is pushed to implementation classes. During development of fq impl I noticed that the abstraction was still too specific for terms impl.

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

Previous patch was wrong. Here a new one.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think we are close here! It's great to see you're able to cutover Solr trunk to the grouping module based on this.

Random things:

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

Attached an updated version of the patch with the Mike's points. When building the module no generic warnings occur any more in the ant build. I also updated some documentation.

For AbstractAllGroupsCollector, can we impl the getGroupCount in the base class as getGroups.size()?

Certainly! Much better. Since the method is syntactical sugar anyway.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks great – I think it's ready to commit!

Looks like #4170 snuck into the patch (AllGroupHeadsCollector).

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

Looks like #4170 snuck into the patch (AllGroupHeadsCollector).

oops... Yes it did. My change lists collide. Maybe it is time use git...

asfimport commented 13 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

Updated patch to trunk. Previous patch can't be applied on the current trunk without issues.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Martijn!

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

bulk close for 3.3