apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.69k stars 1.04k forks source link

EarlyTerminatingSortingCollector support for grouped searches [LUCENE-7341] #8395

Open asfimport opened 8 years ago

asfimport commented 8 years ago

Currently grouped searches must not use the early terminating sorting collector because the wrong results would be returned. This ticket proposes to change the EarlyTerminatingSortingCollector class and probably the LeafCollector interface to support early termination for grouped searches.

Illustration (aaa is an as yet unnamed boolean flag):

# fictitious (sorted by X) segment
| doc key | D0 D1 D2 D3 D4 D5 ... D10 D11 D12 D13 D14 D15 ... D20 D21 D22 D23 D24 D25 ...
| doc grp | G0 G0 G0 G0 G0 G0 ... D10 G10 G10 G10 G10 G10 ... G20 G20 G20 G20 G20 G20 ... 
# query with rows=3 sort=X group=false
| query result | D0 D1 D2

# existing code:
#   use a EarlyTerminatingSortingCollector with numDocsToCollect=3
#   EarlyTerminatingSortingCollector.getLeafCollector returns a LeafCollector
#   whose collect method uses (++numCollected >= numDocsToCollect) as the terminating condition
# query with rows=3 sort=X group=true group.field=grp group.sort=X group.limit=1
| query result | G0(D0) G10(D10) G20(D20)

# existing code:
#   cannot use EarlyTerminatingSortingCollector (query result would wrongly be just 'G0(D0)')
# proposed new code:
#   use a EarlyTerminatingSortingCollector(... numDocsToCollect=3 aaa=true)
# query with rows=3 sort=X group=true group.field=grp group.sort=X group.limit=5
| query result | G0(D0,D1,D2,D3,D4) G10(D10,D11,D12,D13,D14) G20(D20,D21,D22,D23,D24)

# existing code:
#   cannot use EarlyTerminatingSortingCollector (query result would wrongly be just 'G0(D0,D1,D2)')
# proposed new code:
#   use a EarlyTerminatingSortingCollector(... numDocsToCollect=3 aaa=true)

Migrated from LUCENE-7341 by Christine Poerschke (@cpoerschke), 1 vote, updated Jun 17 2016

asfimport commented 8 years ago

Christine Poerschke (@cpoerschke) (migrated from JIRA)

proposed EarlyTerminatingSortingCollector constructor change:

+ final private boolean aaa; // as yet unnamed flag
+ `@Deprecated`
  public EarlyTerminatingSortingCollector(Collector in, Sort sort, int numDocsToCollect) {
-   ...
+   this(in, sort, numDocsToCollect, false);
+ }
+ public EarlyTerminatingSortingCollector(Collector in, Sort sort, int numDocsToCollect, boolean aaa) {
+   ...
    this.aaa = aaa;
  }

proposed EarlyTerminatingSortingCollector method change:

  public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    ...
    if (...) {
      // segment is sorted, can early-terminate
      return new FilterLeafCollector(super.getLeafCollector(context)) {
        private int numCollected;
        `@Override`
        public void collect(int doc) throws IOException {
          super.collect(doc);
-         if (++numCollected >= numDocsToCollect) {
+         if (aaa) {
+           final Boolean zzz = gotAndKeptAtLeast(numDocsToCollect);
+           if (Boolean.TRUE.equals(zzz) {
+             terminatedEarly.set(true);
+             throw new CollectionTerminatedException();+
+           }
+         } else if (++numCollected >= numDocsToCollect) {
            terminatedEarly.set(true);
            throw new CollectionTerminatedException();
          }
        }
      };
    } else {
      return super.getLeafCollector(context);
    }
  }

proposed LeafCollector interface extension:

public interface LeafCollector {
  ...
   // Return null to indicate "don't know".
  Boolean gotAndKeptAtLeast(int numDocs);
  ...
}

outline AbstractFirstPassGroupingCollector change:

public abstract class AbstractFirstPassGroupingCollector ... {
  ...
  Boolean gotAndKeptAtLeast(int numDocs) {
    return (groupMap == null ? null : (groupMap.size() >= numDocs)); 
  }
  ...
}

outline AbstractSecondPassGroupingCollector change:

public abstract class AbstractSecondPassGroupingCollector ... {
  ...
  Boolean gotAndKeptAtLeast(int numDocs) {
    Boolean gak = null;
    for (SearchGroupDocs<GROUP_VALUE_TYPE> groupDocs : groupMap.values()) {
      gak = groupDocs.leafCollector.gotAndKeptAtLeast(maxDocsPerGroup);
      if (!Boolean.TRUE.equals(gak)) {
        return gak;
      }
    }
    return gak;
  }
  ...
}

I think something like this should work, what do you think?

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It's a little spooky we need to make so many API changes to so many collectors ... can we maybe instead just have a custom collector that knows how to early terminate specifically for grouping collectors?