Paging with SortingMergePolicy and EarlyTerminatingSortingCollector [LUCENE-7255] #8310

Open asfimport opened 8 years ago

asfimport commented 8 years ago

EarlyTerminatingSortingCollector doesn't seem to work when used with a TopDocsCollector searching for documents after a certain FieldDoc; that is, it can't be used for paging. The following code reproduces the problem:

// Sort to be used both with merge policy and queries
Sort sort = new Sort(new SortedNumericSortField(FIELD_NAME, SortField.Type.INT));

// Create directory
RAMDirectory directory = new RAMDirectory();

// Setup merge policy
TieredMergePolicy tieredMergePolicy = new TieredMergePolicy();
SortingMergePolicy sortingMergePolicy = new SortingMergePolicy(tieredMergePolicy, sort);

// Setup index writer
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new SimpleAnalyzer());
indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
indexWriterConfig.setMergePolicy(sortingMergePolicy);
IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

// Index values
for (int i = 1; i <= 1000; i++) {
    Document document = new Document();
    document.add(new NumericDocValuesField(FIELD_NAME, i));
    indexWriter.addDocument(document);
}

// Force index merge to ensure early termination
indexWriter.forceMerge(1, true);
indexWriter.commit();

// Create index searcher
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);

// Paginated read
int pageSize = 10;
FieldDoc pageStart = null;
while (true) {

    logger.info("Collecting page starting at: {}", pageStart);

    Query query = new MatchAllDocsQuery();

    TopDocsCollector tfc = TopFieldCollector.create(sort, pageSize, pageStart, true, false, false);
    EarlyTerminatingSortingCollector collector = new EarlyTerminatingSortingCollector(tfc, sort, pageSize, sort);
    searcher.search(query, collector);
    ScoreDoc[] scoreDocs = tfc.topDocs().scoreDocs;
    for (ScoreDoc scoreDoc : scoreDocs) {
        pageStart = (FieldDoc) scoreDoc;
        logger.info("FOUND {}", scoreDoc);
    }

    logger.info("Terminated early: {}", collector.terminatedEarly());

    if (scoreDocs.length < pageSize) break;
}

// Close
reader.close();
indexWriter.close();
directory.close();

The query for the second page doesn't return any results. However, we get the expected results if we don't wrap the TopFieldCollector with the EarlyTerminatingSortingCollector.


Migrated from LUCENE-7255 by Andres de la Peña, 1 vote, updated Apr 28 2016. Attachments: LUCENE-7255_v0.diff

asfimport commented 8 years ago

Christine Poerschke (@cpoerschke) (migrated from JIRA)

Hi Andres de la Peña - thanks for the report and code to reproduce.

I haven't yet tried adding your code to the existing TestEarlyTerminatingSortingCollector tests, but from just reading it I think a higher numDocsToCollect argument might need to be passed to the EarlyTerminatingSortingCollector constructor, i.e.

- EarlyTerminatingSortingCollector collector = new EarlyTerminatingSortingCollector(tfc, sort, pageSize, sort);
+ EarlyTerminatingSortingCollector collector = new EarlyTerminatingSortingCollector(tfc, sort, numToSkip+pageSize, sort);

Is that something you could try out when you have a few minutes? I think this could make a good additional TestEarlyTerminatingSortingCollector test.

asfimport commented 8 years ago

Andres de la Peña (migrated from JIRA)

Hi @cpoerschke,

You are totally right, the test works perfectly when passing the total number of hits:

// Sort to be used both with merge policy and queries
Sort sort = new Sort(new SortedNumericSortField(FIELD_NAME, SortField.Type.INT));

// Create directory
RAMDirectory directory = new RAMDirectory();

// Setup merge policy
TieredMergePolicy tieredMergePolicy = new TieredMergePolicy();
SortingMergePolicy sortingMergePolicy = new SortingMergePolicy(tieredMergePolicy, sort);

// Setup index writer
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new SimpleAnalyzer());
indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
indexWriterConfig.setMergePolicy(sortingMergePolicy);
IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

// Index values
for (int i = 1; i <= 1000; i++) {
    Document document = new Document();
    document.add(new NumericDocValuesField(FIELD_NAME, i));
    indexWriter.addDocument(document);
}

// Force index merge to ensure early termination
indexWriter.forceMerge(1, true);
indexWriter.commit();

// Create index searcher
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);

// Initialize paging state
int numFound = 0;
FieldDoc pageStart = null;

// Paginated read
while (true) {

    logger.info("Collecting page starting at: {}", pageStart);

    Query query = new MatchAllDocsQuery();

    TopDocsCollector tfc = TopFieldCollector.create(sort, PAGE_SIZE, pageStart, true, false, false);
    Collector collector = new EarlyTerminatingSortingCollector(tfc, sort, numFound + PAGE_SIZE, sort);
    searcher.search(query, collector);
    ScoreDoc[] scoreDocs = tfc.topDocs().scoreDocs;
    for (ScoreDoc scoreDoc : scoreDocs) {

        logger.info("FOUND {}", scoreDoc);

        // Update paging state
        pageStart = (FieldDoc) scoreDoc;
        numFound++;
    }

    if (scoreDocs.length < PAGE_SIZE) break;
}

// Close
reader.close();
indexWriter.close();
directory.close();

I totally misunderstood the numDocsToCollect argument in the EarlyTerminatingSortingCollector constructor :(

So the deep paging state of applications using a sorted index must consist not only of the last FieldDoc but also of the number of already read documents. I guess this is something to be taken into account by #7824; probably IndexSearcher.searchAfter(ScoreDoc after, Query query, int n, Sort sort) should become IndexSearcher.searchAfter(ScoreDoc after, int numToSkip, Query query, int n, Sort sort), or something similar, don't you think? Do you think it would be possible to directly pass the page size to the EarlyTerminatingSortingCollector instead of the number of documents to collect?
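
For illustration, the kind of overload being suggested might look roughly like the sketch below (hypothetical: the helper name searchAfterSorted and its parameters are made up here, not an existing IndexSearcher method); it simply packages the numToSkip + pageSize bookkeeping from the code above:

static TopDocs searchAfterSorted(IndexSearcher searcher, FieldDoc after, int numToSkip,
                                 Query query, int n, Sort sort, Sort mergePolicySort) throws IOException {
    // Hypothetical convenience: wrap the usual TopFieldCollector in an
    // EarlyTerminatingSortingCollector sized numToSkip + n, i.e. the bookkeeping
    // the caller otherwise has to do by hand across pages.
    TopFieldCollector tfc = TopFieldCollector.create(sort, n, after, true, false, false);
    Collector collector = new EarlyTerminatingSortingCollector(tfc, sort, numToSkip + n, mergePolicySort);
    searcher.search(query, collector);
    return tfc.topDocs();
}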

I will be happy to add the paging test to TestEarlyTerminatingSortingCollector, if you want.

Thanks for your help.

asfimport commented 8 years ago

Christine Poerschke (@cpoerschke) (migrated from JIRA)

Do you think it would be possible to directly pass the page size to the EarlyTerminatingSortingCollector instead of the number of documents to collect?

Do you mean something like this, i.e. an additional convenience constructor?

- public EarlyTerminatingSortingCollector(Collector in, Sort sort, int numDocsToCollect, Sort mergePolicySort) {
-   ...
- }

+ public EarlyTerminatingSortingCollector(Collector in, Sort sort, int numDocsToCollect, Sort mergePolicySort) {
+   this(in, sort, 0, numDocsToCollect, mergePolicySort);
+ }
+
+ public EarlyTerminatingSortingCollector(Collector in, Sort sort, int numToSkip, int numWanted, Sort mergePolicySort) {
+   ...
+ }

That sounds good to me. @shaie, @rmuir, @jpountz - what do you think?

I will be happy to add the paging test to TestEarlyTerminatingSortingCollector, if you want.

+1 for a paging test.

asfimport commented 8 years ago

Andres de la Peña (migrated from JIRA)

I was thinking of being able to build an EarlyTerminatingSortingCollector just with the number of wanted documents, either by changing the meaning of the argument:

- public EarlyTerminatingSortingCollector(Collector in, Sort sort, int numDocsToCollect, Sort mergePolicySort) {
-   ...
- }

+ public EarlyTerminatingSortingCollector(Collector in, Sort sort, int numWanted, Sort mergePolicySort) {
+   ...
+ }

or maybe by adding a new method:

+ public static EarlyTerminatingSortingCollector buildWithWanted(Collector in, Sort sort, int numWanted, Sort mergePolicySort) {
+   ...
+ }

This way, if it is possible, the paging state managed by users would consist only of the last FieldDoc, as with other collectors. Otherwise, if I'm right, the paging state for users of sorted indexes has to consist of both the last FieldDoc and the number of already collected documents, and a hypothetical IndexSearcher aware of index sorting, such as the one proposed by #7824, would have to change its searchAfter method to require the number of documents to skip.

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I like the API better as it is today since it is more explicit about how it works. I suspect there might be some confusion about how searchAfter works in the non-sorted case: even though the last competitive document is provided, the query still needs to visit all matches. The collector just ignores documents that compare better than after, since they were already returned on a previous page. Compared to regular pagination this is better, since we can use a priority queue of size size (the page size) rather than from+size, but in both cases all matching documents are collected.
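
For comparison, plain searchAfter paging (the standard IndexSearcher.searchAfter(ScoreDoc, Query, int, Sort) variant, sketched here with the names from the code above) only needs the last FieldDoc carried between pages, even though every page still visits all matches:

// The only cross-page state is the last FieldDoc; each page still visits all matches,
// but the priority queue stays at pageSize instead of growing with the page offset.
FieldDoc after = null;
while (true) {
    TopDocs page = searcher.searchAfter(after, new MatchAllDocsQuery(), pageSize, sort);
    for (ScoreDoc scoreDoc : page.scoreDocs) {
        after = (FieldDoc) scoreDoc;
        logger.info("FOUND {}", scoreDoc);
    }
    if (page.scoreDocs.length < pageSize) break;
}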

We could make pagination work better in the case of sorted segments by tracking the last competitive document per segment rather than at the index level. This way, on each sorted segment, we could directly jump to the next competitive document, so the collector would actually only collect numWanted documents rather than numToSkip+numWanted. This would require a custom collector however.
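
To make that concrete, a very rough sketch of such a custom collector (hypothetical: the class name and the resumeAfter bookkeeping are invented here, and a real implementation would also need to verify per segment that it is actually sorted by the index sort, as EarlyTerminatingSortingCollector does):

class PerSegmentSearchAfterCollector extends FilterCollector {

    private final int numWanted;
    // segment core cache key -> docID of the last hit returned from that segment on earlier pages
    private final Map<Object, Integer> resumeAfter;

    PerSegmentSearchAfterCollector(Collector in, int numWanted, Map<Object, Integer> resumeAfter) {
        super(in);
        this.numWanted = numWanted;
        this.resumeAfter = resumeAfter;
    }

    @Override
    public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
        final int after = resumeAfter.getOrDefault(context.reader().getCoreCacheKey(), -1);
        final LeafCollector delegate = super.getLeafCollector(context);
        return new FilterLeafCollector(delegate) {
            private int collected = 0;

            @Override
            public void collect(int doc) throws IOException {
                if (doc <= after) {
                    return; // already returned on an earlier page
                }
                super.collect(doc);
                if (++collected >= numWanted) {
                    // on an index-sorted segment the remaining docs cannot be more competitive
                    throw new CollectionTerminatedException();
                }
            }
        };
    }
}

After each page the caller would map every returned hit back to its segment (e.g. via ReaderUtil.subIndex) and record the segment-local docID in resumeAfter, so that the next page resumes right behind it; a smarter version could use the index sort to jump to that position instead of filtering doc by doc.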

asfimport commented 8 years ago

Andres de la Peña (migrated from JIRA)

Ok, I see. It would be great for deep pagination to take advantage of index sorting without requiring the number of documents to skip. LUCENE-6766 (https://issues.apache.org/jira/browse/LUCENE-6766) is really promising :)

I'm attaching a patch adding the aforementioned pagination test to TestEarlyTerminatingSortingCollector. It's my first patch for Lucene so please be indulgent.

asfimport commented 8 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Ok, I see. It would be great for deep pagination to take advantage of index sorting without requiring the number of documents to skip. LUCENE-6766 (https://issues.apache.org/jira/browse/LUCENE-6766) is really promising

I agree, I think it's crucial not to require the user to do a bunch of tracking. Otherwise it defeats the point of the searchAfter method, which is to make it easy for things to be more efficient if you want to page.

We could make pagination work better in the case of sorted segments by tracking the last competitive document per segment rather than at the index level. This way, on each sorted segment, we could directly jump to the next competitive document, so the collector would actually only collect numWanted documents rather than numToSkip+numWanted. This would require a custom collector however.

We should not let "would require a custom collector" prevent exploring this. I see it as: a custom collector is already required today, and to boot, paging does not work with it :)

Of course, it is important that, long term, this stuff can work with searchAfter automatically. But I don't see any proposals for how this could work now, and I'm pretty sure that to support it we need to "track more stuff" on behalf of the user.

There are a lot of ways this could work for this collector: a number like Christine's numToSkip combined with topValue, Adrien's per-segment set of docIDs, a per-segment set of docIDs combined with topValue (to keep priority queues at constant size on the unsorted segments), and so on.

But I don't think we should worry about this right now. I think the searchAfter API will need some change regardless, if we want it to work transparently, and that should be thought out carefully.