apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.62k stars 1.02k forks source link

Should DocIdSetBuilder have different implementations for point and terms? [LUCENE-10311] #11347

Open asfimport opened 2 years ago

asfimport commented 2 years ago

DocIdSetBuilder has two API implementations, one for terms queries and one for point values queries. In each cases they are used in totally different way.

For terms the API looks like:

 

/**
 * Add the content of the provided {`@link` DocIdSetIterator} to this builder. NOTE: if you need to
 * build a {`@link` DocIdSet} out of a single {`@link` DocIdSetIterator}, you should rather use {`@link`
 * RoaringDocIdSet.Builder}.
 */
void add(DocIdSetIterator iter) throws IOException;

/** Build a {`@link` DocIdSet} from the accumulated doc IDs. */
DocIdSet build() 

 

For Point Values it looks like:

 

/**
 * Utility class to efficiently add many docs in one go.
 *
 * `@see` DocIdSetBuilder#grow
 */
public abstract static class BulkAdder {
  public abstract void add(int doc);

  public void add(DocIdSetIterator iterator) throws IOException {
    int docID;
    while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      add(docID);
    }
  }
}

/**
 * Reserve space and return a {`@link` BulkAdder} object that can be used to add up to {`@code`
 * numDocs} documents.
 */

/** Build a {`@link` DocIdSet} from the accumulated doc IDs. */
DocIdSet build()  public BulkAdder grow(int numDocs) 

 

 

This is becoming trappy for new developments in the PointValue API.

1) When we call #grow() from the PointValues API, we are not telling the builder how many docs we are going to add (as we don't really know it) but the number of points we are about to visit. This number can be bigger than Integer.MAX_VALUE. Until now, we get around this issue by making sure we don't call this API when we need to add more than Integer.MAX_VALUE points. In that case we will navigate the tree down until the number of points is reduced and they can fit in an int.

This has work well until now because we are calling grow from inside the BKD reader, and the BKD writer/reader makes sure than the number of points in a leaf can fit in an int. In LUCENE-, we re moving into a cursor-like API which does not enforce that the number of points on a leaf needs to fit in an int.  This causes friction and inconsistency in the API.

 

2) This a secondary issue that I found when thinking in this issue. In Lucene- we added the possibility to add a DocIdSetIterator from the PointValues API.  Therefore there are two ways to add those kind of objects to a DocIdSetBuilder which can end up in different results:

 

{
  // Terms API
  docIdSetBuilder.add(docIdSetIterator); 
}
{
  // Point values API
  docIdSetBuilder.grow(doc).add(docIdSetIterator)
}

 

I wonder if we need to rethink this API, should we have different implementation for Terms and Point values?


Migrated from LUCENE-10311 by Ignacio Vera (@iverase), updated Mar 16 2022 Pull requests: https://github.com/apache/lucene/pull/692

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit ca73ed1c2842b10c338f1d27ec54cead69ac090e in lucene's branch refs/heads/main from Adrien Grand https://gitbox.apache.org/repos/asf?p=lucene.git;h=ca73ed1

LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually approximate). (#710)

This computes a pop count on a sample of the longs that back the bitset.

Quick benchmarks suggest that this runs 5x-10x faster than FixedBitSet#cardinality depending on the length of the bitset.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit bb10e62dff886e4fcb3af58b37f2f28959aef3f7 in lucene's branch refs/heads/branch_9x from Adrien Grand https://gitbox.apache.org/repos/asf?p=lucene.git;h=bb10e62

LUCENE-10311: Make FixedBitSet#approximateCardinality faster (and actually approximate). (#710)

This computes a pop count on a sample of the longs that back the bitset.

Quick benchmarks suggest that this runs 5x-10x faster than FixedBitSet#cardinality depending on the length of the bitset.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit ae16917c1defd0db6334a4424727157fe12169ab in lucene's branch refs/heads/main from Adrien Grand https://gitbox.apache.org/repos/asf?p=lucene.git;h=ae16917

LUCENE-10311: Remove pop_XXX helpers from BitUtil. (#724)

As @rmuir noted, it would be as simple and create less cognitive overhead to use Long#bitCount directly.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit b5fe307c6ff877e1c7479ef7ee628fc8d99c883c in lucene's branch refs/heads/branch_9x from Adrien Grand https://gitbox.apache.org/repos/asf?p=lucene.git;h=b5fe307

LUCENE-10311: Remove pop_XXX helpers from BitUtil. (#724)

As @rmuir noted, it would be as simple and create less cognitive overhead to use Long#bitCount directly.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit e999056c19d98b5dbd6434f6986e19c69cdb28ab in lucene's branch refs/heads/main from Dawid Weiss https://gitbox.apache.org/repos/asf?p=lucene.git;h=e999056

LUCENE-10311: avoid division by zero on small sets.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 38b4bbf74e25a5e578486ba434751a3f361912f5 in lucene's branch refs/heads/branch_9x from Dawid Weiss https://gitbox.apache.org/repos/asf?p=lucene.git;h=38b4bbf

LUCENE-10311: avoid division by zero on small sets.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit ea989fe8f305040701527fe3cf9c68ed99dc9c42 in lucene's branch refs/heads/branch_9_1 from Dawid Weiss https://gitbox.apache.org/repos/asf?p=lucene.git;h=ea989fe

LUCENE-10311: avoid division by zero on small sets.