apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.65k stars 1.03k forks source link

Spatial, enhance RPT to differentiate confirmed from non-confirmed hits, then validate with SDV [LUCENE-5579] #6641

Closed asfimport closed 9 years ago

asfimport commented 10 years ago

If a cell is within the query shape (doesn't straddle the edge), then you can be sure that all documents it matches are a confirmed hit. But if some documents are only on the edge cells, then those documents could be validated against SerializedDVStrategy for precise spatial search. This should be much faster than using RPT and SerializedDVStrategy independently on the same search, particularly when a lot of documents match.

Perhaps this'll be a new RPT subclass, or maybe an optional configuration of RPT. This issue is just for the Intersects predicate, which will apply to Disjoint. Until resolved in other issues, the other predicates can be handled in a naive/slow way by creating a filter that combines RPT's filter and SerializedDVStrategy's filter using BitsFilteredDocIdSet.

One thing I'm not sure of is how to expose to Lucene-spatial users the underlying functionality such that they can put other query/filters in-between RPT and the SerializedDVStrategy. Maybe that'll be done by simply ensuring the predicate filters have this capability and are public.

It would be ideal to implement this capability after the PrefixTree term encoding is modified to differentiate edge leaf-cells from non-edge leaf cells. This distinction will allow the code here to make more confirmed matches.


Migrated from LUCENE-5579 by David Smiley (@dsmiley), 1 vote, resolved Apr 10 2015 Attachments: LUCENE-5579_CompositeSpatialStrategy.patch (versions: 2), LUCENE-5579_SPT_leaf_covered.patch, spatial.alg Linked issues:

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

This first patch is the first step / phase, which differentiates "covered" leaf cells (cells that were within the indexed shape) from other cells which are approximated leaf cells. The next phase will be augmenting the Intersects filter and possibly others to collect exact hits in conjunction with the fuzzy hits.

During the benchmarking I learned some interesting things:

The attached patch includes some refactoring to share common logic between Contains & AVPTF (the base of Within, Intersects, and heatmap). I need to add a configurable flag to indicate if leaves should be differentiated in the first place, since you might not want that, and another flag to adjust how much pruning of the covered leaves happens. Both flags should be safe to change without any re-indexing; it could be changed whenever. Obviously if you don't have the covered leaf differentiation then you won't get the full benefit later when we have exact match collection, just partial.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I moved the leaf cell differentiation to its own issue where it belongs – #7423.

The attached patch addresses the two-phase index + verify requirement in one new "CompositeStrategy" so that we can get speed + accuracy and with convenience made possible by Lucene 5.1's low-level TwoPhaseIterator. I'm not married to the name; I'm not sure what to call it. This patch contains two Query implementations – one specifically for optimizing the Intersects predicate which uniquely retains a DocIdSet of "exact" (aka pre-confirmed) hits separate from the approximate set overall. The other one is more generic but must verify every hit by looking up the geometry and applying the predicate.

There are a lot of TODO comments and I haven't tested it thoroughly to my liking yet. So it's not done, but it does seem to work.

I did some benchmarking but had to stop as I can't spend more time on this right now. I'm puzzled why I don't see any performance improvements using the optimized Intersects predicate. I debugged it and observed that the benchmark setup I have doesn't seem to be yielding any exact/confirmed hits, oddly enough. Yet the testing I have does show this happens. So maybe it's a benchmark config bug, or who knows.

The patch includes a refactoring to make org.apache.lucene.search.ConstantScoreWeight public instead of package level. It's very general & useful. I think a similar thing could be done with a constant scoring Scorer.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I fixed a bug (some unfinished code I overlooked) and I finally see the performance numbers I've been expecting to see. With the new approx/exact differentiated Intersects predicate, the benchmarked queries were \~83% faster compared to without. YMMV a ton. These shapes were all geodetic circles; which do have some trig but I bet a polygon, esp. a non-trivial polygon, should see more improvement. This test used distErrPct=0.2 which will yield a tiny index & fast indexing but super-approximated shapes (very blocky looking). By using distErrPct=0.1, the relative improvement became 100% (2x) since more detail allows more hits to be in the "exact" bucket. The index increased in size 93% though. Note even at 0.1, this index is about 1/4th the size of the default RPT configuration.

Now I need to wrap up the TODOs; including test a bit more. Maybe re-think the name of this thing; although CompositeSpatialStrategy ain't bad. Perhaps this could all go right into SerializedDVStrategy and then make this index portion being added here optional? On the other hand... SerializedDVStrategy is but one specific way (BinaryDocValues) to retrieve the shape. Granted we don't have any alternative similar nor do I plan to come up with one. Or this code could go into RPT, so that you could optionally add the precision of the serialized geometry if you so choose. Hmmm.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

This is the latest patch, along with a benchmark .alg file (which depends on #7459 to make a comparison in one run). For whatever reason, I'm only seeing a 40% increase in speed now; I'm not sure what changed in the benchmark or recent trunk changes. Again, YMMV a ton.

@jpountz In this patch I added a method to BitDocIdSet.Builder:

    /**
     * Is this builder definitely empty?  If so, {`@link` #build()} will return null.  This is usually the same as
     * simply being empty but if this builder was constructed with the {`@code` full} option or if an iterator was passed
     * that iterated over no documents, then we're not sure.
     */
    public boolean isDefinitelyEmpty() {
      return sparseSet == null && denseSet == null;
    }

Cool? Another non-spatial change in this patch is making ConstantScoreWeight public, not package-local. Should be be `@lucene.experimental or@lucene`.internal? It seems generic enough.

At this point I think it's ready to commit. I haven't cared enough about where the code should live to bother changing it from it's current form as an independent SpatialStrategy.

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1672727 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1672727

LUCENE-5579: BitDocIdSet.Builder.isDefinitelyEmpty()

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1672729 from @dsmiley in branch 'dev/branches/branch_5x' https://svn.apache.org/r1672729

LUCENE-5579: BitDocIdSet.Builder.isDefinitelyEmpty()

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1672730 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1672730

LUCENE-5579: ConstantScoreWeight is now public but marked @lucene.internal

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1672731 from @dsmiley in branch 'dev/branches/branch_5x' https://svn.apache.org/r1672731

LUCENE-5579: ConstantScoreWeight is now public but marked @lucene.internal

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1672736 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1672736

LUCENE-5579: CompositeSpatialStrategy (RPT + SDV) with optimized Intersect

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1672740 from @dsmiley in branch 'dev/branches/branch_5x' https://svn.apache.org/r1672740

LUCENE-5579: CompositeSpatialStrategy (RPT + SDV) with optimized Intersect

asfimport commented 9 years ago

Anshum Gupta (@anshumg) (migrated from JIRA)

Bulk close for 5.2.0.