apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.65k stars 1.03k forks source link

Indexed non-point shapes index excessive terms [LUCENE-4942] #6006

Closed asfimport closed 9 years ago

asfimport commented 11 years ago

Indexed non-point shapes are comprised of a set of terms that represent grid cells. Cells completely within the shape or cells on the intersecting edge that are at the maximum detail depth being indexed for the shape are denoted as "leaf" cells. Such cells have a trailing '+' at the end. Such tokens are actually indexed twice, one with the leaf byte and one without.

The TermQuery based PrefixTree Strategy doesn't consider the notion of 'leaf' cells and so the tokens with '+' are completely redundant.

The Recursive [algorithm] based PrefixTree Strategy better supports correct search of indexed non-point shapes than TermQuery does and the distinction is relevant. However, the foundational search algorithms used by this strategy (Intersects & Contains; the other 2 are based on these) could each be upgraded to deal with this correctly. Not trivial but very doable.

In the end, spatial non-point indexes can probably be trimmed my \~40% by doing this.


Migrated from LUCENE-4942 by David Smiley (@dsmiley), 1 vote, resolved Mar 10 2015 Attachments: LUCENE-4942_non-point_excessive_terms.patch (versions: 2), LUCENE-4942-clone.diff, spatial.alg Linked issues:

asfimport commented 11 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Without the + (or equivalent) how do you know that everything below that is covered by the shape?

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

You don't ;-) This is why I believe TermQueryStrategy is fundamentally flawed for indexing non-point shapes. Yet AFAIK it's the choice ElasticSearch wants to use (or at least wanted). In ES if you indexed a country and your search box is something small in the middle of that country, you won't match that country.

To be clear I'm recommending two things:

In both cases, the semantics are the same; no new or fewer documents match. But the spatial index is \~40% smaller I figure, faster indexing as well. It's possible some of the search algorithms for RecursivePrefixTreeStrategy will be slightly slower since sometimes they'll need to visit an additional token at certain parts of the algorithms to check for both leaf and non-leaf indexed cells but I think it'll be quite negligible.

asfimport commented 11 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

I see – so only index the leaves and traverse the terms for each query rather then a pile of term queries.

Sounds good, but it seems like benchmarking is the only way to know if it is a reasonable tradeoff!

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

There definitely needs to be benchmarking for spatial; but I feel confident in this case that that it'll be well worth it for RPT; I'm quite familiar with the algorithms in there. It's an unquestionable win-win for TermQueryStrategy.

asfimport commented 10 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Somewhat related to this is my newfound realization that indexed non-point shapes will result in IntersectsPrefixTreeFilter (technically it's actually VisitorTemplate) scanning over these smallest grid cells / terms twice and thus calculate intersection twice – once with the leaf flag, once without. This is likely a major performance bug. It would be awkward to fix that right now, but it would be easy once there simply wasn't this redundant indexing of terms – hence this issue.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I'd like to solve this issue (excessive terms) and also address differentiating between fully-contained leaves vs approximated leaves (for LUCENE-5776) in one go tracked on this issue to avoid dealing with back-compat more than once. That is, just once we change how PrefixTree derivative strategies encode the term data, instead of doing over more than one issue. And I'm thinking on trunk wouldn't worry about the back-compat (it is trunk after all), and then the port to 5x would have to consider it – the down-side being some spatial code on trunk vs 5x may vary a bit. Perhaps the back-compat detection in 5x would work via a check for Version similar to Analyzer's having a version property that can optionally be set.

I'm not sure how contentious it may be to simply forgo back-compat. Just re-index. And you're not affected if all you have is point data, which seems to be at least 80% of the users using spatial. And you're furthermore not affected if your pre-existing indexes have non-point data but the only predicate you use is Intersects (no Contains, no Within, no heatmaps). Again I'd guess that lobs off another 80% of users since Intersects is so common.

asfimport commented 9 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

+1

I'm not sure how contentious it may be to simply forgo back-compat

I expect anyone would would be affected would rejoice at the tradeoff. As is, the people who would be affected either have very few documents or ginormous indexes.

We could take a poll on solr-user to see if anyone is using RPT for non-points and a query is not Intersects (no need to worry about heatmaps... it has not been released yet!)

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

The attached patch does not have the "+" / "*" (approximated leaf vs contained leaf) leaf type differentiation; that can wait.

Summary of patch changes:

Tests pass; I'll try precommit later. I've yet to try lucene/benchmark and examine the index size change.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Boo ya! It turns out the changes work with the previous index scheme with a trivial change. I tested this by temporarily subclassing RPT to return a CellToBytesRefIterator that has the redundant prefixes on leaves. I'll now create this as another test (subclassing the regular one) so that this stays tested.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I ran a benchmark comparison. The improvements are sweet all around! 33% smaller index 35% less memory while indexing 68% faster indexing (wow) 44% faster searching (wow)

I used a benchmark '.alg' file with quad tree, with reproducibly random circles to index and to search. Search was "Intersects". I attached the spatial.alg should someone want to see more parameters.

asfimport commented 9 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

nice!

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Here's a slightly updated patch, doing the CellTokenStream rename, and addressed the nocommits. ant precommit passes. I think this is ready to commit and get Jenkins beating on it.

I think the Index tweaks planned for #6641 need not bleed into this issue after all since I think back-compat will also come for free (no special code) and besides the leaf cell type differentiation should be set with a flag (do you want the differentiation or not?).

asfimport commented 9 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

+1 – i did not run the benchmark, but was able to merge and have everything work

great work!

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1665656 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1665656

LUCENE-4942: Optimize spatial PrefixTreeStrategy indexes for non-point data

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1665674 from @dsmiley in branch 'dev/branches/branch_5x' https://svn.apache.org/r1665674

LUCENE-4942: Optimize spatial PrefixTreeStrategy indexes for non-point data

asfimport commented 9 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

when fixing a failing test for SOLR-7228, it looks like the issue is caused by changes here.

Without this patch, the admin UI shows all tokens having the same value. I mucked with the cone() function and now they show the right values. I don't think cone() is called in normal operation – only for the debug properties in the admin UI

@dsmiley can you take a look?

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1665949 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1665949

LUCENE-4942: Fix BytesRefIteratorTokenStream's attribute clone method.

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1665951 from @dsmiley in branch 'dev/branches/branch_5x' https://svn.apache.org/r1665951

LUCENE-4942: Fix BytesRefIteratorTokenStream's attribute clone method.

asfimport commented 9 years ago

Timothy Potter (@thelabdude) (migrated from JIRA)

Bulk close after 5.1 release