Closed asfimport closed 9 years ago
Ryan McKinley (@ryantxu) (migrated from JIRA)
Without the + (or equivalent) how do you know that everything below that is covered by the shape?
David Smiley (@dsmiley) (migrated from JIRA)
You don't ;-) This is why I believe TermQueryStrategy is fundamentally flawed for indexing non-point shapes. Yet AFAIK it's the choice ElasticSearch wants to use (or at least wanted). In ES if you indexed a country and your search box is something small in the middle of that country, you won't match that country.
To be clear I'm recommending two things:
In both cases, the semantics are the same; no new or fewer documents match. But the spatial index is \~40% smaller I figure, faster indexing as well. It's possible some of the search algorithms for RecursivePrefixTreeStrategy will be slightly slower since sometimes they'll need to visit an additional token at certain parts of the algorithms to check for both leaf and non-leaf indexed cells but I think it'll be quite negligible.
Ryan McKinley (@ryantxu) (migrated from JIRA)
I see – so only index the leaves and traverse the terms for each query rather then a pile of term queries.
Sounds good, but it seems like benchmarking is the only way to know if it is a reasonable tradeoff!
David Smiley (@dsmiley) (migrated from JIRA)
There definitely needs to be benchmarking for spatial; but I feel confident in this case that that it'll be well worth it for RPT; I'm quite familiar with the algorithms in there. It's an unquestionable win-win for TermQueryStrategy.
David Smiley (@dsmiley) (migrated from JIRA)
Somewhat related to this is my newfound realization that indexed non-point shapes will result in IntersectsPrefixTreeFilter (technically it's actually VisitorTemplate) scanning over these smallest grid cells / terms twice and thus calculate intersection twice – once with the leaf flag, once without. This is likely a major performance bug. It would be awkward to fix that right now, but it would be easy once there simply wasn't this redundant indexing of terms – hence this issue.
David Smiley (@dsmiley) (migrated from JIRA)
I'd like to solve this issue (excessive terms) and also address differentiating between fully-contained leaves vs approximated leaves (for LUCENE-5776) in one go tracked on this issue to avoid dealing with back-compat more than once. That is, just once we change how PrefixTree derivative strategies encode the term data, instead of doing over more than one issue. And I'm thinking on trunk wouldn't worry about the back-compat (it is trunk after all), and then the port to 5x would have to consider it – the down-side being some spatial code on trunk vs 5x may vary a bit. Perhaps the back-compat detection in 5x would work via a check for Version similar to Analyzer's having a version property that can optionally be set.
I'm not sure how contentious it may be to simply forgo back-compat. Just re-index. And you're not affected if all you have is point data, which seems to be at least 80% of the users using spatial. And you're furthermore not affected if your pre-existing indexes have non-point data but the only predicate you use is Intersects (no Contains, no Within, no heatmaps). Again I'd guess that lobs off another 80% of users since Intersects is so common.
Ryan McKinley (@ryantxu) (migrated from JIRA)
+1
I'm not sure how contentious it may be to simply forgo back-compat
I expect anyone would would be affected would rejoice at the tradeoff. As is, the people who would be affected either have very few documents or ginormous indexes.
We could take a poll on solr-user to see if anyone is using RPT for non-points and a query is not Intersects (no need to worry about heatmaps... it has not been released yet!)
David Smiley (@dsmiley) (migrated from JIRA)
The attached patch does not have the "+" / "*" (approximated leaf vs contained leaf) leaf type differentiation; that can wait.
Summary of patch changes:
Tests pass; I'll try precommit later. I've yet to try lucene/benchmark and examine the index size change.
David Smiley (@dsmiley) (migrated from JIRA)
Boo ya! It turns out the changes work with the previous index scheme with a trivial change. I tested this by temporarily subclassing RPT to return a CellToBytesRefIterator that has the redundant prefixes on leaves. I'll now create this as another test (subclassing the regular one) so that this stays tested.
David Smiley (@dsmiley) (migrated from JIRA)
I ran a benchmark comparison. The improvements are sweet all around! 33% smaller index 35% less memory while indexing 68% faster indexing (wow) 44% faster searching (wow)
I used a benchmark '.alg' file with quad tree, with reproducibly random circles to index and to search. Search was "Intersects". I attached the spatial.alg should someone want to see more parameters.
Ryan McKinley (@ryantxu) (migrated from JIRA)
nice!
David Smiley (@dsmiley) (migrated from JIRA)
Here's a slightly updated patch, doing the CellTokenStream rename, and addressed the nocommits. ant precommit passes. I think this is ready to commit and get Jenkins beating on it.
I think the Index tweaks planned for #6641 need not bleed into this issue after all since I think back-compat will also come for free (no special code) and besides the leaf cell type differentiation should be set with a flag (do you want the differentiation or not?).
Ryan McKinley (@ryantxu) (migrated from JIRA)
+1 – i did not run the benchmark, but was able to merge and have everything work
great work!
ASF subversion and git services (migrated from JIRA)
Commit 1665656 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1665656
LUCENE-4942: Optimize spatial PrefixTreeStrategy indexes for non-point data
ASF subversion and git services (migrated from JIRA)
Commit 1665674 from @dsmiley in branch 'dev/branches/branch_5x' https://svn.apache.org/r1665674
LUCENE-4942: Optimize spatial PrefixTreeStrategy indexes for non-point data
Ryan McKinley (@ryantxu) (migrated from JIRA)
when fixing a failing test for SOLR-7228, it looks like the issue is caused by changes here.
Without this patch, the admin UI shows all tokens having the same value. I mucked with the cone() function and now they show the right values. I don't think cone() is called in normal operation – only for the debug properties in the admin UI
@dsmiley can you take a look?
ASF subversion and git services (migrated from JIRA)
Commit 1665949 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1665949
LUCENE-4942: Fix BytesRefIteratorTokenStream's attribute clone method.
ASF subversion and git services (migrated from JIRA)
Commit 1665951 from @dsmiley in branch 'dev/branches/branch_5x' https://svn.apache.org/r1665951
LUCENE-4942: Fix BytesRefIteratorTokenStream's attribute clone method.
Timothy Potter (@thelabdude) (migrated from JIRA)
Bulk close after 5.1 release
Indexed non-point shapes are comprised of a set of terms that represent grid cells. Cells completely within the shape or cells on the intersecting edge that are at the maximum detail depth being indexed for the shape are denoted as "leaf" cells. Such cells have a trailing '+' at the end. Such tokens are actually indexed twice, one with the leaf byte and one without.
The TermQuery based PrefixTree Strategy doesn't consider the notion of 'leaf' cells and so the tokens with '+' are completely redundant.
The Recursive [algorithm] based PrefixTree Strategy better supports correct search of indexed non-point shapes than TermQuery does and the distinction is relevant. However, the foundational search algorithms used by this strategy (Intersects & Contains; the other 2 are based on these) could each be upgraded to deal with this correctly. Not trivial but very doable.
In the end, spatial non-point indexes can probably be trimmed my \~40% by doing this.
Migrated from LUCENE-4942 by David Smiley (@dsmiley), 1 vote, resolved Mar 10 2015 Attachments: LUCENE-4942_non-point_excessive_terms.patch (versions: 2), LUCENE-4942-clone.diff, spatial.alg Linked issues:
6592