Indexed non-point shapes index excessive terms [LUCENE-4942]

asfimport commented 11 years ago

Indexed non-point shapes consist of a set of terms that represent grid cells. Cells completely within the shape, or cells on the intersecting edge that are at the maximum detail depth being indexed for the shape, are denoted as "leaf" cells. The tokens for such cells have a trailing '+'. These tokens are actually indexed twice, once with the leaf byte and once without.
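To make the redundancy concrete, here is a minimal sketch of the token set described above. It is a toy model, not Lucene code: the function name `index_tokens_current`, the quad-tree style cell names ("A", "AB", ...) and the example cell list are all made up for illustration.

```python
def index_tokens_current(cells):
    """Toy model of the indexing described above.

    cells: iterable of (token, is_leaf) pairs for one indexed shape.
    Every cell is indexed in its plain form; leaf cells are indexed
    a second time with a trailing '+'.
    """
    tokens = []
    for token, is_leaf in cells:
        tokens.append(token)  # plain form, always emitted
        if is_leaf:
            tokens.append(token + "+")  # redundant second form for leaves
    return tokens


# Hypothetical shape: two ancestor cells and three leaf cells.
cells = [("A", False), ("AB", True), ("AC", False), ("ACA", True), ("ACD", True)]
tokens = index_tokens_current(cells)
print(tokens)
```

In this made-up example, five cells produce eight terms, because each of the three leaf cells is written twice.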

The TermQuery based PrefixTree Strategy doesn't consider the notion of 'leaf' cells and so the tokens with '+' are completely redundant.

The Recursive [algorithm] based PrefixTree Strategy better supports correct search of indexed non-point shapes than TermQuery does and the distinction is relevant. However, the foundational search algorithms used by this strategy (Intersects & Contains; the other 2 are based on these) could each be upgraded to deal with this correctly. Not trivial but very doable.

In the end, spatial non-point indexes can probably be trimmed by \~40% by doing this.

Migrated from LUCENE-4942 by David Smiley (@dsmiley), 1 vote, resolved Mar 10 2015

Attachments: LUCENE-4942_non-point_excessive_terms.patch (versions: 2), LUCENE-4942-clone.diff, spatial.alg

Linked issues: 6592

asfimport commented 11 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

Without the + (or equivalent) how do you know that everything below that is covered by the shape?

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

You don't ;-) This is why I believe TermQueryStrategy is fundamentally flawed for indexing non-point shapes. Yet AFAIK it's the choice ElasticSearch wants to use (or at least wanted). In ES if you indexed a country and your search box is something small in the middle of that country, you won't match that country.
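A toy sketch of that country / small-box case, under the assumption that a term-based lookup only asks whether one of the query's own cell tokens was indexed. The function names and cell tokens are hypothetical; this is not the Lucene or ES implementation.

```python
def term_lookup_matches(indexed_tokens, query_cells):
    """Match only if a query cell token is itself an indexed term."""
    return any(cell in indexed_tokens for cell in query_cells)


def leaf_aware_matches(indexed_tokens, query_cells):
    """Also match when an ancestor of a query cell was indexed as a leaf,
    since a leaf covers everything beneath it."""
    for cell in query_cells:
        if cell in indexed_tokens:
            return True
        # Check every proper prefix (ancestor cell) for a leaf marker.
        if any(cell[:depth] + "+" in indexed_tokens for depth in range(1, len(cell))):
            return True
    return False


# The "country" fills the whole coarse cell A, so A is a leaf and is not subdivided.
country_tokens = {"A", "A+"}
# The small search box sits deep inside that cell.
small_box_cells = ["ABCD"]

term_hit = term_lookup_matches(country_tokens, small_box_cells)
leaf_hit = leaf_aware_matches(country_tokens, small_box_cells)
print(term_hit, leaf_hit)
```

The plain term lookup misses because no term "ABCD" was ever indexed for the country; only a search that understands the leaf marker on "A" can tell that the box is covered.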

To be clear I'm recommending two things:

1. Have TermQueryStrategy not index its leaves with the '+' – it doesn't use them.
2. Have RecursivePrefixTreeStrategy only index the leaf versions of those leaf cells, not a redundant non-leaf version. Some non-trivial code needs to change in a few of the search algorithms.

In both cases, the semantics are the same; no new or fewer documents match. But I figure the spatial index would be \~40% smaller, and indexing faster as well. It's possible some of the search algorithms for RecursivePrefixTreeStrategy will be slightly slower, since at certain points they'll need to visit an additional token to check for both leaf and non-leaf indexed cells, but I think the cost will be quite negligible.
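The two recommendations can be sketched roughly as follows. This is a toy model with hypothetical function names and a made-up cell list, not the actual patch; `cell_is_indexed` stands in for the extra check the search side would need once a cell exists in only one of its two forms.

```python
def tokens_term_query_proposed(cells):
    """TermQueryStrategy proposal: plain tokens only, no '+' forms."""
    return [token for token, _is_leaf in cells]


def tokens_rpt_proposed(cells):
    """RecursivePrefixTreeStrategy proposal: leaf cells only in '+' form."""
    return [token + "+" if is_leaf else token for token, is_leaf in cells]


def cell_is_indexed(token_set, token):
    """Search-side check: a cell may now be present in either form."""
    return token in token_set or token + "+" in token_set


# Hypothetical shape: two ancestor cells and three leaf cells.
cells = [("A", False), ("AB", True), ("AC", False), ("ACA", True), ("ACD", True)]
leaf_count = sum(1 for _token, is_leaf in cells if is_leaf)
current_count = len(cells) + leaf_count  # today: every leaf is written twice

term_tokens = tokens_term_query_proposed(cells)
rpt_tokens = tokens_rpt_proposed(cells)
print(current_count, len(term_tokens), len(rpt_tokens))  # prints "8 5 5"
```

The saving in this example depends entirely on the invented cell list; real numbers would have to come from benchmarking actual shapes.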

asfimport commented 11 years ago

Ryan McKinley (@ryantxu) (migrated from JIRA)

I see – so only index the leaves and traverse the terms for each query rather than a pile of term queries.

Sounds good, but it seems like benchmarking is the only way to know if it is a reasonable tradeoff!

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

There definitely needs to be benchmarking for spatial, but I feel confident in this case that it'll be well worth it for RPT; I'm quite familiar with the algorithms in there. It's an unquestionable win-win for TermQueryStrategy.

asfimport commented 10 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Somewhat related to this is my newfound realization that indexed non-point shapes will result in IntersectsPrefixTreeFilter (technically it's actually VisitorTemplate) scanning over these smallest grid cells / terms twice and thus calculating the intersection twice – once with the leaf flag, once without. This is likely a major performance bug. It would be awkward to fix that right now, but it would be easy once this redundant indexing of terms is gone – hence this issue.