apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.59k stars 1.01k forks source link

Add PackedQuadPrefixTree [LUCENE-6422] #7481

Closed asfimport closed 9 years ago

asfimport commented 9 years ago

This task introduces a PackedQuadPrefixTree that includes two things:

  1. A packed 8 byte representation for a QuadCell, including more efficient implementations of the SPT API than the existing QuadPrefixTree or GeoHashPrefixTree.
  2. An alternative implementation to RPT's "pruneLeafyBranches" that streams the cells without buffering them all, which is way more memory efficient. However pruning is limited to the target detail level, where it accomplishes the most good.

Future improvements over this approach may include the generation of the packed cells using an AutoPrefixAutomaton


Migrated from LUCENE-6422 by Nick Knize (@nknize), resolved Apr 23 2015 Attachments: LUCENE-6422_with_SPT_factory_and_benchmark.patch, LUCENE-6422.patch (versions: 5), LUCENE-6422-TRUNK.patch

asfimport commented 9 years ago

Nick Knize (@nknize) (migrated from JIRA)

Developed and tested against branch_5x. I need to run more rigorous (and systematic) benchmarking but preliminary tests show a 90% reduction in index size on exotic cases (high precision).

One particular shape (the political boundary of Wales): using QuadPrefixTree (w/ RecursivePrefixStrategy) consumed 1G of memory at TreeLevel 26 with distance_error_pct: 0. The new PackedQuadPrefixTree brought this down to just over 80mb with the same precision.

There are many improvements remaining (including using variable byte array instead of 8 bytes for even the lowest levels). But this provides initial progress that should open the door for better precision on extreme shapes.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Awesome work Nick! It's so nice to see meaty spatial contributions like this (Geo3d is another example).

RE "Streaming" (transient memory use while indexing): I appreciate that the out-of-the box configuration of RPT with either LegacyPrefixTree (be it quad or geohash) will use a lot of memory for indexing. But since... I don't know how long now, this only occurs if the "leafy branch pruning" optimization is enabled on RPT. That algorithm, existing on RecursivePrefixTreeStrategy, unfortunately buffers all the cells. It's somewhat simple; it could be improved to not buffer all cells but it would need to buffer some. Recently I did some benchmarking and found that the leafy branch pruning yielded lots of index size savings, particularly with the quad tree. I'd love to chat with you about the subject of "leaves" on the SPT and an idea I have on doing better. Any way, I suggest you do another memory benchmark with leafy branch pruning disabled with the PackedQuadTree but not the StreamingQuad...Strategy. With it disabled, the underlying BytesRefIteratortokenStream will consume a Iterator<Cell> that is a direct instance of TreeCellIterator, and then you get the "streaming" effect. The existing TreeCellIterator is quite similar to the Streaming...PrefixTreeIterator here. If I'm right about there being no appreciable memory savings, then this part of the patch can be removed as it's redundant.

I really like the new PackedQuadPrefixTree.java. (IMO that's what this JIRA issue is mostly about) Can you consider not subclassing Legacy* ? I'd like to leave the legacy trees as-is and new SPTs not inherit from it. Can you base your next patch off of trunk? And can you either post on reviewboard.apache.org or use a GitHub fork & branch so I can provide by-line feedback?

asfimport commented 9 years ago

Nick Knize (@nknize) (migrated from JIRA)

Thanks David. I will certainly be doing a more thorough benchmark and will start with the suggestions. I imagine the savings will not be as extreme as in the situation with Wales (that was just the most interesting case.)

With it disabled, the underlying BytesRefIteratortokenStream will consume a Iterator<Cell> that is a direct instance of TreeCellIterator, and then you get the "streaming" effect.

Just a few thoughts, the StreamingPrefixTreeIterator gives the benefit of a few worlds:

  1. It uses an on-demand DFS through bit shifting completely eliminating the need for the stack DFS logic in TreeCellIterator.hasNext. I suppose code-wise it would be cleaner to subclass TreeCellIterator and just override hasNext (and possibly next since I don't set thisCell/current in next)? That's a good idea for code maintenance and reuse.
  2. The on-demand DFS traversal already achieves a "leafy branch pruning" effect by not descending on Cells that already fall "within" the shape. This gives you pruning without having to buffer anything (other than the current and next cell). This does vary a little bit in that the RPT simply prunes all 4 "leaves" that "intersect" the shape.

Can you consider not subclassing Legacy* ?

I'll certainly take a look at this. I saw the comment about not subclassing and thought about it, but since there is so much reuse with the bytes[], b_off, and b_len (which could be a BytesRef) it didn't make much sense duplicating code. Are you suggesting duplicating code and eventually deprecating the LegacyCell?

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

  1. It uses an on-demand DFS through bit shifting completely eliminating the need for the stack DFS logic in TreeCellIterator.hasNext. I suppose code-wise it would be cleaner to subclass TreeCellIterator and just override hasNext (and possibly next since I don't set thisCell/current in next)? That's a good idea for code maintenance and reuse.

Interesting; I need to look at that code closer. Nonetheless, I think a stack vs completely stream one by one is unlikely to be noticeable in a benchmark. The memory use is capped at a small amount either way. If we find there are measurable advantages here, then couldn't you provide this via overriding SpatialPrefixTree.getTreeCellIterator instead of overriding the SpatialStrategy?

  1. The on-demand DFS traversal already achieves a "leafy branch pruning" effect by not descending on Cells that already fall "within" the shape. This gives you pruning without having to buffer anything (other than the current and next cell). This does vary a little bit in that the RPT simply prunes all 4 "leaves" that "intersect" the shape.

I think we may be talking about different things. A "leafy branch" as I defined it in Lucene spatial is a parent cell in which all sub-cells that could theoretically exist are there for this shape and are also leaves. So for a quad-tree, it's a parent with 4 sub-cells that are all leaves. A "leaf" is a cell that is either within the shape from which it was derived or it overlaps the edge but we've reached a precision threshold. Indexing the 4 leaves produces a larger index than not emitting those leaves at all and instead marking the parent as a leaf – and the end effect is semantically equivalent in terms of the search predicates. There are some trade-offs; but I won't digress now.

Are you suggesting duplicating code and eventually deprecating the LegacyCell?

Yes but I need to apply the patch and see how much code is at stake here. At the time I introduced "LegacyCell", I refined the SpatialPrefixTree API and was unsatisfied with the only two implementations at that time, in terms of efficiencies, so I called it LegacyPrefixTree and LegacyCell. Perhaps these could still stick around still; I welcome your input on what it might be named or if there is too little code to worry about. But keeping having this extend "LegacyCell" is problematic because RecursivePrefixTreeStrategy makes an assumption for the branch pruning you can see there, right at the beginning.

I welcome whatever thoughts you may have on the API to make it more clear, faster, whatever.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Can we commit both approaches for "streaming" and refactor later (progress not perfection)?

I like that the new streaming approach has fewer abstractions on quick glance.

I also think further benchmarks need not block progress: they can come later.

asfimport commented 9 years ago

Nick Knize (@nknize) (migrated from JIRA)

I think a stack vs completely stream one by one is unlikely to be noticeable in a benchmark.

That's a possibility. It definitely warrants a benchmark to know for sure. I think it could be noticeable for "deep" QuadTrees? (aka: anything with high precision) as the string representation for one child can get upwards of 29 bytes vs. 8. But its really speculation without benchmark confirmation. Also, just eliminating the need for stack-based DFS logic is a more straightforward approach, no? (no crazy data structures with transient state needed since we're just lexicographically incrementing a compact bit representation on demand)

...couldn't you provide this via overriding SpatialPrefixTree.getTreeCellIterator instead of overriding the SpatialStrategy?

I like this suggestion. If I set pruneLeafyBranches to false and override getTreeCellIterator in PackedQuadPrefixTree then, yes, RPT will call the PackedQuadPrefixTree.getTreeCellIterator from RPT.createCellIteratorToIndex and I can eliminate the StreamingPrefixTreeStrategy class altogether. At that point I would suggest we rename RecursivePrefixTreeStrategy to something more "accurate", but that can be an API discussion.

I think we may be talking about different things.

We're talking about the same thing. With one difference in this implementation. At the moment leaves that are on the edge (e.g., intersects) are not pruned. They probably could be - and this optimization would save even more space than its already saving.

and the end effect is semantically equivalent in terms of the search predicates.

I forgot to mention, this enhancement allows one to disable that "memory saving error factor" (by setting distance_error_pct to 0) and the result is quad cell precision regardless of shape size (with minimal increase in index size). This was the initial driver for the improvement. Allowing one to eliminate false positives - achieve results that are within the precision they specify without having a fuzzy error factor based on the size of the shape. It doesn't eliminate that factor, it just provides the ability to disable it and achieve your desired precision (subject to earth curvature error) without saturating memory and blowing out the index.

Yes but I need to apply the patch and see how much code is at stake here.

I honestly think leaving that to another issue is the best strategy (progress not perfection). That's something I can look at as a next step. Until that time comes I don't think there's a problem having this enhancement subclass QuadCell and QuadPrefixTree. Refactoring can be deferred to a separate issue.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Just a quick comment... I can see this patch assumes Java 8, yet Lucene 5x is on Java 7. Long.BYTES doesn't exist, and there is no "default" remove method on Iterator (for PrefixTreeIterator).

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

See this new patch.

I did some benchmarking – pretty quick & rough right now. I set max levels to 20 (chosen arbitrarily; with 27 it choked on memory given 2GB heap), with distErrPct of 0.0, and indexing random circles up to 3 decimal degrees (a few hundred KM or so), and disabling leafy branch pruning to compare apples to apples.

I ran it with "quad" and "packedQuad" with the same settings otherwise.

I was skeptical there would be index size savings and the benchmark shows there aren't any. Please prove me wrong, Nick! I like the indexing & query speed improvements – not surprised given the nice code here without the ugly recursion that was in legacy Quad.

off to bed now...

asfimport commented 9 years ago

Nick Knize (@nknize) (migrated from JIRA)

Awesome for putting together the benchmark (even a rough one). Can you attach the spatial.alg file you used so I can verify (I'm assuming its a variant on one you've posted before?) I'm getting very different numbers with "production" data sets (e.g., high res peninsulas, islands, global political boundaries, planet osm data - e.g., more exotic shapes than circles larger than a few hundred KM)

A few questions... and other observations.

chosen arbitrarily; with 27 it choked on memory given 2GB heap

What choked specifically? I'm using PackedQuad with depth between 26 and 29. 1GB heap size using the shapes I described above.

disabling leafy branch pruning to compare apples to apples

Out of curiosity, why is this option enabled by default if it uses transient storage that doubles memory consumption? Seems backwards to me.

I was skeptical there would be index size savings and the benchmark shows there aren't any.

IMHO I would avoid these kinds of absolute statements (especially with the highly variable nature of spatial use-cases). In this situation your numbers do not surprise me when disabling that leafyBranchPrune option (which still confuses me why its there?), and using a single shape type with variable size. There is an outstanding issue in the existing patch (I'll see if I can't push out a fix today) - the TermsEnum is returning Terms with BytesRef containing bytes[] that are double the size than they should be (e.g., 16 bytes instead of 8 - all padded w/ zeros). I suspect its some improper configuration in the reader? So for every high res cell (e.g.), the term will be 16 bytes (still better than, say, 27).

I think we can do better on simulated test data in the test framework. I love the randomization and what minimal "real" data sets that are there are great. It does not provide the coverage necessary, though, to best simulate some real world scenarios. That's okay, to steal Mike's quote "progress not perfection". I'll definitely work to provide some more real world tests so we have better coverage and benchmarking options using "real world" data. Its a good way to recommend one indexing structure over another (this one's just the beginning. There are more indexing structures in trial mode, and even more improvements for the packed version)

Let's keep this going... Since this patch is non-destructive I don't see a reason it can't be committed as another option and I can submit enhancement patches to this feature. That would be up to the community.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Can you attach the spatial.alg file you used so I can verify

It's in the patch.

What choked specifically? I'm using PackedQuad with depth between 26 and 29. 1GB heap size using the shapes I described above.

-Xmx2G resulted in an OutOfMemoryError, be it for the legacy quad or this new quad one.

Out of curiosity, why is this option enabled by default if it uses transient storage that doubles memory consumption? Seems backwards to me.

"leafy branch pruning" is enabled by default for RPT, although "StreamingPrefixTreeStrategy" overrides the method that would trigger it. In order to compare another tree (legacy quad in this case) fairly to packed-quad, I disabled leafy branch pruning.

Please don't call what StreamingPrefixTreeStrategy does as leafy branch pruning; it confuses the important distinction I'm trying to make. All SPTs should stop traversing when the relation is within – that's expected/normal.

IMHO I would avoid these kinds of absolute statements (especially with the highly variable nature of spatial use-cases).

I'm sorry. To be more clear, I just don't yet understand how this packed quad encoding is going to allow for distErrPct=0. I'd like to understand; please help me. Knowing what I do know about the SPTs, the results were what I expected – similar disk size to existing quad.

I think we can do better on simulated test data in the test framework.

Yes! I would love to have more realistic shapes to test with.

RE progress not perfection: Yes, Mike uses that quote constantly and I even saw it in your code :-) I have my catch-phrases too. Are you and Mike in a hurry to see this committed? It's very normal, in this open-source project anyway, that there is back & forth & peer-review and changes that are asked of the contributor. Don't worry; something is going to get committed – the query speed is a nice improvement! It's not quite done – that's all.

asfimport commented 9 years ago

Nick Knize (@nknize) (migrated from JIRA)

Please don't call what StreamingPrefixTreeStrategy does as leafy branch pruning; it confuses the important distinction I'm trying to make.

That's fair. pruneLeafyBranches isn't a Geo text book term and I didn't have a common nomenclature within this context for discussion re: the behavior so I used the closest shared term, for discussion. If I change it so the leaves are pruned on intersects then its the same behavior. I'll add that to the patch for benchmarking and it will result in a smaller index. Thinking about it more, I think it's the right way to go and a simple enough change that can be iterated in future improvements.

All SPTs should stop traversing when the relation is within – that's expected/normal.

I want to make sure there is no confusion here (especially for anyone else willing to participate and the sensitivity surrounding my use of the pruneLeafyBranches term). SPT's in the context of lucene-spatial's definition, no? (since SPT is not a common name for a geo data structure its a derivative) There are use-cases for traversing beyond the within state in GeoSpatial data structures. So someone might come along and contribute the "unexpected". Just want to be prepared for that future discussion.

I just don't yet understand how this packed quad encoding is going to allow for distErrPct=0

That helps me better understand the confusion. Your description of the transient memory behavior surrounding pruneLeafyBranches helped me understand why high res shapes w/ distErrPct=0 causes QuadTree to OOM on even reasonably sized shapes (thank you for that clarification). In your benchmark there's a closer performance to the PackedQuad with it disabled (so I still don't understand why enabled is the default). In essence, the rough benchmark you performed doesn't benchmark default Lucene spatial behavior. Is that not a problem?

I would love to have more realistic shapes to test with.

Awesome! I'll work on adding some of the realistic data to the benchmark. A lot of it is quite big so I'll have to add to the build.xml to pull a compressed version from somewhere. What's cool is that It includes a healthy amount of exotic shapes which better reflect complex geospatial use-cases. Again, that's where you see drastic performance improvements with PackedQuad. The Leaves are always 8 bytes regardless of depth (to a max depth of 29) It introduces at most a 7 byte overhead at low precision (which are fewer terms anyway), so just use QuadTree in those cases, but there's a 21 byte savings on leaves (again, exotic shapes). This is how it improves over QuadTree with distErrPct=0. You don't need to reduce resolution for those large shapes.

Are you and Mike in a hurry to see this committed?

Apologies if you perceived a level of urgency on my behalf. Regarding Mike, I was giving credit for his quote because I agree re: non-destructive enhancements such as this. Other than that, I can't speak on his behalf and I don't think its fair to speculate on his level of interest re: this single contribution.

(soapbox) So you know where I stand, I would simply prefer this (and future spatial contributions) not drag out as long as the geo3d contribution has. IMHO, I agree with you 100% that peer review and discussion is healthy and necessary. I would add, though, there's a tipping point where it becomes a blocker and prevents others from participating. And for this package to improve beyond the great work already done, there needs to be more involvement and some level of organic growth (iterative improvements).

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I welcome any suggestions you have pertaining naming (or again on anything). If you can suggest a better name to what "leafy branch pruning" does, then at a minimum it could be expressed in the javadocs. Not long ago it had another name but it was even worse ;-). Naming is hard. Likewise... if SpatialPrefixTree might have a better literature/industry based name then I'd love to know what that is. When I named it as-such I looked around but didn't seem to find anything that was perfect. It's a type of Trie, a spatial trie, and "PrefixTree" is a synonym for Trie, so... there you go. Maybe I didn't look hard enough. I'm sorry if I seemed touchy on the terminology; I merely want to ensure we mutually understood what we were and weren't talking about – that's all. So when you proposed that what StreamingQuadPrefixTree did was leafy branch pruning, and as the coiner of the term I can see it didn't meet my definition; clarification is in order.

I want to make sure there is no confusion here (especially for anyone else willing to participate and the sensitivity surrounding my use of the pruneLeafyBranches term). SPT's in the context of lucene-spatial's definition, no? (since SPT is not a common name for a geo data structure its a derivative)

Yes, in the context of lucene-spatial's definition.

There are use-cases for traversing beyond the within state in GeoSpatial data structures. So someone might come along and contribute the "unexpected". Just want to be prepared for that future discussion.

Interesting; I'm curious what that might be.

A lot of it is quite big so I'll have to add to the build.xml to pull a compressed version from somewhere.

Yes, as you may have observed, this is how all existing test data is handled in the benchmark module: fetched & decompressed on-demand. I added the Geonames point data. And I added facilities to SpatialDocMaker to automatically turn the points into circles & rects of random size.

I'm looking forward to seeing PackedQuadPrefixTree kick ass in being compressed allowing distErrPct=0 – I'm still puzzled but I look forward to being pleasantly astonished. I do understand that it uses all 8 bits (instead of just 2 as the legacy impl) of the first number of bytes for the quadrant info, and that should lead to shorter terms / prefixes for higher-precision data. But I don't see that there would be any change in the number of terms, which is, as I see, the scalability challenge. I have an idea on solving this floating in my head but maybe I needn't ponder it longer if PackedQuadPrefixTree handles it somehow. It's not obvious to me but where in the code of PackedQuadCell are the 5 "depth bits" encoded & decoded?

In essence, the rough benchmark you performed doesn't benchmark default Lucene spatial behavior. Is that not a problem?

It is a problem – it's not ideal, that is. Preferably it would stay enabled but I think you indicated it's not supported by PackedQuadTree? I didn't look closer.

RE Geo3d: As an example for anything with me; it's an outlier, not an example that proves the rule. If I am blind to facts I can't see then show me. Hopefully you noticed that Karl and I are working together and it's come real far now (lots of discussion on ReviewBoard & bugs I helped Karl find via randomized testing). Your patch here has nothing in common with it – PackedQuadTree obviously belongs here, and it should be quite evident that I'm here helping by providing feedback and running a benchmark. And yes, being critical. Anyway... lets get back to work.

The only thing about this patch that is a blocker (-1) for me is StreamingPrefixTreeStrategy. Not most of the code in it, but it's existence as a SpatialStrategy subclass instead of an SPT subclass. I don't mind that it optimizes something that I don't think needed optimization, though I suspect if you noticed TreeCellIterator originally you wouldn't have bothered.

I have other by-line feedback I'd like to give (and would prefer to do so via ReviewBoard/GitHub) but we needn't steamroll this in under the mantra of "progress not perfection".

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It's very normal, in this open-source project anyway, that there is back & forth & peer-review and changes that are asked of the contributor. Don't worry; something is going to get committed – the query speed is a nice improvement! It's not quite done – that's all.

I agree iterating/peer review is healthy, but growing the community is also very important in open-source, and at least for the geo3d issue and now this one it worries me when I see barriers being put up that I feel should not be blockers issues for committing.

Especially when bus factor is essentially one, in an area as important as spatial, I think encouraging contributions / growing community becomes incredibly important. It's like when we humans intervene for an endangered species... after having caused their predicament in the first place ... sigh.

Of course, if there are real technical objection/problems/quality issues for a given patch, those should be addressed before committing.

It's important to show new people we are eager and excited for their contributions, that the bar is not so high for them to have an impact. We can always review/iterate/benchmark after they are committed, as long as net/net the patch is a step forward as (I think?) this one is.

I also wonder whether we need a new, lighter weight spatial module (spatial2? spatia_light?), or maybe spatial_sandbox, where the barrier is lower? The levels of abtractions in the current module look excessive to me and with both the geo3d issue and this issue, "correctly fitting in to the existing abstractions" seemed to be one of the barriers (e.g. your only blocker here ("The only thing about this patch that is a blocker (-1) for me is StreamingPrefixTreeStrategy..") seems to be such an issue). So if we had a more free "sandbox" the barrier is lower by design.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

We can always review/iterate/benchmark after they are committed, as long as net/net the patch is a step forward as (I think?) this one is.

Mike, this project is defacto run as review then commit, not the other way around. This isn't news to you, of course; it's weird to feel I need to remind you – we all know this for good & bad. I have the same expectation of patches I send elsewhere. I'm not looking for perfection, just moving one thing around (a "quality" issue as-is IMO), and the rest is minor. I appreciate the points you make generally, but I don't feel I'm setting the bar too high on this issue. Because of the fuss being made here; it's going to be harder for me to give quality feedback on future patches without fear of inducing needless drama. And that's not good; Lucene improves through solid peer review that I see on many patches (particularly yours, by the way).

It's important to show new people we are eager and excited for their contributions

Let it be known that weeks ago, upon discovery that Elastic had a new spatial SME, I contacted Nick to do a video chat so that I could basically welcome him into the Lucene/Spatial4j spatial area, that I looked forward to seeing what he will contribute, and that I seek'ed his honest impressions / feedback. I simply can't be more sincere, and I hope this demonstrates that I don't want to be the only spatial person around here.

RE abstractions: I respect your opinion although I don't agree that there are too many here. On the Geo3D front, it's extremely close now to being committed, and the only abstraction it is fitting into is a Spatial4j Shape interface (BTW Geo3D internally has a bunch of abstractions and I didn't complain to Karl as to the merits of them). It would be good to have one more Spatial4j abstraction hook but it's not a blocker. On this patch, I'm asking Nick to scale it back one, to just the SpatialPrefixTree (which technically includes "Cell" as that's what SPTs generate). And because of these abstractions, both of them in fact (Shape & SpatialPrefixTree), it's possible to re-use indexing and search predicates (Intersects, Within, Contains) and other code that work with these abstractions, without either implementations needing to know they exist. I think it's very powerful and awesome that these things can be combined / leveraged. Of course I don't believe any of these APIs are perfect; the specifics are debatable and I wish there were more people improving it than just me to help make it even better. Now it looks like there are :-)

Sigh; these conversations are stressfull – I look forward to getting back to the technical matters of the patch and moving it forward. @nknize can you please post a new patch:

My added patch had two little extras: the factory & benchmark module integration. It would be nice if you could add those. If you don't, I will any how, perhaps with a factory test which I didn't yet do. In fact if I sadly never hear from you again, I'll do all that is spoken of here because we'd all love for this awesomeness to get into Lucene spatial. You've done all the hard work; what remains is small.

You certainly don't have to prove to me that the indexes are lean... or rather can be lean under "realistic" circumstances. I remain very curious about that.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

baselined on Lucene trunk (standard practice for contributing to Lucene)

Please don't require this of contributors: it is not standard practice. Make the bar as low as possible!

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

That's fair; you're right that the bar shouldn't be the same for committers vs. contributors.

asfimport commented 9 years ago

Nick Knize (@nknize) (migrated from JIRA)

Updated to remove StreamingPrefixTreeStrategy. PackedQuadTree is now self contained to one file and uses the RecursivePrefixTreeStrategy but ignores leafyBranchPruning.

Still only integrated and tested on branch_5x, per discussion above.

asfimport commented 9 years ago

Nick Knize (@nknize) (migrated from JIRA)

If you can suggest a better name to what "leafy branch pruning" does, then at a minimum it could be expressed in the javadocs

++. Just 'prune' is probably more clear since its universally used all over data structures. We can add a javadoc comment that describes it in further detail if necessary.

if SpatialPrefixTree might have a better literature/industry based name then I'd love to know what that is.

There are trie based spatial trees all over (kd-Trie, kd-b-trie, buddy tree) industry and the literature. The one you call QuadPrefixTree was originally introduced in 1991 called the QuadTrie. (reference: Gaston H. Gonnet and Ricardo Baeza-Yates, Handbook of Algorithms and Data Structures – in Pascal and C, 2nd edition, Addison-Wesley, 1991.) Dr. Hanan Samet from UMD has a great section on MX and PR QuadTrees (same as QuadPrefixTree and a name someone mentioned to you in another issue). He provides a nice discussion on the differences between MX, PR and their point based counterparts (compared by the decomposition methods). There's certainly nothing wrong with an implementation specific name. If you are asking for suggestions then I offer: SpatialTrie, GeoHashTrie, QuadTrie as being shorter, more to the point, and probably more relate-able to other spatial SMEs (whom I'm hoping would be willing to get more involved).

It's not obvious to me but where in the code of PackedQuadCell are the 5 "depth bits" encoded & decoded?

PackedQuadCell.getLevel() decodes, and its encoded in PackedQuadCell.nextTerm()

Preferably it would stay enabled but I think you indicated it's not supported by PackedQuadTree? I didn't look closer.

That's correct.

I also wonder whether we need a new, lighter weight spatial module (spatial2? spatia_light?), or maybe spatial_sandbox, where the barrier is lower?

+++ I think this is a great idea for experimental/research features we don't want cluttering up the spatial module.

RE abstractions: I respect your opinion although I don't agree that there are too many here.

IMHO, this is a slippery slope. There are so many diverse spatial data structures we should be taking a look at for improving spatial search in higher order dimensions (4D space-time for just a start). That's a personal interest area for me, in how the most powerful high dimension structures (that already exist) can fit within the design and implementation of lucene-core (a green field to explore). Something like this does require a sophisticated abstraction framework and this particular one has a bit of a learning curve. I think that can work itself out over time with a bit of refactoring (which it sounds like all are open to?). In the meantime it does set the bar rather high for new contributors. This is another +1 for a spatial sandbox for experimental research (heck make it a separate repo).

Sigh; these conversations are stressfull

They're very verbose, but maybe that's the kick in the pants needed to help the spatial module really take off. That is, after all, the common goal of the community?

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I'll check out your patch tonight, tomorrow at the latest. Karl/Geo3d has kept me busy :-)

RE naming: in both cases it seems the current names actually aren't bad relative to your suggestions. "prune" is a suffix of "pruneLeafyBranches" (current name is more descriptive; in any case one would still need to look at the javadocs to understand), and SpatialTrie is synonymous with SpatialPrefixTree given that Trie and PrefixTree are synonyms. I'm +1 to rename these as you want to 6.x if you think it's worth it. There are back-compat issues with renaming them now. Again, we agree more javadocs (including suggested alternative names) to add clarification now would be great. I'll create a patch and seek your input.

RE sandbox: It's not clear to me what is really needed/useful. If someone comes along with some newfangled index/search spatial approach, it could go in the module and not hook into any existing interface... except a Lucene Query class, and something like a Lucene TokenStream/Field for indexing.

asfimport commented 9 years ago

Nick Knize (@nknize) (migrated from JIRA)

Updated PackedQuadPrefixTree to iinclude optional leafyBranchPruning. Other minor code cleanup also included.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Latest patch:

I did some testing; notably running the fuzzy test with 10k iterations and temporarily set to test just the packed quad.

I think it's ready from my point of view but I'd like to get your input on my changes Nick.

asfimport commented 9 years ago

Nick Knize (@nknize) (migrated from JIRA)

LGTM. I've also attached a patch off trunk as requested. I'll have a look on LOE for deprecating LegacyCell, and LegacyPrefixTree along with some general code cleanup and javadoc clarity.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

My patch last night was also on trunk. Is there anything in your -TRUNK patch of interest or can I ignore it?

asfimport commented 9 years ago

Nick Knize (@nknize) (migrated from JIRA)

Good deal. That was not clear so -TRUNK.patch is redundant and can be ignored. Nothing additional since yesterday's changes so I'm good as is for both trunk and _5x

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

New patch; minor changes. I dressed up the javadocs a bit, including rewriting the setPruneLeafyBranches description, especially on RPT. I moved the class javadocs from Cell to it's containing SPT level since user's generally don't look at SPT details (like the Cell) and just read the class javadocs. I includes notes on setting the prune option. I also included a modicum of Solr test & integration. That's basically it. In the CHANGES.txt I called out that prune should be set on this grid and not on RPT, to help prevent less than ideal use. Precommit is happy.

I plan to commit this tonight, subject to further feedback.

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1675538 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1675538

LUCENE-6422: New spatial PackedQuadPrefixTree. Thanks Nick!

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1675539 from @dsmiley in branch 'dev/branches/branch_5x' https://svn.apache.org/r1675539

LUCENE-6422: New spatial PackedQuadPrefixTree. Thanks Nick!

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

BTW I did a little more benchmarking. For point data there's a nice 10% improvement or so. Smaller index size too. For non-point data I used pruning and a distErrPct of 0.025 (default). The recursive prune of RPT on legacy quad (does more pruning then packed quad) amounted to a 25% smaller index than PackedQuad, and the index & search were practically the same. I can only surmise PackedQuad required less memory while indexing, of course. I suspect that if RPT's recursive prune were to do its thing on PackedQuad, PackedQuad would have a better showing. I'll leave that benchmark to another day. My experience (& benchmarks) have shown that there is a high correlation between index size and search, and that pruning helps a ton. Some day I may write a better recursive prune that doesn't require buffering all tokens.

Thanks again for contributing this Nick!

asfimport commented 9 years ago

Nick Knize (@nknize) (migrated from JIRA)

Good to hear David, and thank you for committing the patch!

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1678008 from @dsmiley in branch 'dev/trunk' https://svn.apache.org/r1678008

Fix CHANGES.txt formatting for LUCENE-6422

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1678009 from @dsmiley in branch 'dev/branches/branch_5x' https://svn.apache.org/r1678009

Fix CHANGES.txt formatting for LUCENE-6422

asfimport commented 9 years ago

Anshum Gupta (@anshumg) (migrated from JIRA)

Bulk close for 5.2.0.