InitialDLab / Simba

Spatial In-Memory Big data Analytics
Apache License 2.0
121 stars 62 forks source link

polygon and point join [new feature] #75

Open merlintang opened 7 years ago

merlintang commented 7 years ago

Magellan project support the join between polygons and points, and join relationship can be inside, intersect or other. Users are interest in this feature, since this polygon can represent one county, and points can be the ubers, the join results can show which county is more visited and etc.

The current Simba does not support this, does any plan for this function? or any proposal or issue a JERA to track this feature?

dongx-psu commented 7 years ago

I think it is a crucial feature, and I try to take some time think about it.

The core problem in this context is actually how to partition a group of polygons. If there is no assumption on the data set, polygons can be very general and easily overflow the heap memory of an executor. Advanced load balancing strategy need to be designed.

Besides, current polygon utilities are relied on JTS 1.14, which probably too heavy to deal with something as simple as contains and intersect predicates.

merlintang commented 7 years ago

consider the USA county example, polygons are not heavy skew distributions and without strange polygons cover whole place , the load balance would not be that bad.

we can rewrite the contains and intersect functions as magellan.

dongx-psu commented 7 years ago

One thing I think we can do for this feature is implementing a predicate called intersects, and then do Cartesian product + filtering. It is not fast but at least it will work. What do you think?

merlintang commented 7 years ago

yes, we can do it via this way, but the cartesian product is too bad in the spark as magellan.

one thing come to mind is following. suppose we have two tables (1) outer table is the polygon (2) inner table is the points. when we do this join, (a) partition the space based on the inner table (b) use the partitioner of the step 1 to duplicate the outer table based on overlapping (c) zip operation and nest loop. (d) reduce step to filter the duplicate results. this is very basic, could be better than the cartesian.

ricosfeifei commented 7 years ago

Spatial join is a high priority on my ToDo list. once we are done with the trajectory feature that we are currently working on, we will start working on Spatial join (of arbitrary geometry shapes).

On Tue, Dec 20, 2016 at 12:01 PM, Mingjie Tang notifications@github.com wrote:

yes, we can do it via this way, but the cartesian product is too bad in the spark as magellan.

one thing come to mind is following. suppose we have two tables (1) outer table is the polygon (2) inner table is the points. when we do this join, (a) partition the space based on the inner table (b) use the partitioner of the step 1 to duplicate the outer table based on overlapping (c) zip operation and nest loop. (d) reduce step to filter the duplicate results. this is very basic, could be better than the cartesian.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/InitialDLab/Simba/issues/75#issuecomment-268327701, or mute the thread https://github.com/notifications/unsubscribe-auth/AEX3p4tKitGXGqCOj27C59vB2GlQwVzBks5rKCX_gaJpZM4LHMt9 .

merlintang commented 7 years ago

Great!

I also can come to this task later, because I have other emergency things to do it right now. Once I have done, I will ping you, then we can allocate the workload evenly.

On Tue, Dec 20, 2016 at 11:36 PM, Feifei Li notifications@github.com wrote:

Spatial join is a high priority on my ToDo list. once we are done with the trajectory feature that we are currently working on, we will start working on Spatial join (of arbitrary geometry shapes).

On Tue, Dec 20, 2016 at 12:01 PM, Mingjie Tang notifications@github.com wrote:

yes, we can do it via this way, but the cartesian product is too bad in the spark as magellan.

one thing come to mind is following. suppose we have two tables (1) outer table is the polygon (2) inner table is the points. when we do this join, (a) partition the space based on the inner table (b) use the partitioner of the step 1 to duplicate the outer table based on overlapping (c) zip operation and nest loop. (d) reduce step to filter the duplicate results. this is very basic, could be better than the cartesian.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/InitialDLab/Simba/issues/75#issuecomment-268327701, or mute the thread https://github.com/notifications/unsubscribe-auth/ AEX3p4tKitGXGqCOj27C59vB2GlQwVzBks5rKCX_gaJpZM4LHMt9 .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/InitialDLab/Simba/issues/75#issuecomment-268456521, or mute the thread https://github.com/notifications/unsubscribe-auth/ABXY-TSlQLt0vIgDKcDyUp_BoYpuyEacks5rKNcYgaJpZM4LHMt9 .

ricosfeifei commented 7 years ago

That sounds great! We expect to start working on this in early Spring.

On Wed, Dec 21, 2016 at 12:54 AM, Mingjie Tang notifications@github.com wrote:

Great!

I also can come to this task later, because I have other emergency things to do it right now. Once I have done, I will ping you, then we can allocate the workload evenly.

On Tue, Dec 20, 2016 at 11:36 PM, Feifei Li notifications@github.com wrote:

Spatial join is a high priority on my ToDo list. once we are done with the trajectory feature that we are currently working on, we will start working on Spatial join (of arbitrary geometry shapes).

On Tue, Dec 20, 2016 at 12:01 PM, Mingjie Tang <notifications@github.com

wrote:

yes, we can do it via this way, but the cartesian product is too bad in the spark as magellan.

one thing come to mind is following. suppose we have two tables (1) outer table is the polygon (2) inner table is the points. when we do this join, (a) partition the space based on the inner table (b) use the partitioner of the step 1 to duplicate the outer table based on overlapping (c) zip operation and nest loop. (d) reduce step to filter the duplicate results. this is very basic, could be better than the cartesian.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <https://github.com/InitialDLab/Simba/issues/75#issuecomment-268327701 , or mute the thread https://github.com/notifications/unsubscribe-auth/ AEX3p4tKitGXGqCOj27C59vB2GlQwVzBks5rKCX_gaJpZM4LHMt9 .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/InitialDLab/Simba/issues/75#issuecomment-268456521, or mute the thread https://github.com/notifications/unsubscribe- auth/ABXY-TSlQLt0vIgDKcDyUp_BoYpuyEacks5rKNcYgaJpZM4LHMt9 .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/InitialDLab/Simba/issues/75#issuecomment-268458995, or mute the thread https://github.com/notifications/unsubscribe-auth/AEX3pwaCEdLdkyp3R6q6T4KN7ICS8b5Rks5rKNswgaJpZM4LHMt9 .

mithdrann commented 6 years ago

Hi, did you start working on this feature?. I am working on a comparative study of spatial data management tools based on the spatial join operation between points and polygons (within, contains, ...) and I would like to add Simba to this work.

dongx-psu commented 6 years ago

Sorry for the late response. We did not implement it yet in the system. We can implement a simple version use the same algorithms (which is most likely SJMR) adopted by other major systems. I believe the main difference on performance then will be on detailed system design choices.

mithdrann commented 6 years ago

Thank you for your answer. I agree with you, system design choices can make the difference. Even implementation choices can do it. It would be nice to have this simple version but I only have 3 weeks to finish this work. I wonder whether you could implement it before this deadline.