apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0
1.87k stars 658 forks source link

Distance Join Query result using SpatialRDDs is sparse (doesn't contain all userData) and doesn't contain the distance #464

Closed firasomrane closed 1 year ago

firasomrane commented 4 years ago

Expected behavior

Want to do Distance Join Query between two dataframes. So followed the documentation

The expected result after adding userData and geom of each row is (all ids are found in the userData) and converting it to a Dataframe is:

|  geom1                | id1                       | geom2                | id2      |
--------------------------------------------------------------------------------------------
  POINT(1.1 2.2)        | 1                         |  POINT(3.3 4.4)      | 100      |
  POINT(1.1 2.2)        | 1                         |  POINT(5.5 6.6)      | 101      |
  POINT(1.1 2.2)        | 1                         |  POINT(7.7 8.8)      | 101      |
  POINT(10 11)          | 2                         |  POINT(3.3 4.4)      | 100      |

Actual behavior

But now when the query is done and the userData is added the result is:

|  geom1                | id1                       | geom2                | id2      |
--------------------------------------------------------------------------------------------
  POINT(1.1 2.2)        | 1                         |  POINT(3.3 4.4)      | 100      |
  POINT(1.1 2.2)        |                           |  POINT(5.5 6.6)      | 101      |
  POINT(1.1 2.2)        |                           |  POINT(7.7 8.8)      | 101      |
  POINT(10 11)          | 2                         |  POINT(3.3 4.4)      |          |

So the result is good since the point with id1=1 is within the distance from 3 other points.

The problem is that once an id (userData) is used it won't be included for all the other rows in the userData. Like in the id1=1 only available for the first row. The same for id2=100.

I solved this by doing a join on the geometry columns to get the expected result, but this include additional computation that can be avoided.

I think also that it would be better to have the distances in the result since I calculated in an additional step after getting the distance join result.

Steps to reproduce the problem

Do a distance join query between two RDDs like the documentation and tried to include the userData

Settings

GeoSpark version = 1.3.1

Apache Spark version = 2.4.4

JRE version = 1.8

API type = Python

Imbruced commented 4 years ago

I am not sure if I understand you correctly.

consider_boundary_intersection = False
using_index = False

points_a = [
    GeoData(geom=Point(1.1, 2.2), userData="1"),
    GeoData(geom=Point(10.0, 11.0), userData="2")
]
points_b = [
    GeoData(geom=Point(3.3, 4.4), userData="100"),
    GeoData(geom=Point(5.5, 6.6), userData="101"),
    GeoData(geom=Point(7.7, 8.8), userData="101")
]

spatial_points_a = self.sc.parallelize(points_a).\
    repartition(1)

spatial_points_b = self.sc.parallelize(points_b).\
    repartition(1)

point_rdd_a = PointRDD(spatial_points_a)
point_rdd_b = PointRDD(spatial_points_b)

circle_rdd = CircleRDD(point_rdd_a, 10.0)
circle_rdd.analyze()

circle_rdd.spatialPartitioning(GridType.KDBTREE)
point_rdd_b.spatialPartitioning(circle_rdd.getPartitioner())

result = JoinQuery.DistanceJoinQueryFlat(point_rdd_b, circle_rdd, using_index, consider_boundary_intersection)
Adapter.toDf(result, self.spark).show()

Gives me that result

+---------------+---+---------------+---+
|             _1| _2|             _3| _4|
+---------------+---+---------------+---+
|POINT (1.1 2.2)|  1|POINT (3.3 4.4)|100|
|POINT (1.1 2.2)|  1|POINT (5.5 6.6)|101|
|POINT (1.1 2.2)|  1|POINT (7.7 8.8)|101|
|  POINT (10 11)|  2|POINT (3.3 4.4)|100|
|  POINT (10 11)|  2|POINT (5.5 6.6)|101|
|  POINT (10 11)|  2|POINT (7.7 8.8)|101|
+---------------+---+---------------+---+

Can you provide exact RDDs or DataFrames which result with your case ?

firasomrane commented 4 years ago

@Imbruced Thanks for the response and the example. Using the same code you have put give me exactly the same unexpected behaviour.

from geospark.utils.spatial_rdd_parser import GeoData
from geospark.core.SpatialRDD import PointRDD

consider_boundary_intersection = False
using_index = False

points_a = [
    GeoData(geom=Point(1.1, 2.2), userData="1"),
    GeoData(geom=Point(10.0, 11.0), userData="2")
]
points_b = [
    GeoData(geom=Point(3.3, 4.4), userData="100"),
    GeoData(geom=Point(5.5, 6.6), userData="101"),
    GeoData(geom=Point(7.7, 8.8), userData="101")
]

spatial_points_a = spark_session.sparkContext.parallelize(points_a).\
    repartition(1)

spatial_points_b = spark_session.sparkContext.parallelize(points_b).\
    repartition(1)

point_rdd_a = PointRDD(spatial_points_a)
point_rdd_b = PointRDD(spatial_points_b)

circle_rdd = CircleRDD(point_rdd_a, 10.0)
circle_rdd.analyze()

circle_rdd.spatialPartitioning(GridType.KDBTREE)
point_rdd_b.spatialPartitioning(circle_rdd.getPartitioner())

result = JoinQuery.DistanceJoinQueryFlat(point_rdd_b, circle_rdd, using_index, consider_boundary_intersection)
Adapter.toDf(result, spark_session).show()

The result is

+---------------+---+---------------+---+
|             _1| _2|             _3| _4|
+---------------+---+---------------+---+
|POINT (1.1 2.2)|  1|POINT (3.3 4.4)|100|
|POINT (1.1 2.2)|   |POINT (5.5 6.6)|101|
|POINT (1.1 2.2)|   |POINT (7.7 8.8)|101|
|  POINT (10 11)|  2|POINT (3.3 4.4)|   |
|  POINT (10 11)|   |POINT (5.5 6.6)|   |
|  POINT (10 11)|   |POINT (7.7 8.8)|   |
+---------------+---+---------------+---+

Which lacks the userData (ids) in some rows and includes it only once.

I updated the original comment about the spark version I use which is 2.4.4.

Imbruced commented 4 years ago

I was able to reproduce your issue with GeoSpark 1.3.1 but this issue no longer exists on master branch (version 1.3.2) which is not published on PyPi repo. Imho you can clone the repo and build package from source. Thank you for finding the bug !

firasomrane commented 4 years ago

I see now. Is there a way to add the distance to the result as a feature? Else how can I find new features/bug fixes like this one that are in progress or resolved, but without an issue being reported? Since I can't find it in the version 1.3.2 milestone.