Closed — firasomrane closed this issue 1 year ago
I am not sure if I understand you correctly.
from shapely.geometry import Point
from geospark.utils.spatial_rdd_parser import GeoData
from geospark.core.SpatialRDD import PointRDD, CircleRDD
from geospark.core.enums import GridType
from geospark.core.spatialOperator import JoinQuery
from geospark.utils.adapter import Adapter
consider_boundary_intersection = False
using_index = False
points_a = [
GeoData(geom=Point(1.1, 2.2), userData="1"),
GeoData(geom=Point(10.0, 11.0), userData="2")
]
points_b = [
GeoData(geom=Point(3.3, 4.4), userData="100"),
GeoData(geom=Point(5.5, 6.6), userData="101"),
GeoData(geom=Point(7.7, 8.8), userData="101")
]
spatial_points_a = self.sc.parallelize(points_a).\
repartition(1)
spatial_points_b = self.sc.parallelize(points_b).\
repartition(1)
point_rdd_a = PointRDD(spatial_points_a)
point_rdd_b = PointRDD(spatial_points_b)
circle_rdd = CircleRDD(point_rdd_a, 10.0)
circle_rdd.analyze()
circle_rdd.spatialPartitioning(GridType.KDBTREE)
point_rdd_b.spatialPartitioning(circle_rdd.getPartitioner())
result = JoinQuery.DistanceJoinQueryFlat(point_rdd_b, circle_rdd, using_index, consider_boundary_intersection)
Adapter.toDf(result, self.spark).show()
This gives me the following result:
+---------------+---+---------------+---+
| _1| _2| _3| _4|
+---------------+---+---------------+---+
|POINT (1.1 2.2)| 1|POINT (3.3 4.4)|100|
|POINT (1.1 2.2)| 1|POINT (5.5 6.6)|101|
|POINT (1.1 2.2)| 1|POINT (7.7 8.8)|101|
| POINT (10 11)| 2|POINT (3.3 4.4)|100|
| POINT (10 11)| 2|POINT (5.5 6.6)|101|
| POINT (10 11)| 2|POINT (7.7 8.8)|101|
+---------------+---+---------------+---+
Can you provide the exact RDDs or DataFrames that reproduce your case?
@Imbruced Thanks for the response and the example. Using the same code you posted gives me exactly the same unexpected behaviour.
from shapely.geometry import Point
from geospark.utils.spatial_rdd_parser import GeoData
from geospark.core.SpatialRDD import PointRDD, CircleRDD
from geospark.core.enums import GridType
from geospark.core.spatialOperator import JoinQuery
from geospark.utils.adapter import Adapter
consider_boundary_intersection = False
using_index = False
points_a = [
GeoData(geom=Point(1.1, 2.2), userData="1"),
GeoData(geom=Point(10.0, 11.0), userData="2")
]
points_b = [
GeoData(geom=Point(3.3, 4.4), userData="100"),
GeoData(geom=Point(5.5, 6.6), userData="101"),
GeoData(geom=Point(7.7, 8.8), userData="101")
]
spatial_points_a = spark_session.sparkContext.parallelize(points_a).\
repartition(1)
spatial_points_b = spark_session.sparkContext.parallelize(points_b).\
repartition(1)
point_rdd_a = PointRDD(spatial_points_a)
point_rdd_b = PointRDD(spatial_points_b)
circle_rdd = CircleRDD(point_rdd_a, 10.0)
circle_rdd.analyze()
circle_rdd.spatialPartitioning(GridType.KDBTREE)
point_rdd_b.spatialPartitioning(circle_rdd.getPartitioner())
result = JoinQuery.DistanceJoinQueryFlat(point_rdd_b, circle_rdd, using_index, consider_boundary_intersection)
Adapter.toDf(result, spark_session).show()
The result is:
+---------------+---+---------------+---+
| _1| _2| _3| _4|
+---------------+---+---------------+---+
|POINT (1.1 2.2)| 1|POINT (3.3 4.4)|100|
|POINT (1.1 2.2)| |POINT (5.5 6.6)|101|
|POINT (1.1 2.2)| |POINT (7.7 8.8)|101|
| POINT (10 11)| 2|POINT (3.3 4.4)| |
| POINT (10 11)| |POINT (5.5 6.6)| |
| POINT (10 11)| |POINT (7.7 8.8)| |
+---------------+---+---------------+---+
This result lacks the userData (ids) in some rows; each id is included only once.
I updated the original comment with the Spark version I use, which is 2.4.4.
I was able to reproduce your issue with GeoSpark 1.3.1, but it no longer exists on the master branch (version 1.3.2), which is not yet published on the PyPI repo. IMHO you can clone the repo and build the package from source. Thank you for finding the bug!
I see now. Is there a way to add the distance to the result as a feature? Also, how can I find new features/bug fixes like this one that are in progress or resolved but have no reported issue? I can't find this one in the version 1.3.2 milestone.
Expected behavior
I want to do a distance join query between two dataframes, so I followed the documentation.
The expected result, after adding the userData and geom of each row and converting it to a DataFrame, is one where all ids are found in the userData.
The schema of the first dataframe:
root
 |-- geom1: geometry (nullable = false)
 |-- id1: string (nullable = true)
The schema of the second dataframe:
root
 |-- geom2: geometry (nullable = false)
 |-- id2: string (nullable = true)
Actual behavior
But when the query is done and the userData is added, the result is:
The geometry pairs in the result are good, since the point with id1=1 is within the given distance of the 3 other points.
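That claim can be checked with a quick calculation, assuming the distance join here compares planar Euclidean distances on these coordinates:

```python
import math

# Point with id1=1 and the three points_b geometries from the example above.
a = (1.1, 2.2)
bs = [(3.3, 4.4), (5.5, 6.6), (7.7, 8.8)]

# Planar Euclidean distance from a to each point in bs.
dists = [math.hypot(a[0] - x, a[1] - y) for x, y in bs]

# All three points fall within the 10.0 radius used for the CircleRDD,
# so three join rows for id1=1 are expected.
within = [d <= 10.0 for d in dists]
```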
The problem is that once an id (userData) is used, it is not included in the userData of the other rows. For example, id1=1 is only available in the first row, and the same goes for id2=100. I solved this by doing a join on the geometry columns to get the expected result, but this adds additional computation that could be avoided.
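The workaround of joining back on the geometry columns can be sketched in plain Python, using WKT strings as stand-ins for the geometry join keys; the helper name and sample rows below are illustrative, not part of the GeoSpark API:

```python
def recover_user_data(join_rows, points_a, points_b):
    """Fill in missing ids by looking up each geometry in the original lists."""
    id_a = {wkt: uid for wkt, uid in points_a}  # geometry WKT -> userData
    id_b = {wkt: uid for wkt, uid in points_b}
    return [
        (g1, id_a[g1], g2, id_b[g2])            # replace blank ids via lookup
        for g1, _, g2, _ in join_rows
    ]

# The original inputs carry the full userData.
points_a = [("POINT (1.1 2.2)", "1"), ("POINT (10 11)", "2")]
points_b = [("POINT (3.3 4.4)", "100"), ("POINT (5.5 6.6)", "101"),
            ("POINT (7.7 8.8)", "101")]

# Join result exhibiting the bug: each id appears only once per geometry.
buggy = [
    ("POINT (1.1 2.2)", "1", "POINT (3.3 4.4)", "100"),
    ("POINT (1.1 2.2)", "",  "POINT (5.5 6.6)", "101"),
    ("POINT (1.1 2.2)", "",  "POINT (7.7 8.8)", "101"),
]

fixed = recover_user_data(buggy, points_a, points_b)
```

In Spark this lookup would be a DataFrame join on the geometry column, which is the extra computation being complained about.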
I also think it would be better to have the distances in the result, since I had to calculate them in an additional step after getting the distance join result.
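That additional step can be sketched as appending a planar Euclidean distance to each result row; the helper and the flattened (x1, y1, id1, x2, y2, id2) row layout here are hypothetical, not GeoSpark output:

```python
import math

def with_distance(rows):
    """Append the planar Euclidean distance between the two geometries
    of each (x1, y1, id1, x2, y2, id2) row as an extra trailing field."""
    return [
        (*row, math.hypot(row[0] - row[3], row[1] - row[4]))
        for row in rows
    ]

# One pair from the example join result above.
rows = [(1.1, 2.2, "1", 3.3, 4.4, "100")]
out = with_distance(rows)
```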
Steps to reproduce the problem
Do a distance join query between two RDDs as in the documentation and try to include the
userData
Settings
GeoSpark version = 1.3.1
Apache Spark version = 2.4.4
JRE version = 1.8
API type = Python