apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0
1.86k stars 654 forks

try 1-N-N performance tuning with LATERAL subquery #1280

Open MyqueWooMiddo opened 5 months ago

MyqueWooMiddo commented 5 months ago

Expected behavior

Reference: https://postgis.net/workshops/postgis-intro/knn.html

https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-lateral-subquery.html

I upgraded Spark to 3.5.1 and tried a LATERAL subquery to compute 1-N-N (1-nearest-neighbour).

I want each point's 1-N-N within the same table, data_points(id, longitude, latitude), using Sedona.

Actual behavior

Spark does not support this kind of LATERAL subquery.

Steps to reproduce the problem

with t_data as (
  select id, st_point(longitude, latitude) as point
  from data_points
  order by 1
  limit 1000
)
select *
from t_data t1,
lateral (
  select t2.id, ST_DistanceSpheroid(t1.point, t2.point) as distance
  from t_data t2
  where t1.id != t2.id
  order by 2
  limit 1
)

Spark throws: "org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED] Unsupported subquery expression: Accessing outer query column is not allowed in this locationProject"

I just want to know how to optimize 1-N-N on a large dataset, rather than filtering on row_number(order by distance) = 1.

Settings

Sedona version = 1.5.1

Apache Spark version = 3.5.1

API type = Scala

Scala version = 2.12

JRE version = 1.8

Environment = Standalone

jiayuasu commented 5 months ago

All NN join or KNN join is not currently supported in Apache Sedona. We will add the support in one or two months.

MyqueWooMiddo commented 5 months ago

> All NN join or KNN join is not currently supported in Apache Sedona. We will add the support in one or two months.

I think the iterative H3 solution in Databricks Mosaic is a good idea.
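
For readers unfamiliar with the idea, here is a minimal sketch of an iterative cell-ring nearest-neighbour search in plain Python. It is not Mosaic's or Sedona's implementation. It assumes a square grid in place of H3 cells and planar Euclidean distance in place of ST_DistanceSpheroid, and the names `nearest_neighbors` and `cell` are hypothetical. Each point looks only at its own cell, then at successively wider rings of cells, and stops once no unvisited cell can hold a closer point.

```python
import math
from collections import defaultdict


def nearest_neighbors(points, cell=1.0):
    """points: dict id -> (x, y). Returns dict id -> (nn_id, distance).

    Assumption: planar coordinates and Euclidean distance; a square grid
    of side `cell` stands in for H3 cells.
    """
    # Bucket every point id into its grid cell.
    grid = defaultdict(list)
    for pid, (x, y) in points.items():
        grid[(math.floor(x / cell), math.floor(y / cell))].append(pid)

    # No search ever needs more rings than the extent of the occupied cells.
    xs = [c[0] for c in grid]
    ys = [c[1] for c in grid]
    max_ring = max(max(xs) - min(xs), max(ys) - min(ys)) if grid else 0

    result = {}
    for pid, (x, y) in points.items():
        cx, cy = math.floor(x / cell), math.floor(y / cell)
        best_id, best_d = None, math.inf
        for r in range(max_ring + 1):
            # Visit only the cells at ring distance exactly r from the home cell.
            for dx in range(-r, r + 1):
                for dy in range(-r, r + 1):
                    if max(abs(dx), abs(dy)) != r:
                        continue
                    for other in grid.get((cx + dx, cy + dy), ()):
                        if other == pid:
                            continue
                        ox, oy = points[other]
                        d = math.hypot(x - ox, y - oy)
                        if d < best_d:
                            best_id, best_d = other, d
            # Every point outside rings 0..r is at least r * cell away,
            # so the current best cannot be beaten by a wider ring.
            if best_d <= r * cell:
                break
        result[pid] = (best_id, best_d)
    return result


if __name__ == "__main__":
    sample = {1: (0.1, 0.1), 2: (0.2, 0.1), 3: (5.0, 5.0), 4: (5.0, 7.5)}
    for pid, (nn_id, dist) in nearest_neighbors(sample).items():
        print(pid, nn_id, round(dist, 3))
```

The cell size is the main tuning knob in this sketch: cells that are too large degrade toward comparing all pairs, while cells that are too small force many empty rings to be scanned before a neighbour is found.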