Esri / spatial-framework-for-hadoop

The Spatial Framework for Hadoop allows developers and data scientists to use the Hadoop data processing system for spatial data analysis.
Apache License 2.0

Spatial Index for improving performance #120

Open seamusdu opened 7 years ago

seamusdu commented 7 years ago

I am using this spatial framework through HiveContext in Spark, and it works. However, performance declines dramatically once I use a large dataset. My task is counting points within polygons. Have you run any performance tests that would characterize how this framework behaves at scale? Also, have you considered adding a spatial index, which might improve the performance of spatial operations?

Thanks.

randallwhitman commented 7 years ago

Ideas and cross-references:
https://github.com/Esri/spatial-framework-for-hadoop/issues/28
http://stackoverflow.com/questions/38963487/how-to-optimize-scan-of-1-huge-file-table-in-hive-to-confirm-check-if-lat-long/
http://gis.stackexchange.com/questions/178732/geospatial-queries-and-indexes-in-memory/
http://thunderheadxpler.blogspot.com/2013/10/bigdata-spatial-joins.html
http://getindata.com/blog/post/geospatial-analytics-on-hadoop/
https://cwiki.apache.org/confluence/display/Hive/Spatial+queries
https://github.com/Esri/spatial-framework-for-hadoop/issues/82

randallwhitman commented 7 years ago

If your polygon dataset can fit into memory, build an in-memory quadtree index on the polygons using the Geometry API, by adapting the MapReduce sample in GIS-Tools-for-Hadoop for Spark.
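
To illustrate the idea of an in-memory quadtree over the polygons, here is a minimal plain-Python sketch. It is not the Geometry API's quadtree and not the GIS-Tools-for-Hadoop sample. Every name in it (`QuadTree`, `count_points_in_polygons`, and so on) is hypothetical, and it assumes simple single-ring polygons without holes. The quadtree stores polygon bounding boxes, each point looks up candidate polygons by bounding box, and only those candidates get the exact ray-casting containment test.

```python
def bbox(ring):
    """Bounding box (xmin, ymin, xmax, ymax) of a list of (x, y) vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def point_in_polygon(x, y, ring):
    """Exact test by ray casting; assumes one ring, no holes."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


class QuadTree:
    """Quadtree of bounding boxes; a box lives in the deepest node that fully contains it."""

    def __init__(self, extent, depth=0, max_depth=8, capacity=4):
        self.extent = extent
        self.depth = depth
        self.max_depth = max_depth
        self.capacity = capacity
        self.items = []       # (box, item_id) pairs held at this node
        self.children = None  # four sub-quadrants once the node has split

    def insert(self, box, item_id):
        if self.children is not None:
            child = self._child_containing(box)
            if child is not None:
                child.insert(box, item_id)
            else:
                self.items.append((box, item_id))  # straddles quadrants
            return
        self.items.append((box, item_id))
        if len(self.items) > self.capacity and self.depth < self.max_depth:
            self._split()

    def _split(self):
        xmin, ymin, xmax, ymax = self.extent
        xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
        quads = [(xmin, ymin, xm, ym), (xm, ymin, xmax, ym),
                 (xmin, ym, xm, ymax), (xm, ym, xmax, ymax)]
        self.children = [QuadTree(q, self.depth + 1, self.max_depth, self.capacity)
                         for q in quads]
        old, self.items = self.items, []
        for box, item_id in old:
            self.insert(box, item_id)  # push down whatever fits in one child

    def _child_containing(self, box):
        for child in self.children:
            cx0, cy0, cx1, cy1 = child.extent
            if cx0 <= box[0] and cy0 <= box[1] and box[2] <= cx1 and box[3] <= cy1:
                return child
        return None

    def query_point(self, x, y):
        """Yield ids of items whose bounding box contains (x, y)."""
        for (x0, y0, x1, y1), item_id in self.items:
            if x0 <= x <= x1 and y0 <= y <= y1:
                yield item_id
        if self.children is not None:
            for child in self.children:
                cx0, cy0, cx1, cy1 = child.extent
                if cx0 <= x <= cx1 and cy0 <= y <= cy1:
                    yield from child.query_point(x, y)


def count_points_in_polygons(polygons, points):
    """polygons: dict of id -> ring; points: iterable of (x, y). Returns id -> count."""
    boxes = {pid: bbox(ring) for pid, ring in polygons.items()}
    extent = (min(b[0] for b in boxes.values()), min(b[1] for b in boxes.values()),
              max(b[2] for b in boxes.values()), max(b[3] for b in boxes.values()))
    index = QuadTree(extent)
    for pid, box in boxes.items():
        index.insert(box, pid)
    counts = {pid: 0 for pid in polygons}
    for x, y in points:
        for pid in index.query_point(x, y):            # cheap bounding-box filter
            if point_in_polygon(x, y, polygons[pid]):  # exact test on candidates only
                counts[pid] += 1
    return counts


polygons = {"a": [(0, 0), (4, 0), (4, 4), (0, 4)], "b": [(10, 10), (14, 10), (12, 14)]}
points = [(1, 1), (2, 3), (12, 11), (20, 20), (5, 5)]
print(count_points_in_polygons(polygons, points))  # {'a': 2, 'b': 1}
```

In a Spark job, one plausible arrangement is to build the index once on the driver and share it with the executors so that each point is tested only against nearby polygons. That wiring is an assumption here and is left out of the sketch.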

seamusdu commented 7 years ago

Hi @randallwhitman

Thanks for your reply. The sample that uses a quadtree index does help, and I will try to use the Geometry API with Spark.

stevebuckingham commented 7 years ago

@seamusdu How did you find running the Spatial Framework on Spark in the end? It is an option I'm looking at at the moment.

randallwhitman commented 7 years ago

Cross-reference re Spark: #97 (works with JsonSerde as of v1.2)

harryprince commented 5 years ago

@seamusdu I am doing the same thing, and I have wrapped an indexed spatial join query in the geospark R package.

guillemfrancisco commented 4 years ago

Has anyone run a benchmark that relates the number of points to the time it took to process them? Or even a comparison between Hive and MapReduce (with spatial indexing)?

randallwhitman commented 4 years ago

@guillemfrancisco There is a little bit of info in a comment under https://stackoverflow.com/questions/38963487/how-to-optimize-scan-of-1-huge-file-table-in-hive-to-confirm-check-if-lat-long