apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0

Getting values for points in a given raster takes a very long time #926

Closed: jornfranke closed this issue 8 months ago

jornfranke commented 11 months ago

Expected behavior

I have a spatial dataset with points which I load from a Parquet file. Essentially it has an id, a longitude, and a latitude. The size does not really matter; even a small dataset with a few points (e.g. 5) shows the problem.

I join the spatial dataset with a GeoTIFF raster file (https://cidportal.jrc.ec.europa.eu/ftp/jrc-opendata/FLOODS/EuropeanMaps/floodMap_RP100.zip, overview page: https://data.jrc.ec.europa.eu/dataset/1d128b6c-a4ee-4858-9e34-6210707f3c81).

Then, I need to get the value of the raster for each point.

The code is the following

from pyspark.sql.functions import expr

df = spark.read.parquet("path/to/dataset/with/longitude_latitude")
df.createOrReplaceTempView("pointDF")
df = spark.sql("SELECT id, ST_Point(CAST(longitude AS Decimal(24,20)), CAST(latitude AS Decimal(24,20))) AS geometry FROM pointDF") \
    .withColumnRenamed("geometry", "geometry_points")
pointDf = df.repartition("id")
rasterDf = spark.read.format("binaryFile").load("path/to/raster/floodmap_EFAS_RP100_C.tif") \
    .withColumn("raster", expr("RS_FromGeoTiff(content)"))
pointDf = pointDf.join(rasterDf) \
    .withColumn("raster_value", expr("RS_Value(raster, geometry_points)")) \
    .drop("raster", "content")
pointDf.show(2)

This should work in reasonable time.

Actual behavior

Even for very few points it takes ages (> 10 min) to get the values, on a very powerful cluster (although the cluster is not even remotely fully utilized).

For other rasters (much smaller, < 2 MB) this works reasonably fast - even for millions of points.

Steps to reproduce the problem

See description above

Settings

Sedona version = 1.4.1

Apache Spark version = 3.2.0

Apache Flink version = not used

API type = Python

Scala version = 2.12

JRE version = 1.8

Python version = 3.9

Environment = Cloudera CDP

jornfranke commented 11 months ago

I suspect that I am running into a memory/executor issue, because if each of the rows in the pointDf is joined with the raster, the result can become rather large (e.g. with 5 rows I would need at least 5 times the uncompressed raster in memory) - is my assumption correct?
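
For scale, an illustrative back-of-the-envelope calculation (the raster dimensions below are hypothetical placeholders, not the actual flood map's):

# After a plain cross join, every point row carries its own copy of the raster,
# so the memory footprint grows linearly with the number of points.
width, height, bands, bytes_per_px = 60000, 40000, 1, 4  # hypothetical GeoTIFF
raster_bytes = width * height * bands * bytes_per_px     # ~9.6 GB uncompressed
points = 5
print(f"~{points * raster_bytes / 1e9:.0f} GB")          # ~48 GB for just 5 points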

I could also cut the raster outside of Sedona, e.g. using QGIS, into one piece per country (I can also join on a country field in my table), for example along the lines sketched below.
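
A hedged sketch of that per-country approach (the directory layout, the ISO-code file naming, and the extraction regex are assumptions for illustration; the country_iso column on the point side is the country field mentioned above):

from pyspark.sql.functions import col, expr, regexp_extract

# Hypothetical: rasters pre-cut per country (e.g. with QGIS) and named by
# two-letter ISO code; pointDf carries a matching country_iso column.
rasterDf = spark.read.format("binaryFile").load("path/to/rasters/by_country/") \
    .withColumn("country_iso", regexp_extract(col("path"), r"([A-Z]{2})\.tif$", 1)) \
    .withColumn("raster", expr("RS_FromGeoTiff(content)"))
pointDf = pointDf.join(rasterDf, on="country_iso") \
    .withColumn("raster_value", expr("RS_Value(raster, geometry_points)")) \
    .drop("raster", "content")

The equi-join on country_iso replaces the Cartesian product, so each point row only ever meets one small raster instead of the full European map.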

Kontinuation commented 11 months ago

I tried to reproduce this problem using the workload you specified. However, the query failed because RS_Value cannot handle rasters that large: RS_Value tries to fetch all the pixel data on each call, and it fails on large rasters whose number of pixels exceeds Integer.MAX_VALUE. I'll submit a patch to fix this problem.

java.lang.IllegalArgumentException: Requested region cannot be represented by a single Raster.
    at javax.media.jai.PlanarImage.getData(PlanarImage.java:2163)
    at javax.media.jai.RenderedOp.getData(RenderedOp.java:2276)
    at javax.media.jai.remote.SerializableRenderedImage$TileServer.run(SerializableRenderedImage.java:662)
    at java.base/java.lang.Thread.run(Thread.java:829)
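
A quick sanity check of where that exception comes from (again with hypothetical dimensions):

# A single java.awt.image.Raster can address at most Integer.MAX_VALUE pixels,
# so a Europe-wide high-resolution GeoTIFF can easily cross the limit.
width, height = 60000, 40000            # hypothetical raster dimensions
print(width * height)                   # 2400000000
print(width * height > 2**31 - 1)       # True: exceeds Integer.MAX_VALUE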

Would you mind providing me with the output of pointDf.explain(), as well as the version of JAI in your environment, so that I can reproduce the performance issue? Nevertheless, Sedona is currently not designed to handle large rasters efficiently, so cutting the large raster into smaller pieces may help get better performance.

jornfranke commented 11 months ago

Thanks for the reply, I will try to collect the data as soon as I have the chance.

jornfranke commented 11 months ago

Here is the explain

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [id#14, country_iso#16, vld_frm#15, geometry#35, path_floodmap_EFAS_RP100_C#53, rs_value(raster_floodmap_EFAS_RP100_C#60, geometry#35, 1) AS raster_value_floodmap_EFAS_RP100_C#79]
   +- CartesianProduct
      :- Exchange hashpartitioning(id#14, 1200), REPARTITION_BY_COL, [id=#47]
      :     +- Project [id#14, country_iso#16, vld_frm#15, org.apache.spark.sql.sedona_sql.expressions.ST_Point AS geometry#35]
      :     +- FileScan parquet [id#14,vld_frm#15,country_iso#16,longitude#20,latitude#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[s3a://bucket..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,vld_frm:string,country_iso:string,longitude:string,latitude:string>
      +- Project [path#40 AS path_floodmap_EFAS_RP100_C#53, rs_fromgeotiff(content#43) AS raster_floodmap_EFAS_RP100_C#60]
         +- FileScan binaryFile [path#40,content#43] Batched: false, DataFilters: [], Format: org.apache.spark.sql.execution.datasources.binaryfile.BinaryFileFormat@3e0c20ef, Location: InMemoryFileIndex(1 paths)[s3a://bucket..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<path:string,content:binary>

The JAI version is the one that Sedona 1.4.1 depends on.

jornfranke commented 11 months ago

I found some other optimization techniques as well. They do not solve this particular problem, but they help generally when working with raster and vector data together - especially if you run into heap errors: reduce the parallelism!

More information about what those settings mean can be found in the Spark configuration documentation.

It may sound strange and counterintuitive compared to normal Spark processing, but if one column holds a big raster, processing indeed becomes faster this way. The reason is that this one column is very large (the raster), while the others are usually very small.

This has helped me.
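
A hedged example of the kind of settings meant here (the concrete values are placeholders; the right numbers depend on the raster size and the executor memory):

from pyspark.sql import SparkSession

# Fewer concurrent tasks per executor and fewer partitions mean fewer copies
# of the big raster held in memory at the same time.
spark = SparkSession.builder \
    .appName("sedona-raster-join") \
    .config("spark.executor.cores", "2") \
    .config("spark.default.parallelism", "16") \
    .config("spark.sql.shuffle.partitions", "16") \
    .getOrCreate()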

MyqueWooMiddo commented 10 months ago

I think GDAL is the fastest way to handle raster data. Java can make use of JNI.

jornfranke commented 10 months ago

The performance is fine; only a few tweaks are needed in Sedona. Since Sedona uses Spark, one can achieve very good performance distributed over several nodes in a Spark cluster (especially with recent Java LTS releases >= 17).

I would not prefer JNI - it is very difficult to use in a Spark cluster and possibly introduces security holes. Pure Java is perfectly fine for most of the use cases.

MyqueWooMiddo commented 10 months ago

> The performance is fine; only a few tweaks are needed in Sedona. Since Sedona uses Spark, one can achieve very good performance distributed over several nodes in a Spark cluster (especially with recent Java LTS releases >= 17).
>
> I would not prefer JNI - it is very difficult to use in a Spark cluster and possibly introduces security holes. Pure Java is perfectly fine for most of the use cases.

I see. That is satisfactory.

jiayuasu commented 8 months ago

Hi all, (1) RS_Values in Sedona 1.5.0 will perform partial data retrieval from the raster data. (2) Apache Sedona as an in-memory computation engine has to load the entire raster dataset in memory and then perform the raster and vector join. This will lead to huge data shuffle unfortunately. @wherobots 's SedonaDB provides the Havasu storage layer for out-db raster mode which perfectly solves this problem. If you are interested, please take a look: https://docs.wherobots.services/1.2.0/references/havasu/raster/out-db-rasters/