jornfranke closed this issue 11 months ago.
I suspect that I am running into a memory / executor issue: if each of the rows in the pointsDf is joined with the raster, the result can become rather large (e.g. with 5 rows I would need at least 5 times the uncompressed raster in memory). Is my assumption correct?
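For a rough sense of scale, a back-of-envelope estimate along the lines of this assumption could look like the sketch below. The raster dimensions are made-up placeholders, not the actual floodMap_RP100 figures:

```scala
// Hypothetical raster dimensions; substitute the real width/height/band count of your GeoTIFF.
val width = 40000L          // pixels (assumed)
val height = 45000L         // pixels (assumed)
val bands = 1L              // assumed single band
val bytesPerPixel = 4L      // assumed Float32 samples

val uncompressedBytes = width * height * bands * bytesPerPixel   // ~7.2 GB in this example

// A naive join that materializes the full raster for every point row would need roughly
// numPointRows * uncompressedBytes across the executors.
val numPointRows = 5L
val naiveMemoryBytes = numPointRows * uncompressedBytes          // ~36 GB in this example
```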
I could also cut the raster into per-country pieces outside of Sedona, e.g. using QGIS (I can also join on a country field in my table).
I tried to reproduce this problem using the workload you specified. However, the query failed because `RS_Value` cannot handle rasters that large. `RS_Value` tries to fetch all the pixel data on each call, and it will fail on large rasters with the number of pixels exceeding `Integer.MAX_VALUE`. I'll submit a patch to fix this problem.
```
java.lang.IllegalArgumentException: Requested region cannot be represented by a single Raster.
	at javax.media.jai.PlanarImage.getData(PlanarImage.java:2163)
	at javax.media.jai.RenderedOp.getData(RenderedOp.java:2276)
	at javax.media.jai.remote.SerializableRenderedImage$TileServer.run(SerializableRenderedImage.java:662)
	at java.base/java.lang.Thread.run(Thread.java:829)
```
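For context on what a fix in this direction can look like, here is an illustrative sketch against the standard JAI/Java2D image API, not the actual Sedona patch: instead of calling `getData()` on the whole image, the value under a single point can be read through a one-pixel region, which avoids materializing rasters whose total pixel count exceeds `Integer.MAX_VALUE`.

```scala
import java.awt.Rectangle
import java.awt.image.RenderedImage

// Illustrative only: read the sample at pixel (x, y) of `band` without pulling the whole image.
// RenderedImage.getData(Rectangle) returns a Raster covering just the requested region.
def readPixel(image: RenderedImage, x: Int, y: Int, band: Int): Double = {
  val region = image.getData(new Rectangle(x, y, 1, 1))
  region.getSampleDouble(x, y, band)
}
```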
Would you mind providing the output of `pointDf.explain()`, as well as the version of JAI in your environment, so that I can reproduce the performance issue? Nevertheless, Sedona is currently not designed to handle large rasters efficiently, so cutting the large raster into smaller pieces may help get better performance.
Thanks for the reply, I will try to collect the data as soon as I have the chance
Here is the explain output:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [id#14, country_iso#16, vld_frm#15, geometry#35, path_floodmap_EFAS_RP100_C#53, rs_value(raster_floodmap_EFAS_RP100_C#60, geometry#35, 1) AS raster_value_floodmap_EFAS_RP100_C#79]
   +- CartesianProduct
      :- Exchange hashpartitioning(id#14, 1200), REPARTITION_BY_COL, [id=#47]
      :  +- Project [id#14, country_iso#16, vld_frm#15, org.apache.spark.sql.sedona_sql.expressions.ST_Point AS geometry#35]
      :     +- FileScan parquet [id#14,vld_frm#15,country_iso#16,longitude#20,latitude#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[s3a://bucket..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,vld_frm:string,country_iso:string,longitude:string,latitude:string>
      +- Project [path#40 AS path_floodmap_EFAS_RP100_C#53, rs_fromgeotiff(content#43) AS raster_floodmap_EFAS_RP100_C#60]
         +- FileScan binaryFile [path#40,content#43] Batched: false, DataFilters: [], Format: org.apache.spark.sql.execution.datasources.binaryfile.BinaryFileFormat@3e0c20ef, Location: InMemoryFileIndex(1 paths)[s3a://bucket..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<path:string,content:binary>
```
The JAI version is the one that Sedona 1.4.1 depends on.
I found some other optimization techniques as well. They do not solve this problem, but they generally help when working with raster and vector data together, especially if you run into heap errors: reduce the parallelism!
You can find more information about what those settings mean here.
It may sound strange and counterintuitive compared to normal Spark processing, but if one column is a big raster then lower parallelism indeed makes processing faster. The reason is that one column (the raster) is very large, while the others are usually very small.
This has helped me.
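As an illustration of what reducing the parallelism can look like, here is a minimal sketch using standard Spark settings; the values are placeholders to be tuned for your cluster, not recommendations from this thread:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values: the point is to use far fewer, larger tasks than Spark's defaults
// when one column holds a big in-memory raster, so fewer raster copies are alive at once.
val spark = SparkSession.builder()
  .appName("raster-vector-join")
  .config("spark.default.parallelism", "8")        // default is usually the total executor core count
  .config("spark.sql.shuffle.partitions", "8")     // default is 200
  .config("spark.executor.memory", "16g")          // give each of the few tasks more headroom
  .getOrCreate()
```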
I think GDAL is the fastest way to handle raster data. Java can make use of it via JNI.
The performance is fine and just a few tweaks are needed in Sedona. Since Sedona uses Spark, one can achieve very good performance distributed over several nodes in a Spark cluster (especially with recent Java LTS releases >= 17).
I would not prefer JNI: it is very difficult to use in a Spark cluster and it possibly introduces security holes. Pure Java is perfectly fine for most of the use cases.
I see. That addresses the situation.
Hi all, (1) `RS_Values` in Sedona 1.5.0 will perform partial data retrieval from the raster data. (2) Apache Sedona, as an in-memory computation engine, has to load the entire raster dataset into memory and then perform the raster and vector join. Unfortunately, this leads to a huge data shuffle. @wherobots's SedonaDB provides the Havasu storage layer for an out-db raster mode, which perfectly solves this problem. If you are interested, please take a look: https://docs.wherobots.services/1.2.0/references/havasu/raster/out-db-rasters/
Expected behavior
I have a spatial dataset with points which I load from a parquet file. Essentially it has an id, longitude and latitude. Its size does not really matter; even a small one with a few points (e.g. 5) is enough.
I join the spatial dataset with a raster file, a GeoTIFF (https://cidportal.jrc.ec.europa.eu/ftp/jrc-opendata/FLOODS/EuropeanMaps/floodMap_RP100.zip, overview page: https://data.jrc.ec.europa.eu/dataset/1d128b6c-a4ee-4858-9e34-6210707f3c81).
Then, I need to get the value of the raster for each point.
The code is the following:
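The snippet itself is not reproduced in this excerpt; the following is a hedged reconstruction of the described workflow against the Sedona 1.4.x SQL API. All paths, view names and the column casting are assumptions inferred from the explain plan above, and the function registration call may differ depending on how the session is set up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.sedona.sql.utils.SedonaSQLRegistrator

object FloodMapPointLookup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("floodmap-point-lookup").getOrCreate()
    SedonaSQLRegistrator.registerAll(spark)   // register ST_* / RS_* functions (Sedona 1.4.x)

    // Point dataset (id, vld_frm, country_iso, longitude, latitude stored as strings).
    spark.read.parquet("s3a://bucket/points/")                // placeholder path
      .createOrReplaceTempView("points")

    // The large GeoTIFF, read as raw bytes and turned into a raster column.
    spark.read.format("binaryFile")
      .load("s3a://bucket/floodmap/floodMap_RP100.tif")       // placeholder path
      .createOrReplaceTempView("floodmap_files")

    spark.sql(
      """SELECT path AS path_floodmap_EFAS_RP100_C,
        |       RS_FromGeoTiff(content) AS raster_floodmap_EFAS_RP100_C
        |FROM floodmap_files""".stripMargin
    ).createOrReplaceTempView("floodmap")

    // Cross join the (single-row) raster view with the points and read the band-1 value
    // under each point, mirroring the physical plan shown above.
    val result = spark.sql(
      """SELECT p.id, p.country_iso, p.vld_frm,
        |       f.path_floodmap_EFAS_RP100_C,
        |       RS_Value(f.raster_floodmap_EFAS_RP100_C,
        |                ST_Point(CAST(p.longitude AS DOUBLE), CAST(p.latitude AS DOUBLE)),
        |                1) AS raster_value_floodmap_EFAS_RP100_C
        |FROM points p CROSS JOIN floodmap f""".stripMargin
    )

    result.show(5, truncate = false)
  }
}
```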
This should work in reasonable time.
Actual behavior
Even for very few points it takes ages (> 10 min) to get the values, on a very powerful cluster (which is not even remotely fully utilized).
For other, much smaller rasters (< 2 MB) this works reasonably fast, even for millions of points.
Steps to reproduce the problem
See description above
Settings
Sedona version = 1.4.1
Apache Spark version = 3.2.0
Apache Flink version = not used
API type = Scala, Java, Python?
Scala version = 2.12
JRE version = 1.8
Python version = 3.9
Environment = Cloudera CDP