Open-EO / openeo-geotrellis-extensions

Java/Scala extensions for Geotrellis, for use with OpenEO GeoPySpark backend.

apply_neighborhood/regrid: rdd size/partition inflation #191

Closed: jdries closed this issue 1 year ago

jdries commented 1 year ago

It seems that regridding from a large tile size (e.g. 1024 or 512) to a small one (e.g. 64) results in unexpectedly large RDD/partition sizes.

My theory is that the 'crop' method used in Regrid (https://github.com/locationtech/geotrellis/blob/d65d6a22eb70efd96caa5c6f5f660b2b936b2763/spark/src/main/scala/geotrellis/spark/regrid/Regrid.scala#L122) is a lazy crop, which keeps a reference to the original array instead of copying the smaller chunk of data out of the larger one. So when Spark serializes the RDD, it also copies over all of the larger arrays backing the cropped tiles, inflating the data a lot.
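To illustrate the suspected mechanism, here is a minimal plain-Java sketch. It does not use Geotrellis or Spark: `CropView`, `materialize` and `serializedSize` are hypothetical names made up for this example, and Java serialization stands in for whatever serializer the job is actually configured with. It only shows that a serialized view drags its full backing array along, while a copied chunk does not.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class LazyCropDemo {

    /** Hypothetical stand-in for a lazily cropped tile: a window onto a larger array. */
    static final class CropView implements Serializable {
        private static final long serialVersionUID = 1L;

        final int[] source;     // full backing array; serialized together with the view
        final int sourceCols;   // row width of the backing array
        final int colMin, rowMin, cols, rows;

        CropView(int[] source, int sourceCols, int colMin, int rowMin, int cols, int rows) {
            this.source = source;
            this.sourceCols = sourceCols;
            this.colMin = colMin;
            this.rowMin = rowMin;
            this.cols = cols;
            this.rows = rows;
        }

        /** Reads a cell through the window without copying anything. */
        int get(int col, int row) {
            return source[(rowMin + row) * sourceCols + (colMin + col)];
        }

        /** Copies only the window's cells into a fresh, compact array. */
        int[] materialize() {
            int[] out = new int[cols * rows];
            for (int r = 0; r < rows; r++) {
                System.arraycopy(source, (rowMin + r) * sourceCols + colMin, out, r * cols, cols);
            }
            return out;
        }
    }

    /** Number of bytes Java serialization produces for the given value. */
    static int serializedSize(Serializable value) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
            out.flush();
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int big = 512, small = 64;           // tile sizes taken from the example above
        int[] source = new int[big * big];   // one large source tile
        CropView lazy = new CropView(source, big, 0, 0, small, small);

        System.out.println("lazy view:    " + serializedSize(lazy) + " bytes");
        System.out.println("copied chunk: " + serializedSize(lazy.materialize()) + " bytes");
    }
}
```

If the theory holds, every 64x64 tile cut out of a 512x512 source would carry the whole source array, so the serialized data would grow by roughly the number of small tiles per large tile (64 in this example). Copying the chunk out before the data is serialized avoids that.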


jdries commented 1 year ago

Committed a fix for this one; I would still like to open a PR in geotrellis.

jdries commented 1 year ago

Closing this one; a PR has been opened!