This uses JTS CoordinateSequences for performance, including a VirtualCoordinateSequence that prevents re-allocation while line segments are being built up (this needs a bit of cleanup / extraction out of osmesa.common.functions.osm).
After flirting with UDAFs to remove a shuffle relative to a UDF + collection aggregation, I discovered that mapPartitions (after repartitioning to ensure that all geometries for a version of a relation are together) was much more efficient than buffering + merging lists of WKB byte arrays (this was producing lots of GC activity). I think Spark's rowIterator is lazy, so avoiding the groupBy in their and flatMaping instead (feeding another lazy iterator consumed by shuffle output handling) might help. There's a commented-out version in MultiPolygonRelationReconstructionSpec.
Some types are downsampled (BigDecimal → float (instead of double) for coordinates, Long → Int for version numbers) in order to reduce intermediate representation sizes where we know they'll fit in the target type without precision loss.
This uses JTS
CoordinateSequence
s for performance, including aVirtualCoordinateSequence
that prevents re-allocation while line segments are being built up (this needs a bit of cleanup / extraction out ofosmesa.common.functions.osm
).After flirting with UDAFs to remove a shuffle relative to a UDF + collection aggregation, I discovered that
mapPartitions
(after repartitioning to ensure that all geometries for a version of a relation are together) was much more efficient than buffering + merging lists of WKB byte arrays (this was producing lots of GC activity). I think Spark'srowIterator
is lazy, so avoiding thegroupBy
in their andflatMap
ing instead (feeding another lazy iterator consumed by shuffle output handling) might help. There's a commented-out version inMultiPolygonRelationReconstructionSpec
.Some types are downsampled (
BigDecimal
→float
(instead ofdouble
) for coordinates,Long
→Int
for version numbers) in order to reduce intermediate representation sizes where we know they'll fit in the target type without precision loss.