azavea / osmesa

OSMesa is an OpenStreetMap processing stack based on GeoTrellis and Apache Spark
Apache License 2.0
80 stars 26 forks source link

Support for MultiPolygon and Route relations #63

Closed mojodna closed 6 years ago

mojodna commented 6 years ago

This uses JTS CoordinateSequences for performance, including a VirtualCoordinateSequence that prevents re-allocation while line segments are being built up (this needs a bit of cleanup / extraction out of osmesa.common.functions.osm).

After flirting with UDAFs to remove a shuffle relative to a UDF + collection aggregation, I discovered that mapPartitions (after repartitioning to ensure that all geometries for a version of a relation are together) was much more efficient than buffering + merging lists of WKB byte arrays (this was producing lots of GC activity). I think Spark's rowIterator is lazy, so avoiding the groupBy in their and flatMaping instead (feeding another lazy iterator consumed by shuffle output handling) might help. There's a commented-out version in MultiPolygonRelationReconstructionSpec.

Some types are downsampled (BigDecimalfloat (instead of double) for coordinates, LongInt for version numbers) in order to reduce intermediate representation sizes where we know they'll fit in the target type without precision loss.