databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/
Other
280 stars 67 forks source link

`st_transform` duplicates the last point when transforming a polygon #457

Closed danielsparing closed 10 months ago

danielsparing commented 1 year ago

Describe the bug A valid polygon already has its last point equal to its first point. However, when running st_transform on a polygon, the last two points are identical, with the very last point being a repetition of the previous.

To Reproduce Steps to reproduce the behavior: Note that in the below example the oringinal wkt polygon has 4 vertices, so the WKT includes 5 points with the last being equal to the first; while the output transformed_geom has 6 points, with the first equal to the last two.

df = (
  spark.createDataFrame([{'wkt': 'POLYGON ((10 40, 40 30, 20 20, 30 10, 10 40))'}])
)
df = df.withColumn('transformed_geom',mos.st_astext(mos.st_transform(mos.st_setsrid(mos.st_geomfromwkt('wkt'), F.lit(4326)),F.lit(3857))))
df.show(1, False)
# +---------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |wkt                                          |transformed_geom                                                                                                                                                                                                                             |
# +---------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |POLYGON ((10 40, 40 30, 20 20, 30 10, 10 40))|POLYGON ((1113194.9079327357 4865942.279503176, 4452779.631730943 3503549.843504374, 2226389.8158654715 2273030.926987689, 3339584.723798207 1118889.9748579597, 1113194.9079327357 4865942.279503176, 1113194.9079327357 4865942.279503176))|
# +---------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Expected behavior Not adding a duplicate point.

Additional context

sllynn commented 1 year ago

Hi Daniel. Which geometry API are you using, JTS or ESRI?

danielsparing commented 1 year ago

Yeah good one, this only happens with JTS, not with ESRI.

sllynn commented 1 year ago

OK, it's this line. Needs a check for equality of the first and last points before deciding whether to append the first point to the end of the collection.

I'll open a PR shortly (unless you fancy giving it a go?)

danielsparing commented 1 year ago

I'll leave it to you, I don't understand Scala enough :)

However I was wondering, can't we completely leave out a check and just assume that no appending needed?

Because if the input comes e.g. from createLinearRing, then it already expects "a closed and simple linestring".

Finally, maybe this and this line is also related?

danielsparing commented 10 months ago

I would argue that we should never append, as only closed linestrings ("LinearRings") should be permissible as polygon shell or holes, and in the same spirit only point sequences where the last point equals the first. In other cases it is justified to fail, as it already fails:

(
    spark.createDataFrame([{"wkt": "POLYGON((0 0, 0 1, 1 0))"}])
    .withColumn("geom", mos.st_geomfromwkt("wkt"))
    .display()
)
# returns "IllegalArgumentException: Points of LinearRing do not form a closed linestring"

Also, already the Mosaic doc calls out that st_makepolygon requires closed linestrings

EDIT: ok fine, this breaks some raster tests, so despite the above arguments I am looking into the conditional version.