databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/
Other
269 stars 66 forks source link

st_transform takes an extremely long time to finish #425

Open riccardodelega opened 1 year ago

riccardodelega commented 1 year ago

I am reading a 1MB snappy parquet file in Databricks within the DataFrame df. The following code takes a couple of seconds to terminate:

df.withColumn("geometry_27000", st_setsrid(st_geomfromwkt("geometry"), F.lit(27700)))

as expected. However, if I try to transform the column to a different EPSG code, it takes two minutes:

df.withColumn("geometry_27000", st_setsrid(st_geomfromwkt("geometry"), F.lit(27700))).withColumn("geometry_4326", st_transform("geometry_27000", F.lit(4326)))

Why does it take so long?

For comparison, on the same cluster and using the same Spark Session, sedona st_transform takes a couple of seconds as I would expect.

kyleries commented 1 year ago

I'm observing similar performance issues w/ st_transform. I have 78k rows @ 7.2mb over 8 files and it's taking ~2 minutes to complete.

Happy to help put together a more solid test case / profile - let me know what's most helpful toward that end!

oliverbeagley-pgg commented 1 year ago

I have the same problem, makes it unfeasible to use mosaic as it grinds to a halt on even small datasets

remibaar commented 9 months ago

I also have this issue. I checked with both JTS and ESRI, but there is no difference. Even that slow I won't be able to do the necessary transformations on my dataset. I want to transform about 750.000 polygons, and takes 55 minutes. GeoPandas processes the same data in a few seconds.

The CPU load is 100% on all 4 available cores. The data is spread over 7 files.

DBR: 12.2 Mosaic: 0.3.12 Conversion from EPSG:28992 to EPSG:4326

Mosiac libraries are loaded through the init script set-up

kyleries commented 9 months ago

Revisiting this thread - I am now working w/ a dataset that has 36M rows that all need to be transformed and I can't get it to finish (even w/ a very generously sized cluster and number of workers) and it seems the culprit is st_transform. If someone more familiar w/ the implementation in the project could advise on the potential root cause here or, a recommended solution, I would be very happy to tackle a PR w/ just a bit of guidance!

oliverbeagley-pgg commented 4 months ago

Did a bit of digging and I think I've found something that could potentially be the issue. Worth noting that I do not know scala/java very well so my interpretation could be wrong.

The mosaic geometry type appears to have the transformer constructed to do the reprojection for each geometry individually. The reason for this appears that it is to support that each geometry in the column can have its own crs/srid that are not necessarily the same. It doesn't appear that this transformer is cached or memoised in any way so a new transformer is created for each and every row. Again, I do not know java or scala very well, but the proj4j code that is used to do the transformation has the comment in the code of

// compute strategy for transformation at initialization time, to make transformation more efficient
// this may include precomputing sets of parameters
...
// NOTE: this method may be called many times, so needs to be as efficient as possible

which would tell me that it could be doing some expensive initialisation to make the method use cheap, something that mosaic may be completely skipping.

If my interpretation is correct could this be a source of how slow st_transform is?


https://github.com/databrickslabs/mosaic/blob/d17916f6e20b41a1fb4f83c8349a11b1b44c6b8b/src/main/scala/com/databricks/labs/mosaic/expressions/geometry/ST_Transform.scala#L36 https://github.com/databrickslabs/mosaic/blob/d17916f6e20b41a1fb4f83c8349a11b1b44c6b8b/src/main/scala/com/databricks/labs/mosaic/core/geometry/MosaicGeometryJTS.scala#L252 https://github.com/databrickslabs/mosaic/blob/d17916f6e20b41a1fb4f83c8349a11b1b44c6b8b/src/main/scala/com/databricks/labs/mosaic/core/geometry/MosaicGeometry.scala#L140

which then links to proj4j and

https://github.com/locationtech/proj4j/blob/59c2f665d0b7e85b05c94e9e22620da7c31618f7/core/src/main/java/org/locationtech/proj4j/BasicCoordinateTransform.java#L69