Open riccardodelega opened 1 year ago
I'm observing similar performance issues w/ st_transform
. I have 78k rows @ 7.2mb over 8 files and it's taking ~2 minutes to complete.
Happy to help put together a more solid test case / profile - let me know what's most helpful toward that end!
I have the same problem, makes it unfeasible to use mosaic as it grinds to a halt on even small datasets
I also have this issue. I checked with both JTS and ESRI, but there is no difference. Even that slow I won't be able to do the necessary transformations on my dataset. I want to transform about 750.000 polygons, and takes 55 minutes. GeoPandas processes the same data in a few seconds.
The CPU load is 100% on all 4 available cores. The data is spread over 7 files.
DBR: 12.2 Mosaic: 0.3.12 Conversion from EPSG:28992 to EPSG:4326
Mosiac libraries are loaded through the init script set-up
Revisiting this thread - I am now working w/ a dataset that has 36M rows that all need to be transformed and I can't get it to finish (even w/ a very generously sized cluster and number of workers) and it seems the culprit is st_transform
. If someone more familiar w/ the implementation in the project could advise on the potential root cause here or, a recommended solution, I would be very happy to tackle a PR w/ just a bit of guidance!
Did a bit of digging and I think I've found something that could potentially be the issue. Worth noting that I do not know scala/java very well so my interpretation could be wrong.
The mosaic geometry type appears to have the transformer constructed to do the reprojection for each geometry individually. The reason for this appears that it is to support that each geometry in the column can have its own crs/srid that are not necessarily the same. It doesn't appear that this transformer is cached or memoised in any way so a new transformer is created for each and every row. Again, I do not know java or scala very well, but the proj4j code that is used to do the transformation has the comment in the code of
// compute strategy for transformation at initialization time, to make transformation more efficient
// this may include precomputing sets of parameters
...
// NOTE: this method may be called many times, so needs to be as efficient as possible
which would tell me that it could be doing some expensive initialisation to make the method use cheap, something that mosaic may be completely skipping.
If my interpretation is correct could this be a source of how slow st_transform is?
https://github.com/databrickslabs/mosaic/blob/d17916f6e20b41a1fb4f83c8349a11b1b44c6b8b/src/main/scala/com/databricks/labs/mosaic/expressions/geometry/ST_Transform.scala#L36 https://github.com/databrickslabs/mosaic/blob/d17916f6e20b41a1fb4f83c8349a11b1b44c6b8b/src/main/scala/com/databricks/labs/mosaic/core/geometry/MosaicGeometryJTS.scala#L252 https://github.com/databrickslabs/mosaic/blob/d17916f6e20b41a1fb4f83c8349a11b1b44c6b8b/src/main/scala/com/databricks/labs/mosaic/core/geometry/MosaicGeometry.scala#L140
which then links to proj4j and
I am reading a 1MB snappy parquet file in Databricks within the DataFrame
df
. The following code takes a couple of seconds to terminate:as expected. However, if I try to transform the column to a different EPSG code, it takes two minutes:
Why does it take so long?
For comparison, on the same cluster and using the same Spark Session, sedona
st_transform
takes a couple of seconds as I would expect.