Esri / spatial-framework-for-hadoop

The Spatial Framework for Hadoop allows developers and data scientists to use the Hadoop data processing system for spatial data analysis.
Apache License 2.0

ST_Transform() function unavailable #186

Open hjort opened 2 years ago

hjort commented 2 years ago

I noticed that the ST_Transform() function is still unavailable in the package. It is a very useful and widely used function in GIS applications.

How hard is it to implement ST_Transform()?

Regards, Hjort

randallwhitman commented 2 years ago

Supporting only projection from WGS84 to Web Mercator, and de-projection from Web Mercator back to WGS84, would probably be a moderate effort. Supporting transformations and projections among dozens of datums and hundreds of projections would be a very large effort.
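
For illustration, the WGS84/Web Mercator pair mentioned above reduces to the standard spherical Mercator formulas. The following is a minimal sketch, not code from this repository; the class and method names are hypothetical, and it assumes longitude/latitude in degrees and Web Mercator coordinates in meters.

```java
/** Sketch: spherical Web Mercator (EPSG:3857) <-> WGS84 lon/lat (EPSG:4326). */
final class WebMercatorSketch {
    // Sphere radius used by Web Mercator: the WGS84 semi-major axis, in meters.
    static final double R = 6378137.0;
    // Latitude at which the projected map becomes a square; inputs beyond it are rejected.
    static final double MAX_LAT = 85.0511287798066;

    /** Projects lon/lat (degrees) to x/y (meters). */
    static double[] forward(double lonDeg, double latDeg) {
        if (Math.abs(latDeg) > MAX_LAT) {
            throw new IllegalArgumentException("Latitude out of Web Mercator range: " + latDeg);
        }
        double x = R * Math.toRadians(lonDeg);
        double y = R * Math.log(Math.tan(Math.PI / 4.0 + Math.toRadians(latDeg) / 2.0));
        return new double[] {x, y};
    }

    /** De-projects x/y (meters) back to lon/lat (degrees). */
    static double[] inverse(double x, double y) {
        double lonDeg = Math.toDegrees(x / R);
        double latDeg = Math.toDegrees(2.0 * Math.atan(Math.exp(y / R)) - Math.PI / 2.0);
        return new double[] {lonDeg, latDeg};
    }
}
```

Calling `WebMercatorSketch.inverse` on the result of `WebMercatorSketch.forward` should return the original longitude and latitude up to floating-point error. No datum shift is involved here, which is part of why this single pair is comparatively cheap to support.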

hjort commented 2 years ago

Sorry about the naïve question, but based on the approach PostGIS employs (a "spatial_ref_sys" table containing the available reference systems along with their parameters), wouldn't it be possible to keep those SRS definitions in a set of .properties or .json files and have a single piece of Java code implement an ST_Transform() function?
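
To make the idea concrete, here is a minimal sketch of the data side only: a .properties-style text keyed by SRID, parsed by one generic Java class. The key scheme `srid.<code>`, the class name, and the use of a PROJ-style parameter string as the value are all assumptions made for illustration; nothing like this exists in the framework today.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Sketch: SRS definitions kept as data, looked up by SRID. */
final class SrsRegistrySketch {
    private final Properties defs = new Properties();

    /** Loads definitions from .properties-formatted text (hypothetical key: srid.<code>). */
    SrsRegistrySketch(String propertiesText) {
        try {
            defs.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Splits a definition such as "+proj=longlat +datum=WGS84" into name/value pairs. */
    Map<String, String> parametersFor(int srid) {
        String def = defs.getProperty("srid." + srid);
        if (def == null) {
            throw new IllegalArgumentException("Unknown SRID: " + srid);
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (String token : def.trim().split("\\s+")) {
            String t = token.startsWith("+") ? token.substring(1) : token;
            int eq = t.indexOf('=');
            // Flags without a value (for example "no_defs") are stored with an empty string.
            if (eq < 0) {
                params.put(t, "");
            } else {
                params.put(t.substring(0, eq), t.substring(eq + 1));
            }
        }
        return params;
    }
}
```

For example, `new SrsRegistrySketch("srid.4326=+proj=longlat +datum=WGS84 +no_defs\n").parametersFor(4326)` yields a map with `proj` set to `longlat`. The hard part, the math that consumes these parameters, is not shown.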

randallwhitman commented 2 years ago

Yes, some separation of code/logic from data/transform-definitions can be done. Handling the transformation data would need careful design. Would it be stored at a location from which all processing nodes can download/side-load the data by URL? Would it be in JAR files that can be automatically distributed to processing nodes by the Hadoop MapReduce/Spark framework? Something else?

hjort commented 2 years ago

Yeah, it really seems that it isn't a trivial task. :(

PostGIS comes with a spatial_ref_sys table [1] populated with the 3,000+ most common SRSs [2]. Under the hood, however, it uses PROJ, which is written in C, to execute these coordinate transformations [3,4]. Apparently there was an initiative to port it to Java, Proj4J, but its repo has been untouched for at least 7 years [5].

1: https://postgis.net/docs/manual-1.4/ch04.html#spatial_ref_sys
2: https://epsg.org/home.html
3: https://proj.org/
4: https://github.com/OSGeo/PROJ
5: https://github.com/Proj4J/proj4j

clumzzey commented 2 years ago

But since st_setsrid and st_srid are available, it would be good to have the transformation available too. It is a function I use very often, and if I need a workaround for it (taking the data to Python, for example), I could just as easily start doing all my spatial work there.

randallwhitman commented 2 years ago

The Spatial-framework-for-Hadoop is open-source and contributions are welcome.

You might also take a look at the links at https://github.com/Esri/spatial-framework-for-hadoop#see-also

clumzzey commented 2 years ago

You make a fair point. My Java skills are not good enough to do this, so maybe I should have just been grateful for the things you do bring to the table instead of complaining about what is missing. In my experience (n=1), about 90% of transformations are from or to WGS84 (4326). Would the solution become less complex if, at first, only transformations from and to 4326 were supported (st_transformto4326 and st_transformfrom4326)? If a separate properties file is created for each SRID, as hjort suggests, I would be happy to help create the property files for the ones I need. Transformations between two projections other than 4326 would then be a two-step process, with the risk of rounding errors.
As for your earlier question, I would prefer the option "in JAR files that can be automatically distributed to processing nodes by Hadoop-Map-Reduce/Spark framework" since opening up a cluster to the internet might sometimes be a hassle.
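
The two-step idea above (source SRID to 4326, then 4326 to target SRID) can be sketched as a small registry of per-SRID converters. This is only an outline under stated assumptions: `Wgs84Bridge`, `HubTransformSketch`, and their methods are hypothetical names, and each registered bridge is assumed to handle everything needed to reach 4326, including any datum shift.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-SRID converter to and from WGS84 lon/lat (4326). */
interface Wgs84Bridge {
    double[] to4326(double x, double y);
    double[] from4326(double lon, double lat);
}

/** Sketch: transform between any two registered SRIDs by going through 4326. */
final class HubTransformSketch {
    private static final int HUB_SRID = 4326;
    private final Map<Integer, Wgs84Bridge> bridges = new HashMap<>();

    void register(int srid, Wgs84Bridge bridge) {
        bridges.put(srid, bridge);
    }

    double[] transform(double x, double y, int fromSrid, int toSrid) {
        if (fromSrid == toSrid) {
            return new double[] {x, y}; // nothing to do
        }
        // Step 1: bring the point into 4326 unless it is already there.
        double[] lonLat = (fromSrid == HUB_SRID)
                ? new double[] {x, y}
                : lookup(fromSrid).to4326(x, y);
        // Step 2: leave 4326 unless it is the requested target.
        return (toSrid == HUB_SRID)
                ? lonLat
                : lookup(toSrid).from4326(lonLat[0], lonLat[1]);
    }

    private Wgs84Bridge lookup(int srid) {
        Wgs84Bridge bridge = bridges.get(srid);
        if (bridge == null) {
            throw new IllegalArgumentException("No bridge registered for SRID " + srid);
        }
        return bridge;
    }
}
```

With this shape, supporting N reference systems needs N bridges rather than one converter per pair, at the cost of the extra intermediate step noted above.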

Sorry for my n00bness if this is really a silly remark.

randallwhitman commented 2 years ago

The initial design discussed here separates the properties data from the code implementation. As for what the code would need to do, there are two types of operations:

  1. Projection (and de-projection) between a geographic coordinate system (GCS) and projected coordinate systems (PCS) that use the same datum; and
  2. Transformation between different datums.

It would be less effort to implement only one of those than to implement both. Implementation of either should be preceded by a complete design specification.
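
As a rough illustration of how the two operation types combine, the sketch below decides which steps a given source/target pair would need. It is not a design proposal: the `Crs` description (a datum name plus an optional projection name, with `null` meaning a geographic coordinate system) and all other names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Sketch: which of the operation types does a source/target pair require? */
final class TransformPlanSketch {
    enum Step { DEPROJECT, DATUM_TRANSFORM, PROJECT }

    /** Hypothetical minimal CRS description; projection == null means geographic (GCS). */
    static final class Crs {
        final String datum;
        final String projection;

        Crs(String datum, String projection) {
            this.datum = datum;
            this.projection = projection;
        }
    }

    static List<Step> plan(Crs source, Crs target) {
        List<Step> steps = new ArrayList<>();
        boolean sameDatum = source.datum.equals(target.datum);
        if (sameDatum && Objects.equals(source.projection, target.projection)) {
            return steps; // identical systems: no work
        }
        if (source.projection != null) {
            steps.add(Step.DEPROJECT);       // PCS -> GCS on the source datum
        }
        if (!sameDatum) {
            steps.add(Step.DATUM_TRANSFORM); // GCS -> GCS across datums
        }
        if (target.projection != null) {
            steps.add(Step.PROJECT);         // GCS -> PCS on the target datum
        }
        return steps;
    }
}
```

A plan that never contains `DATUM_TRANSFORM` corresponds to implementing only the first operation type; any pair with differing datums needs the second as well.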