duckdb / duckdb_spatial

MIT License
492 stars 41 forks source link

Feature: Spherical calculations #16

Open Maxxen opened 1 year ago

Maxxen commented 1 year ago

While I'm hesitant to implement a full blown GEOGRAPHY type, I do think that having a couple of spatial functions using geodetic algorithms operating on a sphere would be a good complexity/usefulness compromise.

blackrez commented 1 year ago

I don't if it is help, but actually I used the rust port of https://geographiclib.sourceforge.io/C++/doc/index.html for geodesic calculation.

bfgdigital commented 1 year ago

I'm currently trying to replicate the following

ST_DistanceSphere(
        ST_MakePoint(lon, lat),
        ST_MakePoint(last_lon, last_lat)
 ) AS distance_m,

with

ST_Distance(
        ST_Point(lon, lat),
        ST_Point(last_lon, last_lat)
) AS distance_m,

but not being a spatial guy, I'm unsure of the implications of this. From what I've discerned I would require these Spherical calculations to provide the 6370986 meter radius effect on the measurements between the two points.

this sounds like a pretty fundamental requirement of geo data so I'm adding this here in hope that it isn't required, but perhaps does require spherical calculations.

Maxxen commented 1 year ago

To clarify, all the algorithms implemented as of now assume a planar coordinate system. This makes a lot of algorithms much easier and faster to implement and works well when you are processing geospatial data that lies within a small-ish region, say e.g. on the scale of a country or two. Once you reach a scale where the curvature of the earth becomes important, then yeah, measurements like distances and areas become extremely inexact if implemented with straight forward cartesian geometry . The PostGIS documentation has a great example of this.

Note that in general we don't really know the exact shape of the earth, so there is really three different ways to approximate geospatial calculations with progressively increasing difficulty, performance overhead and precision.

  1. Project into a planar coordinate space. This is what we currently achieve with GEOS and PROJ.
  2. Model the earth as a perfect sphere. This is what e.g. Googles S2 library does (and I think BigQuery?).
  3. Model the earth as an ellipsoid. This is what the algorithms in GeographicLib does.

Like PostGIS it would be cool if we could support all three to some degree, but a lot of the more advanced algorithms (e.g. polygon intersection) are already extremely complicated in the planar case, so I'd like to take this in small steps.

That said, I have something in the works using Geographiclib for the elementary algorithms (distance, length, area). But I think we should implement and focus on 2. in the future for a potential GEOGRAPHY type.

Maxxen commented 1 year ago

Partially implemented in #74

keltia commented 9 months ago

Guys guys... #74 was merged in May 2023 but the documentation was never updated so I just discovered that they are in the current extension!!!

D select ST_Distance_spheroid(ST_Point(49.60, 6.12), ST_Point(49.60, 6.14));
┌────────────────────────────────────────────────────────────────────┐
│ st_distance_spheroid(st_point(49.60, 6.12), st_point(49.60, 6.14)) │
│                               double                               │
├────────────────────────────────────────────────────────────────────┤
│                                                  1445.776772642289 │
└────────────────────────────────────────────────────────────────────┘
sonjoonho commented 8 months ago

Hi – it's exciting to see that this extension is still being actively developed! Almost a year on since this roadmap item was opened, I was wondering if the calculation of geographic/spheroid distances between polygons and points was on the horizon? Otherwise it looks like falling back to PostGIS is my best bet (but open to other suggestions).

Line segment <-> point and the additional combinations that follow require an iterative algorithm that I've left out for now.

I would also be curious to know what this algorithm involves. Thanks!

Maxxen commented 8 months ago

Time flies! Ive recently completed a pretty major overhaul of the internal geometry representation used during execution so im planning on going back and adding the remaining missing measurement functions, including geodetic/haversine based ops, in the near future.

Im currently working on refactoring the docs and will try to shape up and share more about the roadmap for my plans for 2024 as well.

jtbaker commented 4 months ago

DuckDB version == v1.0.0

Thanks for your contributions on this! I think I've identified a possible issue where the x and y coordinates are getting transposed for each other in one of the checks. If you go past -90 (min latitude but not min longitude) on the x coordinate (longitudinal) plane, the calculations seem to stop working

SELECT ST_AREA_SPHEROID(ST_ENVELOPE(ST_BUFFER(ST_POINT(-89.999, 30), .0001)));

st_area_spheroid(st_envelope(st_buffer(st_point(-89.999, 30), .0001))) │
│                                 double                                 │
├────────────────────────────────────────────────────────────────────────┤
│                                                   0.008709549903869629 

^ this works as expected, while incrementing up to from -89.999 to -90 for the longitudinal coordinate returns a nan with everything else the same

SELECT ST_AREA_SPHEROID(ST_ENVELOPE(ST_BUFFER(ST_POINT(-90, 30), .0001)));

 st_area_spheroid(st_envelope(st_buffer(st_point(-90, 30), .0001))) │
│                               double                               │
├────────────────────────────────────────────────────────────────────┤
│                                                                nan 
mtravis commented 4 months ago

@Maxxen I'm also seeing the error above.

overturemaps download -f geoparquet --bbox -71.063432,42.301047,-71.009501,42.382221 -o boston.parquet -t building

D SELECT
  id,
  ST_AREA(ST_GeomFromWKB(geometry)) as area,
  ST_Area_Spheroid(ST_geomFromWKB(geometry)) AS cartesian_area
  FROM 'boston.parquet' limit 5;
┌──────────────────────────────────┬────────────────────────┬───────────────────┐
│                id                │          area          │  cartesian_area   │
│             varchar              │         double         │      double       │
├──────────────────────────────────┼────────────────────────┼───────────────────┤
│ 08b2a3066bd41fff0200fc6ab276c90d │  1.285837500012519e-08 │  51.9845110103488 │
│ 08b2a3066bd4cfff0200bc057338d4ed │ 1.2229789998652358e-08 │ 49.44350370578468 │
│ 08b2a3066bd41fff020028311e9c371c │ 1.5778415000042524e-08 │  63.7901326790452 │
│ 08b2a3066bd4cfff02008dcf88943bb9 │ 1.1989005000108085e-08 │ 48.47045111026091 │
│ 08b2a3066bd4cfff020002be80094552 │ 2.0510755000441452e-08 │ 82.92415748536587 │

overturemaps download -f geoparquet --bbox -122.365604,47.573155,-122.307926,47.623162 -o seattle.parquet -t building

D SELECT
  id,
  ST_AREA(ST_GeomFromWKB(geometry)) as area,
  ST_Area_Spheroid(ST_geomFromWKB(geometry)) AS cartesian_area
  FROM 'seattle.parquet' limit 5;
┌──────────────────────────────────┬────────────────────────┬────────────────┐
│                id                │          area          │ cartesian_area │
│             varchar              │         double         │     double     │
├──────────────────────────────────┼────────────────────────┼────────────────┤
│ 08b28d555096bfff0200ba6e244e17a4 │   8.85314630001753e-07 │            nan │
│ 08b28d5550948fff0200b8ce7ac64e3b │ 1.0191719999341135e-08 │            nan │
│ 08b28d555095efff02004bd14aeeb6be │  9.111243699996493e-07 │            nan │
│ 08b28d5550958fff0200eb9c47a75fc9 │  2.090709999588343e-09 │            nan │
│ 08b28d5550855fff02007c80ff8ad9f0 │  2.003438000002535e-08 │            nan │
└──────────────────────────────────┴────────────────────────┴────────────────┘
kentstephen commented 4 months ago

I just want to weigh in here that I believe this is the root of an issue I am having with Overture Maps, specifically with:

ST_Transform(ST_GeomFromWKB(geometry), 'EPSG:4326', 'EPSG:3857') as reprojected_geometry

I then calculate the ST_Area(reprojected_geometry)

When I run that on the whole world I get NULL values for a good chunk of it:

image

I am a fairly novice, self taught developer - I don't understand the specific details of this kind of thing, but it's my understanding that ST_Transform uses spherical calculations.

Maxxen commented 2 months ago

Hi! There's a couple of things at play here.

ST_Transform defaults to respecting the axis order defined by the CRS authority. EPSG:4326 is defined as lat/lon, while 'EPSG:3857' is defined as lon/lat. Meaning that when transforming 'EPSG:4326 to EPSG:3857 it is expected that the input is already in lat/lon axis order.

You can pass a boolean as an optional third argument to force st_transform to always expect [lon/lat] axis order regardless of what projection you use. You can also use st_flipcoordinates to adjust the axis order manually. The new docs are still a bit rough around the edges but I've attempted to document this better here: . https://github.com/duckdb/duckdb_spatial/blob/main/docs/functions.md#st_transform

Additionally, while most functions in the spatial extension is ambivalent to the axis order or projection (as they almost all work in planar space) the _spheroid functions expect their input to be in lat/lon EPSG:4326 (WGS84).

This is a common cause of confusion, see e.g: https://github.com/duckdb/duckdb_spatial/issues/222, https://github.com/duckdb/duckdb_spatial/issues/219 https://github.com/duckdb/duckdb_spatial/issues/215 https://github.com/duckdb/duckdb_spatial/issues/179 https://github.com/duckdb/duckdb_spatial/issues/63

I've elaborated more on this both on discord and in the issues reported above. But in short:

Most GIS tools tend to always work with a fixed axis order. Most FOSS GIS tools prefer lon/lat (x/y) while others (especially in the GPS/radio-space) prefer lat/lon. Around the time DuckDB spatial was created it seemed like to me that the big libraries in the open source GIS world such as GDAL and PROJ had recently made an attempt to actually start respecting the CRS authority instead of always expecting lon/lat. See e.g. GDAL axis order post 3.0 and PROJ's FAQ. Therefore the approach taken in DuckDB spatial was to simply stay ambivalent to the axis order used. As we don't store SRID's in each geometry anyway it made sense to me that the user would ultimately be responsible for knowing what coordinate system and axis order they worked within.

Unfortunately it seems like the idea of respecting the axis order defined by the CRS authority never really caught on in practice. Modern specs such as GeoJSON uses geographic coordinates but mandates lon/lat anyway, and even brand new specs such as GeoParquet require lon/lat always.

Judging by the number of issues opened by users confused by this, I am definitely ready to change this behavior so that we also normalize all coordinates into easting/northing (e.g. lon/lat) when ingested by duckdb internally. However, this will require a bit of work as we then would need some way to track the projection information for each geometry column whenever we import/export data from duckdb spatial so that we can flip the coordinates back from/to what the CRS mandates when data enters or leaves spatial.

I have some ideas on how to do this but it is going to be part of a larger overhaul I've got planned for how we handle projection information in spatial, so I don't have a timeline for when this behavior will change.

prusswan commented 2 months ago

ST_Transform defaults to respecting the axis order defined by the CRS authority. EPSG:4326 is defined as lat/lon, while 'EPSG:3857' is defined as lon/lat. Meaning that when transforming 'EPSG:4326 to EPSG:3857 it is expected that the input is already in lat/lon axis order.

To further complicate matters (and mentioned further down the post), it is also "expected" that GeoJSON input sources will be in both EPSG:4326 and X-Y (or Lon-Lat) order, and current behavior and expectation in duckdb means always_xy must always be used for 4326 -> 3857 conversion of GeoJSON data when imported into DuckDB.

As we don't store SRID's in each geometry anyway it made sense to me that the user would ultimately be responsible for knowing what coordinate system and axis order they worked within. However, this will require a bit of work as we then would need some way to track the projection information for each geometry column whenever we import/export data from duckdb spatial so that we can flip the coordinates back from/to what the CRS mandates when data enters or leaves spatial.

As a user, I think it is okay to not track the projection/SRID, provided that DuckDB does not change the axis order without explicit user intent/awareness. By that I mean, the output of any function should follow the axis order of the input (unless the user chooses otherwise).

Maxxen commented 2 months ago

means always_xy must always be used for 4326 -> 3857 conversion of GeoJSON data when imported into DuckDB.

Not necessarily, you could transform OGC:CRS84 -> EPSG:3857 instead in which case you won't need it, as OGC:CRS84 is the same as WGS84 but with a lon/lat axis order.

provided that DuckDB does not change the axis order without explicit user intent/awareness. By that I mean, the output of any function should follow the axis order of the input (unless the user chooses otherwise).

Thats how it currently works. The only function that can change the axis order is ST_Transform (and I guess st_flipcoordinates), and it does so by trusting that the user supplies (and expects) geometry in the axis order defined by the given source and target CRS.

prusswan commented 2 months ago

Not necessarily, you could transform OGC:CRS84 -> EPSG:3857 instead in which case you won't need it, as OGC:CRS84 is the same as WGS84 but with a lon/lat axis order.

Refer to the screenshot: https://github.com/duckdb/duckdb_spatial/issues/211#issuecomment-2316275955. St_Xmax ends up picking up the latitude values (which is problematic for me, although I probably made a wrong flip in between to avoid the infinity values).

The only function that can change the axis order is ST_Transform (and I guess st_flipcoordinates), and it does so by trusting that the user supplies (and expects) geometry in the axis order defined by the given source and target CRS.

I guess this is the part where there is a conflict in expectations - I don't expect DuckDB to be having any expectations regarding CRS, when DuckDB is not keeping track of CRS (or "CRS-blind"). Of course, this is not a statement on which of these expectations are more justified, just a sharing of user sentiment.

jtbaker commented 2 months ago

Thanks for the explanation @Maxxen!

Additionally, while most functions in the spatial extension is ambivalent to the axis order or projection (as they almost all work in planar space) the _spheroid functions expect their input to be in lat/lon EPSG:4326 (WGS84).

This is the biggest point of confusion for me. If the EPSG:4326 spec mandates long/lat, why would the spheroid logic require flipping the coordinates to be the opposite of what they are expected to be in? Feels like a very small minority of users, if any would ever get the usage of this right.

If I'm following the PostGIS convention, which I remember reading the spatial extension is modeled after, if I just wanted to inline the selection of a point via ST_POINT, the only way to get the "correct" result for a query of the area of a buffer of it would be with an invocation of ST_FLIPCOORDINATES which feels very unergonomic.

SELECT
    ST_AREA_SPHEROID(
        ST_BUFFER(
            ST_FLIPCOORDINATES(ST_POINT(-90,30)),
        .00001
        )
    )
Maxxen commented 2 months ago

I don't expect DuckDB to be having any expectations regarding CRS, when DuckDB is not keeping track of CRS (or "CRS-blind").

And it generally doesn't. You have to supply both a "source" and a "target" SRID when transforming a geometry, and DuckDB will treat the geometry as having the axis order as defined by the source CRS. That is, if you transform ST_Transfrom(st_point(1,2), 'EPSG:4326', 'EPSG:3857') the input is expected to be in lat/lon and the output will be in lon/lat. If you instead do ST_Transform(st_point(1,2), 'OGC:CRS84', 'EPSG:3857') the input is expected to be in lon/lat and the output will still be lon/lat.

The only exception to this rule is the _spheroid functions, as the axis order is actually significant for the result and the algorithm is hardcoded for the WGS84 datum. Here we chose again to go with respecting the axis order as defined by the authority, that is __lat/lon__. It would be even more inconsistent if we went with the crs-mandated axis order when projecting, but not in the _spheroid functions, or vice versa.

This is the biggest point of confusion for me. If the EPSG:4326 spec mandates long/lat, why would the spheroid logic require flipping the coordinates to be the opposite of what they are expected to be in?

@jtbaker you still got it the wrong way around, EPSG:4326 mandates lat/lon. PostGIS convention is to not respect the CRS authority and always assumes all geometries to be in lon/lat. But they also "track" the CRS of each geometry value so that you don't have to supply a source SRID when transforming.


Just to reiterate, Im fully on board with changing the current behavior because, for better or worse, a lot of users are obviously not used to or do not expect to consider that different coordinate systems are standardized to have different axis-orders. The convention for GIS tools to normalize geometries so that x corresponds to the "easting" and y "northing" is very strong - and that's not necessarily a bad thing. We are not completely alone in this though, pyproj also has a always_xy parameter.

And chances are that more systems will default to respecting the axis order defined by the spec in the future, that is, if PROJ manages to "[...] spearhead this effort, hopefully setting a good example for the rest of the geospatial industry."

Im just trying to contextualize why the current behavior is what it is, why it is not a bug, and why I think it makes the most sense considering our approach of being "blind" to the CRS in general (i.e. not implicit tracking of geometry columns, no normalization/coordinate flipping on import/export).

prusswan commented 2 months ago

And it generally doesn't. You have to supply both a "source" and a "target" SRID when transforming a geometry, and DuckDB will treat the geometry as having the axis order as defined by the source CRS. ST_Transfrom(st_point(1,2), 'EPSG:4326', 'EPSG:3857') the input is expected to be in lat/lon and the output will be in lon/lat. If you instead do ST_Transform(st_point(1,2), 'OGC:CRS84', 'EPSG:3857') the input is expected to be in lon/lat and the output will still be lon/lat.

In the case of GeoJSON inputs/data, the expected axis order is lon/lat. In practice, whether it is specified as EPSG:4326 or OGC:CRS84 makes no material difference, even if the GeoJSON file did not specify any CRS (for example: https://raw.githubusercontent.com/nshiab/simple-data-analysis/main/test/geodata/files/polygons.geojson), most use cases just need the lon-lat values to be converted to X-Y values in some kind of projected CRS (e.g. EPSG:3857) for the purpose of plotting. For DuckDB to be truly "CRS-blind", I would prefer/expect ST_Transform to work like ST_Transform(St_Point(X,Y), 'source_projection_code', 'target_projection_code') -> ST_Point(X,Y), without caring if it is is 4326, 3857 or some user-specified CRS. So to me, always_xy is necessary.

hnagaty commented 1 week ago

@Maxxen Thanks for this explanation. It cleared the confusion that I had. May I propose that you add some of this explanation to this blog post (https://duckdb.org/2023/04/28/spatial.html) . It is one of the top ranked Google results when I started exploring DuckDB spatial.