Open Maxxen opened 1 year ago
I don't if it is help, but actually I used the rust port of https://geographiclib.sourceforge.io/C++/doc/index.html for geodesic calculation.
I'm currently trying to replicate the following
ST_DistanceSphere(
ST_MakePoint(lon, lat),
ST_MakePoint(last_lon, last_lat)
) AS distance_m,
with
ST_Distance(
ST_Point(lon, lat),
ST_Point(last_lon, last_lat)
) AS distance_m,
but not being a spatial guy, I'm unsure of the implications of this. From what I've discerned I would require these Spherical calculations to provide the 6370986 meter radius effect on the measurements between the two points.
this sounds like a pretty fundamental requirement of geo data so I'm adding this here in hope that it isn't required, but perhaps does require spherical calculations.
To clarify, all the algorithms implemented as of now assume a planar coordinate system. This makes a lot of algorithms much easier and faster to implement and works well when you are processing geospatial data that lies within a small-ish region, say e.g. on the scale of a country or two. Once you reach a scale where the curvature of the earth becomes important, then yeah, measurements like distances and areas become extremely inexact if implemented with straight forward cartesian geometry . The PostGIS documentation has a great example of this.
Note that in general we don't really know the exact shape of the earth, so there is really three different ways to approximate geospatial calculations with progressively increasing difficulty, performance overhead and precision.
Like PostGIS it would be cool if we could support all three to some degree, but a lot of the more advanced algorithms (e.g. polygon intersection) are already extremely complicated in the planar case, so I'd like to take this in small steps.
That said, I have something in the works using Geographiclib for the elementary algorithms (distance, length, area). But I think we should implement and focus on 2. in the future for a potential GEOGRAPHY
type.
Partially implemented in #74
Guys guys... #74 was merged in May 2023 but the documentation was never updated so I just discovered that they are in the current extension!!!
D select ST_Distance_spheroid(ST_Point(49.60, 6.12), ST_Point(49.60, 6.14));
┌────────────────────────────────────────────────────────────────────┐
│ st_distance_spheroid(st_point(49.60, 6.12), st_point(49.60, 6.14)) │
│ double │
├────────────────────────────────────────────────────────────────────┤
│ 1445.776772642289 │
└────────────────────────────────────────────────────────────────────┘
Hi – it's exciting to see that this extension is still being actively developed! Almost a year on since this roadmap item was opened, I was wondering if the calculation of geographic/spheroid distances between polygons and points was on the horizon? Otherwise it looks like falling back to PostGIS is my best bet (but open to other suggestions).
Line segment <-> point and the additional combinations that follow require an iterative algorithm that I've left out for now.
I would also be curious to know what this algorithm involves. Thanks!
Time flies! Ive recently completed a pretty major overhaul of the internal geometry representation used during execution so im planning on going back and adding the remaining missing measurement functions, including geodetic/haversine based ops, in the near future.
Im currently working on refactoring the docs and will try to shape up and share more about the roadmap for my plans for 2024 as well.
DuckDB version == v1.0.0
Thanks for your contributions on this! I think I've identified a possible issue where the x and y coordinates are getting transposed for each other in one of the checks. If you go past -90 (min latitude but not min longitude) on the x coordinate (longitudinal) plane, the calculations seem to stop working
SELECT ST_AREA_SPHEROID(ST_ENVELOPE(ST_BUFFER(ST_POINT(-89.999, 30), .0001)));
st_area_spheroid(st_envelope(st_buffer(st_point(-89.999, 30), .0001))) │
│ double │
├────────────────────────────────────────────────────────────────────────┤
│ 0.008709549903869629
^ this works as expected, while incrementing up to from -89.999 to -90 for the longitudinal coordinate returns a nan
with everything else the same
SELECT ST_AREA_SPHEROID(ST_ENVELOPE(ST_BUFFER(ST_POINT(-90, 30), .0001)));
st_area_spheroid(st_envelope(st_buffer(st_point(-90, 30), .0001))) │
│ double │
├────────────────────────────────────────────────────────────────────┤
│ nan
@Maxxen I'm also seeing the error above.
overturemaps download -f geoparquet --bbox -71.063432,42.301047,-71.009501,42.382221 -o boston.parquet -t building
D SELECT
id,
ST_AREA(ST_GeomFromWKB(geometry)) as area,
ST_Area_Spheroid(ST_geomFromWKB(geometry)) AS cartesian_area
FROM 'boston.parquet' limit 5;
┌──────────────────────────────────┬────────────────────────┬───────────────────┐
│ id │ area │ cartesian_area │
│ varchar │ double │ double │
├──────────────────────────────────┼────────────────────────┼───────────────────┤
│ 08b2a3066bd41fff0200fc6ab276c90d │ 1.285837500012519e-08 │ 51.9845110103488 │
│ 08b2a3066bd4cfff0200bc057338d4ed │ 1.2229789998652358e-08 │ 49.44350370578468 │
│ 08b2a3066bd41fff020028311e9c371c │ 1.5778415000042524e-08 │ 63.7901326790452 │
│ 08b2a3066bd4cfff02008dcf88943bb9 │ 1.1989005000108085e-08 │ 48.47045111026091 │
│ 08b2a3066bd4cfff020002be80094552 │ 2.0510755000441452e-08 │ 82.92415748536587 │
overturemaps download -f geoparquet --bbox -122.365604,47.573155,-122.307926,47.623162 -o seattle.parquet -t building
D SELECT
id,
ST_AREA(ST_GeomFromWKB(geometry)) as area,
ST_Area_Spheroid(ST_geomFromWKB(geometry)) AS cartesian_area
FROM 'seattle.parquet' limit 5;
┌──────────────────────────────────┬────────────────────────┬────────────────┐
│ id │ area │ cartesian_area │
│ varchar │ double │ double │
├──────────────────────────────────┼────────────────────────┼────────────────┤
│ 08b28d555096bfff0200ba6e244e17a4 │ 8.85314630001753e-07 │ nan │
│ 08b28d5550948fff0200b8ce7ac64e3b │ 1.0191719999341135e-08 │ nan │
│ 08b28d555095efff02004bd14aeeb6be │ 9.111243699996493e-07 │ nan │
│ 08b28d5550958fff0200eb9c47a75fc9 │ 2.090709999588343e-09 │ nan │
│ 08b28d5550855fff02007c80ff8ad9f0 │ 2.003438000002535e-08 │ nan │
└──────────────────────────────────┴────────────────────────┴────────────────┘
I just want to weigh in here that I believe this is the root of an issue I am having with Overture Maps, specifically with:
ST_Transform(ST_GeomFromWKB(geometry), 'EPSG:4326', 'EPSG:3857') as reprojected_geometry
I then calculate the ST_Area(reprojected_geometry)
When I run that on the whole world I get NULL values for a good chunk of it:
I am a fairly novice, self taught developer - I don't understand the specific details of this kind of thing, but it's my understanding that ST_Transform
uses spherical calculations.
Hi! There's a couple of things at play here.
ST_Transform
defaults to respecting the axis order defined by the CRS authority. EPSG:4326
is defined as lat/lon, while 'EPSG:3857'
is defined as lon/lat. Meaning that when transforming 'EPSG:4326
to EPSG:3857
it is expected that the input is already in lat/lon axis order.
You can pass a boolean as an optional third argument to force st_transform
to always expect [lon/lat] axis order regardless of what projection you use. You can also use st_flipcoordinates
to adjust the axis order manually. The new docs are still a bit rough around the edges but I've attempted to document this better here: . https://github.com/duckdb/duckdb_spatial/blob/main/docs/functions.md#st_transform
Additionally, while most functions in the spatial extension is ambivalent to the axis order or projection (as they almost all work in planar space) the _spheroid
functions expect their input to be in lat/lon EPSG:4326 (WGS84).
This is a common cause of confusion, see e.g: https://github.com/duckdb/duckdb_spatial/issues/222, https://github.com/duckdb/duckdb_spatial/issues/219 https://github.com/duckdb/duckdb_spatial/issues/215 https://github.com/duckdb/duckdb_spatial/issues/179 https://github.com/duckdb/duckdb_spatial/issues/63
I've elaborated more on this both on discord and in the issues reported above. But in short:
Most GIS tools tend to always work with a fixed axis order. Most FOSS GIS tools prefer lon/lat (x/y) while others (especially in the GPS/radio-space) prefer lat/lon. Around the time DuckDB spatial was created it seemed like to me that the big libraries in the open source GIS world such as GDAL and PROJ had recently made an attempt to actually start respecting the CRS authority instead of always expecting lon/lat. See e.g. GDAL axis order post 3.0 and PROJ's FAQ. Therefore the approach taken in DuckDB spatial was to simply stay ambivalent to the axis order used. As we don't store SRID's in each geometry anyway it made sense to me that the user would ultimately be responsible for knowing what coordinate system and axis order they worked within.
Unfortunately it seems like the idea of respecting the axis order defined by the CRS authority never really caught on in practice. Modern specs such as GeoJSON uses geographic coordinates but mandates lon/lat anyway, and even brand new specs such as GeoParquet require lon/lat always.
Judging by the number of issues opened by users confused by this, I am definitely ready to change this behavior so that we also normalize all coordinates into easting/northing (e.g. lon/lat) when ingested by duckdb internally. However, this will require a bit of work as we then would need some way to track the projection information for each geometry column whenever we import/export data from duckdb spatial so that we can flip the coordinates back from/to what the CRS mandates when data enters or leaves spatial.
I have some ideas on how to do this but it is going to be part of a larger overhaul I've got planned for how we handle projection information in spatial, so I don't have a timeline for when this behavior will change.
ST_Transform defaults to respecting the axis order defined by the CRS authority. EPSG:4326 is defined as lat/lon, while 'EPSG:3857' is defined as lon/lat. Meaning that when transforming 'EPSG:4326 to EPSG:3857 it is expected that the input is already in lat/lon axis order.
To further complicate matters (and mentioned further down the post), it is also "expected" that GeoJSON input sources will be in both EPSG:4326 and X-Y (or Lon-Lat) order, and current behavior and expectation in duckdb means always_xy
must always be used for 4326 -> 3857 conversion of GeoJSON data when imported into DuckDB.
As we don't store SRID's in each geometry anyway it made sense to me that the user would ultimately be responsible for knowing what coordinate system and axis order they worked within. However, this will require a bit of work as we then would need some way to track the projection information for each geometry column whenever we import/export data from duckdb spatial so that we can flip the coordinates back from/to what the CRS mandates when data enters or leaves spatial.
As a user, I think it is okay to not track the projection/SRID, provided that DuckDB does not change the axis order without explicit user intent/awareness. By that I mean, the output of any function should follow the axis order of the input (unless the user chooses otherwise).
means always_xy must always be used for 4326 -> 3857 conversion of GeoJSON data when imported into DuckDB.
Not necessarily, you could transform OGC:CRS84
-> EPSG:3857
instead in which case you won't need it, as OGC:CRS84 is the same as WGS84 but with a lon/lat axis order.
provided that DuckDB does not change the axis order without explicit user intent/awareness. By that I mean, the output of any function should follow the axis order of the input (unless the user chooses otherwise).
Thats how it currently works. The only function that can change the axis order is ST_Transform
(and I guess st_flipcoordinates
), and it does so by trusting that the user supplies (and expects) geometry in the axis order defined by the given source and target CRS.
Not necessarily, you could transform
OGC:CRS84
->EPSG:3857
instead in which case you won't need it, as OGC:CRS84 is the same as WGS84 but with a lon/lat axis order.
Refer to the screenshot: https://github.com/duckdb/duckdb_spatial/issues/211#issuecomment-2316275955. St_Xmax ends up picking up the latitude values (which is problematic for me, although I probably made a wrong flip in between to avoid the infinity values).
The only function that can change the axis order is ST_Transform (and I guess st_flipcoordinates), and it does so by trusting that the user supplies (and expects) geometry in the axis order defined by the given source and target CRS.
I guess this is the part where there is a conflict in expectations - I don't expect DuckDB to be having any expectations regarding CRS, when DuckDB is not keeping track of CRS (or "CRS-blind"). Of course, this is not a statement on which of these expectations are more justified, just a sharing of user sentiment.
Thanks for the explanation @Maxxen!
Additionally, while most functions in the spatial extension is ambivalent to the axis order or projection (as they almost all work in planar space) the _spheroid functions expect their input to be in lat/lon EPSG:4326 (WGS84).
This is the biggest point of confusion for me. If the EPSG:4326
spec mandates long/lat, why would the spheroid logic require flipping the coordinates to be the opposite of what they are expected to be in? Feels like a very small minority of users, if any would ever get the usage of this right.
If I'm following the PostGIS convention, which I remember reading the spatial extension is modeled after, if I just wanted to inline the selection of a point via ST_POINT
, the only way to get the "correct" result for a query of the area of a buffer of it would be with an invocation of ST_FLIPCOORDINATES
which feels very unergonomic.
SELECT
ST_AREA_SPHEROID(
ST_BUFFER(
ST_FLIPCOORDINATES(ST_POINT(-90,30)),
.00001
)
)
I don't expect DuckDB to be having any expectations regarding CRS, when DuckDB is not keeping track of CRS (or "CRS-blind").
And it generally doesn't. You have to supply both a "source" and a "target" SRID when transforming a geometry, and DuckDB will treat the geometry as having the axis order as defined by the source CRS. That is, if you transform ST_Transfrom(st_point(1,2), 'EPSG:4326', 'EPSG:3857')
the input is expected to be in lat/lon and the output will be in lon/lat. If you instead do ST_Transform(st_point(1,2), 'OGC:CRS84', 'EPSG:3857')
the input is expected to be in lon/lat and the output will still be lon/lat.
The only exception to this rule is the _spheroid
functions, as the axis order is actually significant for the result and the algorithm is hardcoded for the WGS84 datum. Here we chose again to go with respecting the axis order as defined by the authority, that is __lat/lon__
. It would be even more inconsistent if we went with the crs-mandated axis order when projecting, but not in the _spheroid
functions, or vice versa.
This is the biggest point of confusion for me. If the EPSG:4326 spec mandates long/lat, why would the spheroid logic require flipping the coordinates to be the opposite of what they are expected to be in?
@jtbaker you still got it the wrong way around, EPSG:4326
mandates lat/lon. PostGIS convention is to not respect the CRS authority and always assumes all geometries to be in lon/lat. But they also "track" the CRS of each geometry value so that you don't have to supply a source SRID when transforming.
Just to reiterate, Im fully on board with changing the current behavior because, for better or worse, a lot of users are obviously not used to or do not expect to consider that different coordinate systems are standardized to have different axis-orders. The convention for GIS tools to normalize geometries so that x
corresponds to the "easting" and y
"northing" is very strong - and that's not necessarily a bad thing. We are not completely alone in this though, pyproj also has a always_xy
parameter.
And chances are that more systems will default to respecting the axis order defined by the spec in the future, that is, if PROJ manages to "[...] spearhead this effort, hopefully setting a good example for the rest of the geospatial industry."
Im just trying to contextualize why the current behavior is what it is, why it is not a bug, and why I think it makes the most sense considering our approach of being "blind" to the CRS in general (i.e. not implicit tracking of geometry columns, no normalization/coordinate flipping on import/export).
And it generally doesn't. You have to supply both a "source" and a "target" SRID when transforming a geometry, and DuckDB will treat the geometry as having the axis order as defined by the source CRS. ST_Transfrom(st_point(1,2), 'EPSG:4326', 'EPSG:3857') the input is expected to be in lat/lon and the output will be in lon/lat. If you instead do ST_Transform(st_point(1,2), 'OGC:CRS84', 'EPSG:3857') the input is expected to be in lon/lat and the output will still be lon/lat.
In the case of GeoJSON inputs/data, the expected axis order is lon/lat. In practice, whether it is specified as EPSG:4326 or OGC:CRS84 makes no material difference, even if the GeoJSON file did not specify any CRS (for example: https://raw.githubusercontent.com/nshiab/simple-data-analysis/main/test/geodata/files/polygons.geojson), most use cases just need the lon-lat values to be converted to X-Y values in some kind of projected CRS (e.g. EPSG:3857) for the purpose of plotting. For DuckDB to be truly "CRS-blind", I would prefer/expect ST_Transform to work like ST_Transform(St_Point(X,Y), 'source_projection_code', 'target_projection_code') -> ST_Point(X,Y)
, without caring if it is is 4326, 3857 or some user-specified CRS. So to me, always_xy
is necessary.
@Maxxen Thanks for this explanation. It cleared the confusion that I had. May I propose that you add some of this explanation to this blog post (https://duckdb.org/2023/04/28/spatial.html) . It is one of the top ranked Google results when I started exploring DuckDB spatial.
While I'm hesitant to implement a full blown
GEOGRAPHY
type, I do think that having a couple of spatial functions using geodetic algorithms operating on a sphere would be a good complexity/usefulness compromise.