duckdb / duckdb_spatial

MIT License
492 stars 41 forks source link

`ST_Distance_Sphere` Function Fails in Specific Longitude Ranges #450

Closed tshelianthus closed 3 weeks ago

tshelianthus commented 3 weeks ago

Before Submitting


What Happens?

When using the ST_Distance_Sphere function to calculate geographic distances, it was observed that the function fails to return expected values in certain specific longitude ranges and instead returns null values. This issue prevents accurate calculation of geographic distances.

Query:

INSTALL spatial;
LOAD spatial;

SELECT
    '0' AS id,
    COALESCE(NULLIF(ST_Distance_Spheroid(st_point(40.6446, 73.7797), st_point(52.3130, 4.7725)), 'NaN'), -1) AS dist
UNION ALL
SELECT
    '1' AS id, 
    COALESCE(NULLIF(ST_Distance_Spheroid(st_point(121.489182, 31.11484), st_point(121.483771, 31.11866)), 'NaN'), -1) AS dist;
Result: id dist
0 5243187.666873225
1 -1

To Reproduce

Upon further verification, I observed that this issue also occurs within certain specific longitude ranges, raising concerns about the reliability of ST_Distance_Spheroid.

  1. Create points from longitude -180 to 180 with a 3-degree interval, and generate destination points 3 degrees away for each original point.
  2. Perform distance measurements using ST_Transform, ST_Distance, and ST_Distance_Sphere functions. Check for any undefined or incorrect values returned by ST_Distance_Spheroid.
  3. Execute the query and observe the results to verify if ST_Distance_Spheroid produces undefined or unreliable distance values within specific longitude ranges.

Sample Code:

import duckdb
import pandas as pd
from shapely.geometry import Point

# Connect to DuckDB database (using an in-memory database)
con = duckdb.connect(database=':memory:', read_only=False)

# Install and load the spatial extension
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# Create the example table with geometry columns
con.execute("""
DROP TABLE IF EXISTS example;
CREATE TABLE example (
    id INTEGER PRIMARY KEY,
    ori_geom GEOMETRY,
    des_geom GEOMETRY
)
""")

# Generate sample data points for every 3-degree longitude increment
data = [
    (lon, Point(lon, 0), Point(lon + 3, 0))  # Create original and destination points
    for lon in range(-180, 180, 3)
]

# Convert the data into a Pandas DataFrame
df = pd.DataFrame(data, columns=['id', 'ori_geom', 'des_geom'])

# Insert data from the DataFrame into the DuckDB table
con.execute("INSERT INTO example SELECT * FROM df")

# Query to calculate various distances and handle undefined values
res_query = """
SELECT 
    id, 
    ST_AsText(ori_geom) AS ori_geom_wkt, 
    ST_AsText(des_geom) AS des_geom_wkt,

    -- Distance in degrees between the two points
    ST_Distance(ori_geom, des_geom) AS dist_degree,  

    -- Transformed distance in Web Mercator projection
    ST_Distance(
        ST_Transform(ori_geom, 'EPSG:4326', 'EPSG:3857', always_xy := TRUE), 
        ST_Transform(des_geom, 'EPSG:4326', 'EPSG:3857', always_xy := TRUE)
    ) AS dist_transform,

    -- Spherical distance between the two points
    ST_Distance_Sphere(ori_geom, des_geom) AS dist_sphere,  

    -- Spheroidal distance, handling NaN values
    COALESCE(NULLIF(ST_Distance_Spheroid(ori_geom, des_geom), 'NaN'), -1) AS dist_spheroid,

    -- Boolean result indicating if the spheroidal distance is valid
    CASE   
        WHEN COALESCE(NULLIF(ST_Distance_Spheroid(ori_geom, des_geom), 'NaN'), -1) = -1 THEN FALSE
        ELSE TRUE
    END AS dist_spheroid_res
FROM example
"""

# Execute the query and fetch the results into a DataFrame
result = con.execute(res_query).fetchdf()
print(result)

Based on the test results, I created a diagram illustrating the corresponding longitude ranges to identify the "working" and "non-working" longitude bands for reference.

ST_Distance_Sphere


Environment


Maxxen commented 3 weeks ago

Hello!

ST_Distance_Spheroid is defined to only returns correct results if the input geometries are in WGS84 (EPSG:4326) latitude/longitude axis order. It seems like from your example that rhe input is neither.

tshelianthus commented 3 weeks ago

Hello!

ST_Distance_Spheroid is defined to only returns correct results if the input geometries are in WGS84 (EPSG:4326) latitude/longitude axis order. It seems like from your example that rhe input is neither.

Thank you for reply!

When I revisited the documentation example, I indeed noticed that the annotation specifies that the coordinates are in WGS84 and follow the [latitude, longitude] axis order.

But there are still questions that need to be discussed.

Check Function Docs

Example from ST_Distance_Sphere:

-- Note: the coordinates are in WGS84 and [latitude, longitude] axis order
-- Whats the distance between New York and Amsterdam (JFK and AMS airport)?
SELECT st_distance_spheroid(
st_point(40.6446, 73.7797),
st_point(52.3130, 4.7725)
);
----
5243187.666873225
-- Roughly 5243km!

Check Point Usual Usage

However, I observed that in the st_distance_spheroid function, the st_point function takes parameters in the order (latitude, longitude), which deviates from the usual conventions.

For example:

  1. PostGIS uses the (longitude, latitude) order in ST_Point, defined as:

    ST_Point(float x, float y)

    For geodetic coordinates, x represents longitude, and y represents latitude.

    SELECT ST_Point(-71.104, 42.315);  -- x is longitude, y is latitude
  2. WKT (Well-Known Text) also follows the (longitude, latitude) order.

    POINT(x y)  -- x is longitude, y is latitude

Verify ST_Point Geometry in DuckDB

To verify if ST_Point in DuckDB generates geometry objects correctly, I used the ST_X and ST_Y functions.

Query:

SELECT 
    st_x(st_point(40.6446, 73.7797)) AS lon, 
    st_y(st_point(40.6446, 73.7797)) AS lat, 
    st_astext(st_point(40.6446, 73.7797)) AS geom_wkt;
Result: lon lat geom_wkt
40.6446 73.7797 POINT (40.6446 73.7797)

The results show that ST_Point in DuckDB generates geometry objects in the order (longitude, latitude), which does align with the usual conventions.

As a result, the validity of distance calculations using st_distance_spheroid remains questionable.

Maxxen commented 3 weeks ago

DuckDB geometries, and ST_Point doesn't care about the axis order, as far as we know its just "X" and "Y", if that means "lat/lon" or "lon/lat" is completely dependent on the CRS you are working in.

There are only a few exceptions where this matters, and that is the _Sphere/_Spheroid functions which do require x to be lat, and y to be lon, as well as be within the valid ranges defined by the WGS84 datum. This is why you have to pass always_xy := true when transforming to EPSG:3857 without it ST_Transform would expect the input to be in lat/lon (as again, that is what EPSG:4326 defines).

The reason we use lat/lon convention here is because that is the axis order that WGS84/EPSG:4326 is standardized to have. Despite this other GIS tools (like PostGIS) have made the choice always treat x=lon, y=lat regardless of the CRS definition. Because DuckDB doesn't actually store the CRS/SRID in the geometry type, It made a lot of sense for us to try to avoid making a choice in this matter at all and simply leave it up to the user to know what projection they are working in.

See my post in this thread for more info.

tshelianthus commented 3 weeks ago

Thank you for the detailed explanation!

I now understand the reasoning behind how DuckDB handles geometries and the dependency on axis order according to the coordinate reference system.

I also noticed from the roadmap that support for spherical geometry will be discontinued. And I found that when using projected coordinates, the st_distance result is equivalent to st_distance_spheroid. I will use this approach as an alternative for distance calculations.

ST_Distance(
    ST_Transform(ori_geom, 'EPSG:4326', 'EPSG:3857', always_xy := TRUE), 
    ST_Transform(des_geom, 'EPSG:4326', 'EPSG:3857', always_xy := TRUE)
)