duckdb / duckdb_spatial

Detecting a broader amount of invalid geometry #73

Open marklit opened 1 year ago

marklit commented 1 year ago

I believe you're using GEOS' invalid geometry detector. I've seen a large number of geometries that pass through its filters without issue but are rejected by BigQuery. Is there any chance more validation libraries could be added so that a wider range of issues is detected?

Below is an example of a WKB string that BigQuery rejects but DuckDB/GEOS reports as valid.

$ echo "01060000000100000001030000000100000011000000AF7C96E7C1995EC0BDAC8905BEDC4240849B8C2AC3995EC0A5846055BDDC424019E59997C3995EC0F321A81ABDDC4240245F09A4C4995EC01DE90C8CBCDC4240ACC612D6C6995EC0D07F0F5EBBDC424029CE5147C7995EC0876BB587BDDC4240FA7DFFE6C5995EC03F8BA548BEDC424071C806D2C5995EC04B395FECBDDC4240A180ED60C4995EC004594FADBEDC42402A36E675C4995EC0F8AA9509BFDC4240962023A0C2995EC0622D3E05C0DC4240C6A4BF97C2995EC0213EB0E3BFDC4240AE62F19BC2995EC0C23577F4BFDC4240FCFF3861C2995EC003250516C0DC4240F06B2409C2995EC051F69672BEDC424020F0C000C2995EC051F69672BEDC4240AF7C96E7C1995EC0BDAC8905BEDC4240" > bad_poly.csv

$ bq load --source_format=CSV \
    --quiet \
    geo.wkb_test \
    ./bad_poly.csv \
    geom:GEOGRAPHY
BigQuery error in load operation: Error processing job 'geo-######:bqjob_r#######': Error
while reading data, error message: Could not parse
'01060000000100000001030000000100000011000000AF7C96E7C1995EC0BDAC8905BEDC42408...' as GEOGRAPHY for field geom (position 0)
starting at location 0  with message 'Invalid polygon loop: Edge 10 crosses edge 12; in WKB geography'
Failure details:
- Error while reading data, error message: CSV processing encountered
too many errors, giving up. Rows: 1; errors: 1; max bad: 0; error
percent: 0

I converted the WKB above into WKT for the example below.

$ ~/duckdb_spatial/build/release/duckdb -unsigned
select ST_IsValid(ST_GEOMFROMTEXT('MULTIPOLYGON(((-122.40246 37.724549,-122.402537 37.724528,-122.402563 37.724521,-122.402627 37.724504,-122.402761 37.724468,-122.402788 37.724534,-122.402704 37.724557,-122.402699 37.724546,-122.402611 37.724569,-122.402616 37.72458,-122.402504 37.72461,-122.402502 37.724606,-122.402503 37.724608,-122.402489 37.724612,-122.402468 37.724562,-122.402466 37.724562,-122.40246 37.724549)))')) AS geom_valid;
┌────────────┐
│ geom_valid │
│  boolean   │
├────────────┤
│ true       │
└────────────┘
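The WKB-to-WKT conversion mentioned above can be reproduced without any spatial library. The following is a minimal sketch, assuming plain ISO WKB (no SRID prefix) and 2D coordinates only; the function names `read_wkb_multipolygon` and `to_wkt` are hypothetical helpers written for this illustration, not part of DuckDB or GEOS.

```python
import struct


def read_wkb_multipolygon(hex_wkb):
    """Decode a 2D WKB Polygon or MultiPolygon hex string into nested lists."""
    buf = bytes.fromhex(hex_wkb)
    pos = 0

    def read(fmt_char, order):
        # Read one value ("I" = uint32, "d" = float64) and advance the cursor.
        nonlocal pos
        fmt = order + fmt_char
        (value,) = struct.unpack_from(fmt, buf, pos)
        pos += struct.calcsize(fmt)
        return value

    def read_header():
        # First byte is the byte order flag: 1 = little endian, 0 = big endian.
        nonlocal pos
        order = "<" if buf[pos] == 1 else ">"
        pos += 1
        return order, read("I", order)

    def read_polygon(order):
        rings = []
        for _ in range(read("I", order)):
            n_points = read("I", order)
            rings.append([(read("d", order), read("d", order))
                          for _ in range(n_points)])
        return rings

    order, geom_type = read_header()
    if geom_type == 3:  # Polygon
        return [read_polygon(order)]
    if geom_type == 6:  # MultiPolygon: each member carries its own header
        polygons = []
        for _ in range(read("I", order)):
            sub_order, sub_type = read_header()
            if sub_type != 3:
                raise ValueError("expected a Polygon inside the MultiPolygon")
            polygons.append(read_polygon(sub_order))
        return polygons
    raise ValueError(f"unsupported WKB geometry type {geom_type}")


def to_wkt(polygons):
    """Format the nested coordinate lists as MULTIPOLYGON WKT."""
    def ring_text(ring):
        return "(" + ",".join(f"{x:.15g} {y:.15g}" for x, y in ring) + ")"

    def polygon_text(rings):
        return "(" + ",".join(ring_text(r) for r in rings) + ")"

    return "MULTIPOLYGON(" + ",".join(polygon_text(p) for p in polygons) + ")"
```

Passing the hex string from the `echo` command to `to_wkt(read_wkb_multipolygon(...))` should yield WKT equivalent to the one used in the query, up to coordinate formatting.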

This is a collection of BigQuery geometry rejection error messages I've seen. I removed the numbers from the messages so they're easier to group together.

"Invalid polygon loop: Edge  has duplicate vertex with edge ; in WKB geography"
"Invalid polygon loop: Edge  has duplicate vertex with edge ; in loop ; in WKB geography"
"Invalid polygon loop: Edge  has duplicate vertex with edge ; in polygon ; in WKB geography"
"Invalid polygon loop: Edge  crosses edge ; in WKB geography"
"Invalid polygon loop: Edge  has duplicate vertex with edge ; in loop ; in polygon ; in WKB geography"
"Invalid polygon loop: Edge  crosses edge ; in polygon ; in WKB geography"
"Polygon loop should have at least  unique vertices, but only had ; in polygon ; in WKB geography"
"Invalid polygon loop: Edge  crosses edge ; in loop ; in WKB geography"
"Invalid polygon loop: Edge  crosses edge ; in loop ; in polygon ; in WKB geography"
"Loop  edge  crosses loop  edge ; in WKB geography"
"Polygon loop should have at least  unique vertices, but only had ; in loop ; in WKB geography"
"Invalid nesting: loop  should not contain loop ; in WKB geography"
"Loop  edge  crosses loop  edge ; in polygon ; in WKB geography"
"Invalid nesting: loop  should not contain loop ; in polygon ; in WKB geography"

Maxxen commented 1 year ago

I think the problem here is that BigQuery uses GEOGRAPHY, which is based on a spherical geometry model. It is perfectly possible to create geometries that do not self-intersect in the Cartesian plane but do when projected on a sphere. In essence, this issue is an extension of #16.
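To make the planar-versus-spherical difference concrete, here is a minimal sketch that tests one pair of edges both ways. It assumes edges are straight lines in lon/lat for the planar case and great-circle arcs shorter than 180 degrees for the spherical case. The function names and the sample coordinates are made up for illustration; this is not how GEOS or BigQuery implement their checks, and it ignores degenerate cases such as shared vertices or collinear edges.

```python
import math


def to_unit_vector(lon_deg, lat_deg):
    """Convert lon/lat in degrees to a 3D unit vector on the sphere."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def planar_crossing(a, b, c, d):
    """True if segments ab and cd properly cross with lon/lat treated as x/y."""
    def orient(p, q, r):
        return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

    return (orient(a, b, c) * orient(a, b, d) < 0
            and orient(c, d, a) * orient(c, d, b) < 0)


def spherical_crossing(a, b, c, d):
    """True if great-circle arcs ab and cd cross at an interior point."""
    a, b, c, d = (to_unit_vector(*p) for p in (a, b, c, d))
    ab = cross(a, b)
    acb = -dot(ab, c)
    bda = dot(ab, d)
    if acb * bda <= 0:  # c and d lie on the same side of the plane through ab
        return False
    cd = cross(c, d)
    cbd = -dot(cd, b)
    dac = dot(cd, a)
    # All four triangle orientations must agree for a true crossing.
    return acb * cbd > 0 and acb * dac > 0


# Hypothetical edges: one along latitude 45, one along the prime meridian.
edge_1 = ((-60.0, 45.0), (60.0, 45.0))
edge_2 = ((0.0, 50.0), (0.0, 70.0))

print(planar_crossing(*edge_1, *edge_2))     # False
print(spherical_crossing(*edge_1, *edge_2))  # True
```

The great-circle arc between the endpoints of `edge_1` bulges poleward to roughly latitude 63.4 at longitude 0, so it cuts through `edge_2` even though the two segments are clearly separate in the plane. The polygon in this issue is far smaller, so the effect there is tiny, but the mechanism is the same.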

pramsey commented 8 months ago

For small enough geometries, running MakeValid in a gnomonic projection will create valid geographies, but this will not fix world-spanning issues.
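The reason a gnomonic projection suits this is that it maps great-circle arcs to straight lines, so planar validity in the projected plane corresponds to spherical validity. Below is a minimal sketch of the forward and inverse projection on a unit sphere, using the standard textbook formulas. The function names are hypothetical, and the MakeValid step itself (which would run on the projected coordinates in a geometry library) is not shown.

```python
import math


def gnomonic_forward(lon_deg, lat_deg, lon0_deg, lat0_deg):
    """Project lon/lat onto the plane tangent to the unit sphere at (lon0, lat0)."""
    lam = math.radians(lon_deg - lon0_deg)
    phi, phi0 = math.radians(lat_deg), math.radians(lat0_deg)
    cos_c = (math.sin(phi0) * math.sin(phi)
             + math.cos(phi0) * math.cos(phi) * math.cos(lam))
    if cos_c <= 0:
        # Only points less than 90 degrees from the centre can be projected.
        raise ValueError("point is 90 degrees or more from the projection centre")
    x = math.cos(phi) * math.sin(lam) / cos_c
    y = (math.cos(phi0) * math.sin(phi)
         - math.sin(phi0) * math.cos(phi) * math.cos(lam)) / cos_c
    return x, y


def gnomonic_inverse(x, y, lon0_deg, lat0_deg):
    """Map projected x/y back to lon/lat in degrees."""
    phi0 = math.radians(lat0_deg)
    rho = math.hypot(x, y)
    if rho == 0:
        return lon0_deg, lat0_deg
    c = math.atan(rho)
    phi = math.asin(math.cos(c) * math.sin(phi0)
                    + y * math.sin(c) * math.cos(phi0) / rho)
    lam = math.atan2(x * math.sin(c),
                     rho * math.cos(phi0) * math.cos(c)
                     - y * math.sin(phi0) * math.sin(c))
    return lon0_deg + math.degrees(lam), math.degrees(phi)


# Two endpoints on latitude 45 and the midpoint of the great-circle arc
# between them, which lies at latitude atan(2) on longitude 0.
centre = (0.0, 45.0)
start = gnomonic_forward(-60.0, 45.0, *centre)
end = gnomonic_forward(60.0, 45.0, *centre)
mid = gnomonic_forward(0.0, math.degrees(math.atan(2.0)), *centre)
print(start, mid, end)  # all three share the same y, so they are collinear
```

The `ValueError` branch reflects the limitation noted above: a single gnomonic projection covers less than a hemisphere, so geometries spanning a large part of the globe cannot be handled this way.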