duckdb / duckdb_spatial

MIT License
492 stars 41 forks source link

st_read_meta does not support the GDAL virtual filesystem interface! #269

Closed cboettig closed 9 months ago

cboettig commented 9 months ago

Note that st_read_meta just returns empty when we try and read over the virtual filesystem interface:

SELECT layers[1].geometry_fields[1].crs.auth_name as name,
       layers[1].geometry_fields[1].crs.auth_code as code 
       FROM st_read_meta('/vsicurl/https://github.com/duckdb/duckdb_spatial/raw/main/test/data/amsterdam_roads.fgb');

If we download the file first, it works fine:

SELECT layers[1].geometry_fields[1].crs.auth_name as name,
       layers[1].geometry_fields[1].crs.auth_code as code 
       FROM st_read_meta('amsterdam_roads.fgb');

image

Note that gdal CLI tools have no problem doing this:

gdalinfo -json /vsicurl/https://github.com/duckdb/duckdb_spatial/raw/main/test/data/amsterdam_roads.fgb

and notably, duckdb extension's st_read() is also quite happy to read over the vsi system.

Maxxen commented 9 months ago

I don't think st_read works with /vsicurl/ either anymore, Use DuckDB's HTTPFs instead when querying remote files.

Maxxen commented 9 months ago

Hmm, looks like it breaks due to DuckDB trying to glob the file path, we need to remove any vsi prefixes before calling into DuckDBs glob. That said, using httpfs prevents this issue anyway.

cboettig commented 9 months ago

indeed, I see!

Aside, but is there a reason for dropping the ability to use gdal's vsi? I'll try and go the httpfs route. I love duckdb's httpfs, but there's lots of nice stuff in the GDAL vsi that is highly optimized (caching, writing, etc).

cboettig commented 1 month ago

@Maxxen It looks like st_read() is quite happy with vsi support again, yay! This significantly speeds up many operations.
However, it seems not to be working with st_read_meta()?

Here's an example read with vsicurl. Works fine, and using vsicurl also means it leverages gdal cache, so if you run this a second time it is much faster. (which is not true without vsicurl)!

import duckdb
con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")
tblname = "test"

## path with VSI
path = "/vsicurl/https://data.source.coop/cboettig/us-boundaries/mappinginequality.json"

query = f'''
CREATE OR REPLACE VIEW {tblname} AS SELECT * FROM st_read('{path}');
'''

con.sql(query)
con.table("test")

However, st_read_meta comes back blank unless we remove vsicurl prefix:


# path = "/vsicurl/https://data.source.coop/cboettig/us-boundaries/mappinginequality.json"  # metadata empty
path = "https://data.source.coop/cboettig/us-boundaries/mappinginequality.json" # nonempty

query = f'''
CREATE OR REPLACE VIEW {tblname}_meta AS SELECT
    layers[1].feature_count as feature_count,
    layers[1].geometry_fields[1].name as geom_column_name,
    layers[1].geometry_fields[1].type as geom_type,
    layers[1].geometry_fields[1].crs.auth_name as name,
    layers[1].geometry_fields[1].crs.auth_code as code,
    layers[1].geometry_fields[1].crs.wkt as wkt,
    layers[1].geometry_fields[1].crs.proj4 as proj4
    FROM st_read_meta('{path}');
'''
con.sql(query)
con.table("test_meta")

It is obviously really really useful to be able to check the CRS of the data without downloading the whole data object. Of course we can just use gdal tools outside of duckdb, but when presenting this material in the classroom it feels really awkward to drop into a completely different tool for this after having just explained how much better duckdb is for what we are doing :smile: .