duckdb / duckdb_spatial

MIT License
492 stars 41 forks source link

Is it possible to read more than one file, using some pattern? #191

Open aborruso opened 1 year ago

aborruso commented 1 year ago

Hi, I have a lot of file pbf in a folder. I am able to read a single file using SELECT * FROM ST_READ('11_1076_766.pbf');.

It seems not possibile to run something like SELECT * FROM ST_READ('/folder/*.pbf');, I have this error:

Error: IO Error: GDAL Error (4): /folder/*.pbf: No such file or directory

Is there another way to do it? Is there another way to merge all?

Thank you

Maxxen commented 1 year ago

I don't think this is possible right now, but it's actually a pretty intriguing idea. DuckDB has some reusable code used by the json/csv/parquet reader functions to handle multiple file inputs and maybe we would make it work for ST_Read too with some work.

However you can get the available files matching a pattern from within duckdb using SELECT * FROM glob('...') function.

aborruso commented 1 year ago

I don't think this is possible right now, but it's actually a pretty intriguing idea. DuckDB has some reusable code used by the json/csv/parquet reader functions to handle multiple file inputs and maybe we would make it work for ST_Read too with some work.

Do I turn this into a feature request?

Thank you

marklit commented 9 months ago

It would be great to see glob support added to ST_READ. At the moment, I need a SQL statement to collect the bathymetry from each depth level from Natural Earth. It would be nice to wrap that up into a single SQL statement. It would also bring ST_READ's behaviour inline with the CSV and parquet reading functions in DuckDB as well.

$ ls -lht ne_10m_bathymetry_*.shx
-rwxrwxrwx  1 mark  staff   124B Aug 13  2017 ne_10m_bathymetry_A_10000.shx
-rwxrwxrwx  1 mark  staff   276B Aug 13  2017 ne_10m_bathymetry_B_9000.shx
-rwxrwxrwx  1 mark  staff   220B Aug 13  2017 ne_10m_bathymetry_C_8000.shx
-rwxrwxrwx  1 mark  staff   572B Aug 13  2017 ne_10m_bathymetry_D_7000.shx
-rwxrwxrwx  1 mark  staff   4.2K Aug 13  2017 ne_10m_bathymetry_E_6000.shx
-rwxrwxrwx  1 mark  staff    15K Aug 13  2017 ne_10m_bathymetry_F_5000.shx
-rwxrwxrwx  1 mark  staff    14K Aug 13  2017 ne_10m_bathymetry_G_4000.shx
-rwxrwxrwx  1 mark  staff   5.3K Aug 13  2017 ne_10m_bathymetry_H_3000.shx
-rwxrwxrwx  1 mark  staff   1.7K Aug 13  2017 ne_10m_bathymetry_I_2000.shx
-rwxrwxrwx  1 mark  staff   1.3K Aug 13  2017 ne_10m_bathymetry_J_1000.shx
-rwxrwxrwx  1 mark  staff   4.2K Aug 13  2017 ne_10m_bathymetry_K_200.shx
-rwxrwxrwx  1 mark  staff   276B Aug 13  2017 ne_10m_bathymetry_L_0.shx
$ duckdb
SELECT COUNT(*) FROM ST_READ('ne_10m_bathymetry_*.shx');
Error: IO Error: GDAL Error (4): ne_10m_bathymetry_*.shx: No such file or directory
jaanli commented 8 months ago

+1, need this as well -- trying to combine shapefiles in this directory: https://www2.census.gov/geo/tiger/TIGER2020/PUMA20/ (to make this plot spatial: https://jaanli.github.io/american-community-survey/income)

SQL that does not work with the spatial extension's lack of support for glob syntax:

COPY(
SELECT * FROM '~/data/american_community_survey/TIGER2020/PUMA20/**/*.shp'
) TO '~/data/american_community_survey/shapefiles/TIGER2020/PUMA20/combined/TIGER2020_PUMA20.shp' WITH (FORMAT GDAL, DRIVER 'ESRI Shapefile');

Otherwise, this extension is amazing and lets me do things I could not before -- thank you!!

Here's similar code with Geopandas:

from pathlib import Path 
dir = Path('/Users/me/data/american_community_survey/shapefiles/TIGER2020/PUMA20')

gdf_list = []
for file in dir.glob('**/*.shp'):
    print(file)
    gdf = gpd.read_file(file)
    gdf_list.append(gdf)
gdf = pd.concat(gdf_list)
gdf.to_file(dir / 'combined' / 'TIGER2020_PUMA20.shp')