geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

Unable to set dataset creation options when exporting a data frame to GeoPackage #177

Closed: felnne closed this issue 1 year ago

felnne commented 1 year ago

Expansion of comment in https://github.com/geopandas/pyogrio/issues/71#issuecomment-1313891950.

I am trying to export a dataframe to a GeoPackage with two dataset creation options set, specifically VERSION and ADD_GPKG_OGR_CONTENTS, as documented in https://gdal.org/drivers/vector/gpkg.html#dataset-creation-options.

The code I'm using:

from pathlib import Path

from pyogrio import read_dataframe, write_dataframe

geojson_path = Path('input.geojson')
output_path = Path('output.gpkg')

write_dataframe(read_dataframe(geojson_path), path=str(output_path), VERSION=1.3, ADD_GPKG_OGR_CONTENTS='NO')

Running this code gives the following:

Warning 6: dataset output.gpkg does not support layer creation option VERSION
Warning 6: dataset output.gpkg does not support layer creation option ADD_GPKG_OGR_CONTENTS

From this I assume all **kwargs are being set as layer creation options rather than dataset creation options, but I'm unsure how to set dataset creation options (without using another tool).

Python version: 3.9.1
pyogrio version: 0.4.2

$ gdalinfo --version
GDAL 3.5.2, released 2022/09/02

Happy to provide any other information and apologies if I've missed anything useful to debug this.

felnne commented 1 year ago

To sanity-check that the ADD_GPKG_OGR_CONTENTS option isn't being applied, I checked whether the generated GPKG had a gpkg_ogr_contents table (I expected it not to):

$ sqlite3 output.gpkg
SQLite version 3.39.4 2022-09-29 15:55:41
Enter ".help" for usage hints.
sqlite> .tables  
gpkg_ogr_contents
gpkg_contents
gpkg_spatial_ref_sys
gpkg_extensions
gpkg_tile_matrix
gpkg_geometry_columns
gpkg_tile_matrix_set
input
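
The same check can be scripted rather than run interactively. A minimal sketch using only Python's standard sqlite3 module; the helper name has_table is hypothetical and not part of pyogrio or GDAL:

```python
import sqlite3


def has_table(gpkg_path, table_name):
    """Return True if the SQLite/GeoPackage file contains the named table.

    Note: sqlite3.connect() creates an empty database if the path does not
    exist, so make sure the file was actually written first.
    """
    con = sqlite3.connect(gpkg_path)
    try:
        # sqlite_master lists every table stored in the file.
        row = con.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table_name,),
        ).fetchone()
    finally:
        con.close()
    return row is not None
```

For the file above, has_table("output.gpkg", "gpkg_ogr_contents") would be expected to return False if ADD_GPKG_OGR_CONTENTS='NO' had been honoured.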

jorisvandenbossche commented 1 year ago

Yes, so currently we pass options (**kwargs) only to the layer creation step (GDALDatasetCreateLayer):

https://github.com/geopandas/pyogrio/blob/bdd7bf4fd9fda544c8b781596eea84cee4248ddd/pyogrio/_io.pyx#L1275-L1287

So passing dataset creation options is not possible right now (and I am not aware of any workaround).

But this is something we should certainly solve.

One option is to have explicit dataset_options and layer_options keywords that each take a dict (as mentioned in https://github.com/geopandas/pyogrio/issues/71#issuecomment-1105487378), or at least have this for the dataset options, and leave generic kwargs for the layer creation options.

Another option could be to split the user-passed kwargs automatically into dataset and layer creation options, using the driver metadata (cf. https://github.com/geopandas/pyogrio/issues/103).

A third option, passing the kwargs to both dataset and layer creation, doesn't seem desirable, since that causes warnings.
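
To make the second option concrete, here is a minimal sketch of splitting options by name. It assumes the supported option names are already known as plain sets; the function name split_creation_options is hypothetical, and the GPKG name sets are a small hand-picked subset taken from the GDAL GeoPackage documentation, not queried from the driver:

```python
def split_creation_options(options, dataset_names, layer_names):
    """Split a dict of options into (dataset, layer, unmatched) dicts.

    Option names are compared case-insensitively. A name listed in both
    sets is placed in both result dicts; names found in neither set are
    returned separately so the caller can decide what to do with them.
    """
    dataset_names = {name.upper() for name in dataset_names}
    layer_names = {name.upper() for name in layer_names}

    dataset_opts, layer_opts, unmatched = {}, {}, {}
    for key, value in options.items():
        name = key.upper()
        if name in dataset_names:
            dataset_opts[name] = value
        if name in layer_names:
            layer_opts[name] = value
        if name not in dataset_names and name not in layer_names:
            unmatched[name] = value
    return dataset_opts, layer_opts, unmatched


# Illustrative subset of GeoPackage option names (assumption: hand-written).
GPKG_DATASET_NAMES = {"VERSION", "ADD_GPKG_OGR_CONTENTS"}
GPKG_LAYER_NAMES = {"GEOMETRY_NAME", "FID", "SPATIAL_INDEX"}

dataset_opts, layer_opts, unmatched = split_creation_options(
    {"VERSION": "1.3", "ADD_GPKG_OGR_CONTENTS": "NO", "SPATIAL_INDEX": "YES"},
    GPKG_DATASET_NAMES,
    GPKG_LAYER_NAMES,
)
```

With this split, VERSION and ADD_GPKG_OGR_CONTENTS would go to dataset creation and SPATIAL_INDEX to layer creation, so neither step sees an option it does not support.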

martinfleis commented 1 year ago

> Another option could be to split the user-passed kwargs automatically into dataset and layer creation options

Are they always exclusive to dataset or layer?

jorisvandenbossche commented 1 year ago

> > Another option could be to split the user-passed kwargs automatically into dataset and layer creation options
>
> Are they always exclusive to dataset or layer?

Checking this using the code in https://github.com/geopandas/pyogrio/pull/189:

In [17]: for driver in pyogrio.list_drivers().keys():
    ...:     dataset_options = pyogrio.raw._parse_options_names(pyogrio.raw._get_driver_metadata_item(driver, "DMD_CREATIONOPTIONLIST"))
    ...:     layer_options = pyogrio.raw._parse_options_names(pyogrio.raw._get_driver_metadata_item(driver, "DS_LAYER_CREATIONOPTIONLIST"))
    ...:     common_options = set(dataset_options).intersection(set(layer_options))
    ...:     print(f"{driver}: {list(common_options) if common_options else '-'}")
    ...: 
    ...: 
ESRIC: -
FITS: -
PCIDSK: -
netCDF: -
PDS4: -
VICAR: -
JP2OpenJPEG: -
PDF: -
MBTiles: ['MINZOOM', 'DESCRIPTION', 'NAME', 'MAXZOOM']
BAG: -
EEDA: -
OGCAPI: -
ESRI Shapefile: -
MapInfo File: ['ENCODING']
UK .NTF: -
LVBAG: -
OGR_SDTS: -
S57: -
DGN: -
OGR_VRT: -
REC: -
Memory: -
CSV: ['GEOMETRY']
NAS: -
GML: -
GPX: -
LIBKML: ['LISTSTYLE_ICON_HREF', 'VISIBILITY', 'OPEN', 'SNIPPET', 'DESCRIPTION', 'LISTSTYLE_TYPE', 'NAME']
KML: -
GeoJSON: -
GeoJSONSeq: -
ESRIJSON: -
TopoJSON: -
Interlis 1: -
Interlis 2: -
OGR_GMT: -
GPKG: -
SQLite: -
OGR_DODS: -
WAsP: -
PostgreSQL: -
OpenFileGDB: -
DXF: -
CAD: -
FlatGeobuf: -
Geoconcept: -
GeoRSS: -
GPSTrackMaker: -
VFK: -
PGDUMP: -
OSM: -
GPSBabel: -
OGR_PDS: -
WFS: -
OAPIF: -
EDIGEO: -
SVG: -
CouchDB: -
Cloudant: -
Idrisi: -
ARCGEN: -
XLS: -
ODS: -
XLSX: -
Elasticsearch: -
Carto: -
AmigoCloud: -
SXF: -
Selafin: ['DATE']
JML: -
PLSCENES: -
CSW: -
VDV: -
GMLAS: -
MVT: ['MINZOOM', 'DESCRIPTION', 'NAME', 'MAXZOOM']
NGW: ['KEY', 'DESCRIPTION']
MapML: -
TIGER: -
AVCBin: -
AVCE00: -
HTTP: -

So there are some drivers where the same option name is accepted both at dataset creation and at layer creation. For the most common drivers this is not the case, though, so automatic splitting might still be convenient for those. (I didn't check the specific cases; for some of them it might make no difference in practice whether the option is passed as a dataset or a layer option.)
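
For readers without the PR checked out: GDAL exposes these option lists as XML strings in the driver metadata, with one Option element per option. The sketch below shows how the names could be extracted and compared. The helper parse_option_names is hypothetical (it is not the private pyogrio function used above), and the two XML snippets are hand-written illustrations, not real driver output:

```python
import xml.etree.ElementTree as ET


def parse_option_names(option_list_xml):
    """Return the option names found in a GDAL-style option list XML string."""
    if not option_list_xml:
        # Drivers without creation options report no metadata item at all.
        return []
    root = ET.fromstring(option_list_xml)
    return [
        option.attrib["name"]
        for option in root.iter("Option")
        if "name" in option.attrib
    ]


# Hand-written sample documents (assumption: shaped like GDAL metadata).
DATASET_XML = """<CreationOptionList>
  <Option name="NAME" type="string" description="Tileset name"/>
  <Option name="DESCRIPTION" type="string" description="Tileset description"/>
</CreationOptionList>"""

LAYER_XML = """<LayerCreationOptionList>
  <Option name="NAME" type="string" description="Layer name"/>
  <Option name="MINZOOM" type="int" description="Minimum zoom level"/>
</LayerCreationOptionList>"""

dataset_names = parse_option_names(DATASET_XML)
layer_names = parse_option_names(LAYER_XML)

# Names accepted at both levels are the ambiguous ones for automatic splitting.
common_names = sorted(set(dataset_names) & set(layer_names))
```

In this made-up sample only NAME appears in both lists, which is the kind of overlap reported for MBTiles and MVT above.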

We could also have both: explicit dataset_options and layer_options keywords where you need to pass a dict, while still allowing **kwargs that are passed automatically to either of those. Then we give the convenience for most cases, but give the explicit option when needed.
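
A rough sketch of how such a combined interface might resolve its inputs. Everything here is an assumption for illustration: the function name resolve_creation_options is hypothetical, raising on ambiguous kwargs and letting explicit dicts override kwargs are design choices made for this example, and none of it is taken from the pyogrio implementation:

```python
def resolve_creation_options(
    dataset_names, layer_names, dataset_options=None, layer_options=None, **kwargs
):
    """Combine explicit option dicts with automatically routed kwargs.

    Returns a (dataset_options, layer_options) pair of dicts with
    upper-cased keys.
    """
    dataset_names = {name.upper() for name in dataset_names}
    layer_names = {name.upper() for name in layer_names}

    resolved_dataset, resolved_layer = {}, {}
    for key, value in kwargs.items():
        name = key.upper()
        if name in dataset_names and name in layer_names:
            # Assumption: ambiguous names must be passed explicitly.
            raise ValueError(
                f"{name} is both a dataset and a layer creation option; "
                "pass it via dataset_options or layer_options instead"
            )
        if name in dataset_names:
            resolved_dataset[name] = value
        else:
            # Layer-only and unrecognised names go to layer creation,
            # mirroring the behaviour described earlier in this thread.
            resolved_layer[name] = value

    # Assumption: explicit dicts are applied last, so they take precedence.
    resolved_dataset.update({k.upper(): v for k, v in (dataset_options or {}).items()})
    resolved_layer.update({k.upper(): v for k, v in (layer_options or {}).items()})
    return resolved_dataset, resolved_layer
```

Called with the GeoPackage names from the original report, for example resolve_creation_options({"VERSION", "ADD_GPKG_OGR_CONTENTS"}, {"SPATIAL_INDEX"}, VERSION="1.3", ADD_GPKG_OGR_CONTENTS="NO"), both options would land in the dataset dict and the layer dict would stay empty.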

martinfleis commented 1 year ago

> We could also have both: explicit dataset_options and layer_options keywords where you need to pass a dict, while still allowing **kwargs that are passed automatically to either of those. Then we give the convenience for most cases, but give the explicit option when needed.

That is not a bad solution. +1

jorisvandenbossche commented 1 year ago

OK, I updated https://github.com/geopandas/pyogrio/pull/189 to take that route