geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

Support for forcing driver in `read_dataframe` #310

Closed kylebarron closed 9 months ago

kylebarron commented 9 months ago

I have some CSV data whose file extension is not .csv. It appears that the OGR CSV driver only auto-detects CSV data if the extension is .csv. Is there a way to force the use of a specific driver when it is not auto-detected? As far as I can tell, there isn't one yet. This would be akin to the driver argument of fiona.open.

brendan-ward commented 9 months ago

It looks like Fiona uses the enabled_drivers parameter to restrict the drivers that GDAL checks when opening a file for reading (not driver, which is used for writing). It is unclear from the example in the docstring whether it actually allows overriding the driver used to open the file, or whether it simply constrains the set of drivers checked, which would still depend on the file extension.

In pyogrio we enable all drivers at startup; I'm not sure we want to selectively enable only some of them.

The GDAL Python bindings let you first obtain a driver instance and then use that to open a path, so it may provide a better example to check.
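A minimal sketch of that pattern, assuming the GDAL Python bindings (`osgeo`) are installed. The helper name `open_with_driver` is hypothetical. Whether this bypasses extension-based detection for a given driver is not confirmed in this thread, so treat it as something to test rather than a known solution.

```python
def open_with_driver(path, driver_name):
    """Open `path` through one specific OGR driver (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require GDAL.
    from osgeo import ogr

    driver = ogr.GetDriverByName(driver_name)
    if driver is None:
        raise ValueError(f"OGR driver not available: {driver_name!r}")

    # Driver.Open returns None on failure unless GDAL exceptions are enabled,
    # in which case it raises instead.
    dataset = driver.Open(path)
    if dataset is None:
        raise RuntimeError(f"{driver_name} driver could not open {path!r}")
    return dataset
```

Usage would look like `open_with_driver("data.txt", "CSV")`, where `data.txt` is a placeholder file name.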

theroggy commented 9 months ago

Have you tried this, from the GDAL CSV driver documentation?

> For files structured as CSV, but not ending with .CSV extension, the 'CSV:' prefix can be added before the filename to force loading by the CSV driver.

kylebarron commented 9 months ago

> For files structured as CSV, but not ending with .CSV extension, the 'CSV:' prefix can be added before the filename to force loading by the CSV driver.

Ooo... the driver documentation page is very long and I missed that. But yes that does work!
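For anyone landing here later, a minimal sketch of the workaround, assuming pyogrio passes the prefixed string through to GDAL unchanged (which matches what was observed above). The helper names `with_driver_prefix` and `read_csv_like` and the file name are hypothetical.

```python
import os


def with_driver_prefix(path, prefix="CSV"):
    """Build a GDAL dataset name such as 'CSV:data.txt' (hypothetical helper)."""
    return f"{prefix}:{os.fspath(path)}"


def read_csv_like(path):
    """Read a CSV-structured file whose extension is not .csv."""
    # Imported lazily so the helper above works without pyogrio installed.
    from pyogrio import read_dataframe

    # The 'CSV:' prefix asks GDAL to use the CSV driver regardless of extension.
    return read_dataframe(with_driver_prefix(path, "CSV"))
```

Calling `read_csv_like("observations.txt")` would then be equivalent to `read_dataframe("CSV:observations.txt")`.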

I'm happy to close this, or to keep it open in case forcing a driver is useful in other cases.

brendan-ward commented 9 months ago

We can close until there is an urgent need to add that capability for other drivers.

theroggy commented 9 months ago

This works for many drivers already...

For non-file drivers the prefix to use is defined with the GDAL_DMD_CONNECTION_PREFIX metadata item, e.g.:

poDriver->SetMetadataItem(GDAL_DMD_CONNECTION_PREFIX, "MSSQL:");

The GeoJSON driver, for example, also supports this according to its driver documentation.

And if you search the GDAL code base, you'll see it is implemented for many drivers.
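To see which drivers in a local GDAL build declare such a prefix through the GDAL_DMD_CONNECTION_PREFIX metadata item, something like the sketch below may help. It assumes the GDAL Python bindings are installed and that the metadata key string is "DMD_CONNECTION_PREFIX" (the value of that macro). The function names are hypothetical, and drivers that handle a prefix without setting this metadata item would not show up.

```python
def collect_prefixes(drivers):
    """Map driver short name to connection prefix for drivers that declare one."""
    prefixes = {}
    for driver in drivers:
        prefix = driver.GetMetadataItem("DMD_CONNECTION_PREFIX")
        if prefix:
            prefixes[driver.ShortName] = prefix
    return prefixes


def gdal_connection_prefixes():
    """List connection prefixes for all drivers registered in this GDAL build."""
    # Imported lazily so collect_prefixes can be used without GDAL installed.
    from osgeo import gdal

    drivers = (gdal.GetDriver(i) for i in range(gdal.GetDriverCount()))
    return collect_prefixes(drivers)
```

Printing `gdal_connection_prefixes()` gives a quick overview of the prefixes available before trying one in `read_dataframe`.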