geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License
258 stars 22 forks source link

ENH?: make `layer=` mandatory if multiple layers in a file? #355

Closed theroggy closed 3 months ago

theroggy commented 5 months ago

At the moment, when a file being read contains > 1 layer and the layer parameter isn't specified, the first layer encountered in the file is being read.

This isn't really transparent behaviour, and I think it would be better if an error would be thrown when trying to read a multi-layer file without specifying layer. This isn't backwards compatible, but maybe now is the time still to change this?

An alternative that wouldn't break backwards compatibility would be to raise a warning.

Something to consider: what is a "layer": e.g. Geopackage files can also contain attribute tables which can be read by pyogrio. Are they to be considered as layers or can they be ignored if there is only one layer with geometries?

brendan-ward commented 5 months ago

True: it isn't always obvious to the user when there are additional layers.

I think we always read the first layer, even if it is tabular, though I haven't tripped over this with anything I've used so far.

I think a warning makes sense, possibly with a deprecation warning, if we want to make this required in a later version. Or we could just jump straight to raising an exception since we're still < 1.0. I'm not sure I have a strong preference.

In either case, we have to sort out the following situations:

Does anyone here have a strong preference for warning vs exception?

martinfleis commented 5 months ago

Or we could just jump straight to raising an exception since we're still < 1.0

Given pyogrio is going to be a default engine of geopandas, we should be really careful here. It is not a small change and ´geopandas.read_file` have this behaviour for a long time, way predating pyogrio. People just expect it to work.

I do agree that it is not ideal and I have been answering questions about that myself before so there is a space for improvement but I would definitely try to avoid raising.

I'd warn for multilayer GPKG or any other layered file that a layer is unspecified and that we read the first one + give a name of it.

Do we default to the one with geometries? Do we raise a warning or exception in this case?

What do we do now? I don't even know how would I create a file like that :D.

multiple geometry layers: raise a warning or exception

I'd warn.

theroggy commented 5 months ago

Or we could just jump straight to raising an exception since we're still < 1.0

Given pyogrio is going to be a default engine of geopandas, we should be really careful here. It is not a small change and ´geopandas.read_file` have this behaviour for a long time, way predating pyogrio. People just expect it to work.

I do agree that we need to be careful.

Possibly not really relevant, as the main thing that matters is to try avoiding working code to start breaking, nonetheless: I wonder if there are truly people that consciously expect that "the first table" is read if no layer is passed or is it just by accident that it works as they want. Or... are there just quite some bugs out there because people never noticed a wrong layer is being read...

Do we default to the one with geometries? Do we raise a warning or exception in this case?

What do we do now?

I had a look in GDAL, and the layers/tables are determined with a SQL statement that does a union all of all layers with geometry, then all attribute tables. So at the moment typically the first layer with geometry (if there is one) that was created in the GPKG will be returned. Note that there is no ORDER BY clause in the SQL query used, so the order of this result isn't guaranteed (on long term). Extract from the SQLite docs:

If there is no ORDER BY clause, then the order in which rows are extracted is undefined.
(In the current implementation, the queue becomes a FIFO if the ORDER BY clause is
omitted, but applications should not depend on that fact since it might change.)

I don't even know how would I create a file like that :D.


from pathlib import Path
import geopandas as gpd
import pandas as pd
import pyogrio
from shapely import Point

path = Path("test.gpkg") path.unlink(missing_ok=True) df = pd.DataFrame({"attr_table1_col": ["a1", "a2", "a3"]}) pyogrio.write_dataframe(df, path, layer="attr_table1") gdf = gpd.GeoDataFrame({"point1_col": ["p1", "p2", "p3"]}, geometry=[Point(1, 2)] 3) pyogrio.write_dataframe(gdf, path, layer="point1") gdf = gpd.GeoDataFrame({"point2_col": ["p1", "p2", "p3"]}, geometry=[Point(1, 2)] 3) pyogrio.write_dataframe(gdf, path, layer="point2") df = pd.DataFrame({"attr_table2_col": ["a1", "a2", "a3"]}) pyogrio.write_dataframe(df, path, layer="attr_table2") read_df = pyogrio.read_dataframe(path) print(read_df)

martinfleis commented 5 months ago

I wonder if there are truly people that consciously expect that "the first table" is read if no layer is passed or is it just by accident that it works as they want. Or... are there just quite some bugs out there because people never noticed a wrong layer is being read...

I do expect that the first layer is read in some cases. Since I am aware of this behaviour I tend to save those few characters specifying layer if I know which I saved first.

And yes, I've seen students running into bugs because they did not notice a wrong layer is read or had no clue the file has more layers.

So at the moment typically the first layer with geometry (if there is one) that was created in the GPKG will be returned.

Let's stick to that I'd say.