geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License
260 stars 22 forks source link

use_arrow=True vs False: different empty dataframe result when no columns fetched #263

Closed theroggy closed 10 months ago

theroggy commented 11 months ago

When reading no columns from a file, this results in an empty dataframe with no columns using read_dataframe, you get a different result depending on use_arrow:

Not sure what is the expected behaviour. For what it is worth, fiona seems to do the same as use_arrow=False.

Script to reproduce:

import pyogrio

url = "https://github.com/theroggy/pysnippets/raw/main/pysnippets/pyogrio/polygon-parcel_31370.zip"
for use_arrow in [True, False]:
    gdf = pyogrio.read_dataframe(url, use_arrow=use_arrow, columns=[], read_geometry=False)
    print(f"\nuse_arrow={use_arrow}, len(gdf): {len(gdf)}, gdf:\n{gdf}")

relevant output:

use_arrow=True, len(gdf): 46, gdf:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]

use_arrow=False, len(gdf): 0, gdf:
Empty DataFrame
Columns: []
Index: []
brendan-ward commented 11 months ago

I wonder if we should instead raise an exception if read_geometry=False, columns=[], fid_as_index=False? Because in that case, there is nothing meaningful to return. If fid_as_index=True, then I suppose we should at least return an empty data frame with a non-empty index.

Alternatively, at the end of read_arrow we could detect if the arrow table has no columns after attempting to read it, and then return a truly empty table:

if table.num_columns == 0:
    # return empty pyarrow.Table if there were no columns read
    return pyarrow.Table.from_pylist([])

We could probably even do that preemptively before even attempting to read the data source (which may be remote and slower), but raising an exception during parameter validation seems perhaps a bit better to me. Otherwise, if we just preemptively return an empty data frame / table, then it may out of sync with a read against the actual data source, which may fail for any number of reasons. I.e., it might be better that we don't succeed when reading no geometry / columns / FID when we'd fail otherwise if any of those are true / non-empty.

Thoughts?

kylebarron commented 11 months ago

Otherwise, if we just preemptively return an empty data frame / table, then it may out of sync with a read against the actual data source, which may fail for any number of reasons. I.e., it might be better that we don't succeed when reading no geometry / columns / FID when we'd fail otherwise if any of those are true / non-empty

IMO it would make sense to error instead of returning an empty table, as that's more likely to imply user error that columns was empty

jorisvandenbossche commented 11 months ago

Alternatively, at the end of read_arrow we could detect if the arrow table has no columns after attempting to read it, and then return a truly empty table:

I personally wouldn't do this, as one could argue that the the current result with use_arrow=True is more correct than a "truly" empty table. Because you did read all rows, just no columns ..

But certainly fine with detecting this case and raising an error. I think it will typically indeed be a user error.

theroggy commented 11 months ago

I also can't really think of use cases where reading no columns nor a geometry from a file would lead to something useful... so I agree it smells like a user error...

When fid_as_index=True you could return some data even though no columns nor geometry is asked, but it still can't really think of a use for this either...