geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License
272 stars 22 forks source link

Large OSM files require different approach to reading features #272

Open brendan-ward opened 1 year ago

brendan-ward commented 1 year ago

271 tries to count records directly from an OSM data source, which appears to work for small files, but produces inconsistent results for larger datasets (e.g., Alabama), which appears to be an artifact of layers being an artificial construct within OSM files rather than a meaningful structure within the dataset.

According to the GDAL docs, by default too many features may accumulate for a given layer while attempting to read a different layer, and produce an error. If you then enable interleaved reading, this avoids the error, but iterating over features may produce a count of 0 instead of the true amount.

From the example for interleaved reading in the GDAL docs, it is not immediately obvious how to alter our reading workflow. Perhaps we need to determine the index of the requested layer, and read all features of all previous layers first? ogr2ogr appears to do the right thing here, so we should look at that for reference.

brendan-ward commented 1 year ago

Looks like the trick is to detect that the data source is OSM with interleaved reading, and then execute an SQL statement to set the specific layer of interest.

We can use OGR_DS_ExecuteSQL to do this after opening the data source.

theroggy commented 1 year ago

I suppose use_arrow by default doesn't treat this any better, or does it?

-> based on my own test in the context of #269, use_arrow has the same problems except that it is 2 x as fast, most likely because it probably doesn't count the features first before fetching them...

tfardet commented 1 year ago

I can confirm that either #271 or some other recent commit solves the issue for me (main is working while 0.6.0 gave me empty dataframes).

As mentioned in #269, use_arrow=True is still necessary, though.

brendan-ward commented 1 year ago

271 has been merged into main (should have resolved this as part of that); it should no longer be necessary for use_arrow=True; per manual testing it has been verified to work for larger OSM files on MacOS and Windows.

If it isn't working yet for you, please share more details about what part isn't working.

tfardet commented 1 year ago

Ah right, the issue without arrow is associated to a DataLayerError: Could not iterate over features: Non closed ring detected. because OGR_GEOMETRY_ACCEPT_UNCLOSED_RING was set to "NO", so unrelated, sorry for the noise. You can probably close this issue then :)

EDIT: on second thought, is it normal that the behavior differs with and without arrow? Should I open an issue about this?

jorisvandenbossche commented 1 year ago

It might still be an issue to report to GDAL to also honor that setting when exporting to Arrow.

theroggy commented 1 year ago

Ah right, the issue without arrow is associated to a DataLayerError: Could not iterate over features: Non closed ring detected. because OGR_GEOMETRY_ACCEPT_UNCLOSED_RING was set to "NO", so unrelated, sorry for the noise. You can probably close this issue then :)

EDIT: on second thought, is it normal that the behavior differs with and without arrow? Should I open an issue about this?

@tfardet I wonder... OGR_GEOMETRY_ACCEPT_UNCLOSED_RING is not supposed to give an error by default... it should only give a warning. Is it possible you set OGR_GEOMETRY_ACCEPT_UNCLOSED_RING="NO" somewhere? Maybe in a environment variable?

import pyogrio

print(
    "OGR_GEOMETRY_ACCEPT_UNCLOSED_RING: "
    f"{pyogrio.get_gdal_config_option('OGR_GEOMETRY_ACCEPT_UNCLOSED_RING')}"
)

Result:

OGR_GEOMETRY_ACCEPT_UNCLOSED_RING: None
tfardet commented 1 year ago

@tfardet I wonder... OGR_GEOMETRY_ACCEPT_UNCLOSED_RING is not supposed to give an error by default... it should only give a warning. Is it possible you set OGR_GEOMETRY_ACCEPT_UNCLOSED_RING="NO" somewhere? Maybe in a environment variable?

Yes, I set it manually to get rid of the warning about the non closed ring with Arrow, where it seems to simply discard the related geometry. But it raises an error without Arrow.

theroggy commented 1 year ago

Ok, I understand now.

The variable name implies that you specify if you want to Accept unclosed rings or not. So if you set it to "NO", this means you don't Accept them and hence an error is raised. If you want to get rid of the warning without error you should set it to "YES", not "NO".

The inconsistent behaviour/bug is in the arrow implementation: if you set OGR_GEOMETRY_ACCEPT_UNCLOSED_RING="NO" it should also give and error if consistency is wanted.

theroggy commented 1 year ago

@tfardet do you open an issue in https://github.com/OSGeo/gdal/issues for this?