geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

read osm.pbf #269

Closed: CaptainInler closed this 10 months ago

CaptainInler commented 11 months ago

I tried this:

import geopandas as gpd
url = "https://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf"
gdf = gpd.read_file(url, engine="pyogrio")

It returned this error:

/opt/conda/lib/python3.11/site-packages/pyogrio/raw.py:137: UserWarning: Layer 'b'points'' does not have any features to read
  result = ogr_read(

---------------------------------------------------------------------------
FeatureError                              Traceback (most recent call last)
Cell In [3], line 1
----> 1 gdf = gpd.read_file(url, engine="pyogrio")

File /opt/conda/lib/python3.11/site-packages/geopandas/io/file.py:271, in _read_file(filename, bbox, mask, rows, engine, **kwargs)
    268             from_bytes = True
    270 if engine == "pyogrio":
--> 271     return _read_file_pyogrio(filename, bbox=bbox, mask=mask, rows=rows, **kwargs)
    273 elif engine == "fiona":
    274     if pd.api.types.is_file_like(filename):

File /opt/conda/lib/python3.11/site-packages/geopandas/io/file.py:427, in _read_file_pyogrio(path_or_bytes, bbox, mask, rows, **kwargs)
    424     kwargs["read_geometry"] = False
    426 # TODO: if bbox is not None, check its CRS vs the CRS of the file
--> 427 return pyogrio.read_dataframe(path_or_bytes, bbox=bbox, **kwargs)

File /opt/conda/lib/python3.11/site-packages/pyogrio/geopandas.py:149, in read_dataframe(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, fids, sql, sql_dialect, fid_as_index, use_arrow, **kwargs)
    146 path_or_buffer = _stringify_path(path_or_buffer)
    148 read_func = read_arrow if use_arrow else read
--> 149 result = read_func(
    150     path_or_buffer,
    151     layer=layer,
    152     encoding=encoding,
    153     columns=columns,
    154     read_geometry=read_geometry,
    155     force_2d=force_2d,
    156     skip_features=skip_features,
    157     max_features=max_features,
    158     where=where,
    159     bbox=bbox,
    160     fids=fids,
    161     sql=sql,
    162     sql_dialect=sql_dialect,
    163     return_fids=fid_as_index,
    164     **kwargs,
    165 )
    167 if use_arrow:
    168     meta, table = result

File /opt/conda/lib/python3.11/site-packages/pyogrio/raw.py:137, in read(path_or_buffer, layer, encoding, columns, read_geometry, force_2d, skip_features, max_features, where, bbox, fids, sql, sql_dialect, return_fids, **kwargs)
    134 dataset_kwargs = _preprocess_options_key_value(kwargs) if kwargs else {}
    136 try:
--> 137     result = ogr_read(
    138         path,
    139         layer=layer,
    140         encoding=encoding,
    141         columns=columns,
    142         read_geometry=read_geometry,
    143         force_2d=force_2d,
    144         skip_features=skip_features,
    145         max_features=max_features or 0,
    146         where=where,
    147         bbox=bbox,
    148         fids=fids,
    149         sql=sql,
    150         sql_dialect=sql_dialect,
    151         return_fids=return_fids,
    152         dataset_kwargs=dataset_kwargs,
    153     )
    154 finally:
    155     if buffer is not None:

File /opt/conda/lib/python3.11/site-packages/pyogrio/_io.pyx:1046, in pyogrio._io.ogr_read()

File /opt/conda/lib/python3.11/site-packages/pyogrio/_io.pyx:745, in pyogrio._io.get_features()

FeatureError: GDAL returned more records than expected based on the count of records that may meet your combination of filters against this dataset.  Please open an issue on Github (https://github.com/geopandas/pyogrio/issues) to report encountering this error.

Since the traceback so kindly asks to open an issue, I could not resist doing so... :smiley:

brendan-ward commented 11 months ago

Thanks for reporting this! I am able to reproduce the error locally.

One way to sidestep this is to use the use_arrow=True option on read_file.
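For example, a minimal variant of the snippet above (note that without layer= this reads the file's first layer, points):

import geopandas as gpd

url = "https://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf"
# use_arrow=True sidesteps the failing feature count described above
gdf = gpd.read_file(url, engine="pyogrio", use_arrow=True)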

The problem appears to be that GDAL is returning -1 for the number of features in each layer; ogrinfo also returns -1 for the feature count.

We use the feature count in various places to allocate arrays that we then populate while iterating over features. We get -1 back even when we ask GDAL to compute the count the slow way (by iterating over all features), which we do when the driver doesn't support a fast count (this driver does not).

The only idea I'm coming up with for this is that if we get back a -1 from GDAL, we do our own loop over all records first to get the count, use that to allocate arrays, and then iterate over all the records again. That seems pretty inefficient, but maybe it is no different from what GDAL would normally do to get a count of records.
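A rough sketch of that fallback, using the GDAL Python bindings (osgeo.ogr) rather than pyogrio's internals; the helper name is hypothetical:

from osgeo import ogr

def count_features_slow(path, layer_name):
    # Hypothetical helper illustrating the two-pass idea described above.
    ds = ogr.Open(path)
    layer = ds.GetLayerByName(layer_name)
    count = layer.GetFeatureCount(0)  # force=0: fast count only; OSM driver returns -1
    if count == -1:
        # First pass: iterate over every feature just to count them.
        count = sum(1 for _ in layer)
        layer.ResetReading()  # rewind so the real read can make its second pass
    return count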

brendan-ward commented 11 months ago

Related StackOverflow post from 9(!) years ago.

Looking at the GDAL source code, the feature count is specifically disabled for this driver, probably because of the performance cost of iterating over all features in the OSM format.

CaptainInler commented 11 months ago

One way to sidestep this is to use the use_arrow=True option on read_file.

As far as I can see, this loads only elements with point geometries.

Never mind.

import geopandas as gpd
url = "https://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf"
gdf = gpd.read_file(url, engine="pyogrio", layer="lines", use_arrow=True)

does the trick. :sweat_smile:
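As a side note, the available layers can be listed up front with pyogrio; GDAL's OSM driver exposes five layers (points, lines, multilinestrings, multipolygons, other_relations):

import pyogrio

url = "https://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf"
# Each row pairs a layer name with its geometry type.
print(pyogrio.list_layers(url))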

CaptainInler commented 11 months ago

@brendan-ward I'm trying to load this file:

import pyogrio
url = "http://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf"
pgdf = pyogrio.read_dataframe(url, use_arrow=True, layer="multipolygons")

but this returns an empty dataframe:

pgdf.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 0 entries
Data columns (total 26 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   osm_id       0 non-null      object  
 1   osm_way_id   0 non-null      object  
 2   name         0 non-null      object  
 3   type         0 non-null      object  
 4   aeroway      0 non-null      object  
 5   amenity      0 non-null      object  
 6   admin_level  0 non-null      object  
 7   barrier      0 non-null      object  
 8   boundary     0 non-null      object  
 9   building     0 non-null      object  
 10  craft        0 non-null      object  
 11  geological   0 non-null      object  
 12  historic     0 non-null      object  
 13  land_area    0 non-null      object  
 14  landuse      0 non-null      object  
 15  leisure      0 non-null      object  
 16  man_made     0 non-null      object  
 17  military     0 non-null      object  
 18  natural      0 non-null      object  
 19  office       0 non-null      object  
 20  place        0 non-null      object  
 21  shop         0 non-null      object  
 22  sport        0 non-null      object  
 23  tourism      0 non-null      object  
 24  other_tags   0 non-null      object  
 25  geometry     0 non-null      geometry
dtypes: geometry(1), object(25)
memory usage: 132.0+ bytes

Doing the same with url = "https://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf" returns a filled dataframe.
Likewise, when downsizing the file first,

osmium tags-filter baden-wuerttemberg-latest.osm.pbf a/admin_level -o baden-wuerttemberg-admin.osm.pbf

the resulting file also returns a filled dataframe.

Is this related to the original issue?

theroggy commented 11 months ago

Based on a lead found by @brendan-ward in #272, I gave the following snippet a try, and it seems to work. Without the LIMIT clause it crashes on my laptop because I don't have enough memory to load this entire file:

import pyogrio
url = "http://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf"
pgdf = pyogrio.read_dataframe(url, use_arrow=True, sql="SELECT * FROM multipolygons LIMIT 100")
brendan-ward commented 11 months ago

@CaptainInler I believe I have a fix for this now in #271; it uses an approach similar to ogr2ogr's to set the layers to read from the file, which works both with use_arrow=True and without. It works properly for the larger dataset you specify above as well as other large files I've tested.

However, because it has to make two passes over a lot of records in the OSM file, not all of which are in the layer you read, it is slow when reading from a remote URL. I added some additional recommendations around this as part of #271, but in short, I highly recommend you download the file first.
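A minimal sketch of that recommendation (the local file name is just an example):

import urllib.request
import pyogrio

url = "http://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf"
local_path = "baden-wuerttemberg-latest.osm.pbf"  # any local path works
urllib.request.urlretrieve(url, local_path)  # download once, then read locally
gdf = pyogrio.read_dataframe(local_path, layer="multipolygons", use_arrow=True)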