CaptainInler closed this issue 10 months ago
Thanks for reporting this! I am able to reproduce the error locally.
One way to sidestep this is to use the `use_arrow=True` option on `read_file`.
The problem appears to be that GDAL is returning -1 for the number of features in each layer; `ogrinfo` also reports -1 for the feature count.
We use the feature count in various places to allocate arrays that we then populate while iterating over features. GDAL returns -1 here even when we ask it to compute the count the slow way (by iterating over all features) for drivers that don't support a fast count, and this driver does not.
The only idea I'm coming up with for this is that if we get back -1 from GDAL, we do our own loop over all records first to get the count, use that to allocate the arrays, and then iterate over all the records again. That seems pretty inefficient, but it may be no different from what GDAL would normally do to get a count of records.
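As a rough sketch of that two-pass idea (this is only an illustration, not pyogrio's actual Cython internals; `open_layer` is a hypothetical callable that returns a fresh iterator over the layer's features on each call):

```python
def read_features_two_pass(open_layer):
    """Read a layer whose feature count is unknown (GDAL returned -1).

    Pass 1 iterates over all features just to count them; pass 2
    populates a pre-allocated array of exactly that size.
    """
    # Pass 1: count the features the slow way.
    count = sum(1 for _ in open_layer())

    # Allocate fixed-size storage now that the count is known.
    features = [None] * count

    # Pass 2: re-open the layer and populate the allocated array.
    for i, feature in enumerate(open_layer()):
        features[i] = feature
    return features
```

The obvious cost is reading every record twice, which is why this hurts so much over a remote URL.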
Related StackOverflow post from 9(!) years ago.
Looking at the GDAL source code, feature count is specifically disabled for this driver, probably because of the performance impact of iterating over all features in the OSM format.
> One way to sidestep this is to use the `use_arrow=True` option on `read_file`.
As far as I can see, this loads only elements with Point geometries..
Never mind.
```python
import geopandas as gpd

url = "https://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf"
gdf = gpd.read_file(url, engine="pyogrio", layer="lines", use_arrow=True)
```

does the trick.. :sweat_smile:
@brendan-ward I tried to load this file:

```python
import pyogrio

url = "http://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf"
pgdf = pyogrio.read_dataframe(url, use_arrow=True, layer="multipolygons")
```

but this returns an empty dataframe:
```
>>> pgdf.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 0 entries
Data columns (total 26 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   osm_id       0 non-null      object
 1   osm_way_id   0 non-null      object
 2   name         0 non-null      object
 3   type         0 non-null      object
 4   aeroway      0 non-null      object
 5   amenity      0 non-null      object
 6   admin_level  0 non-null      object
 7   barrier      0 non-null      object
 8   boundary     0 non-null      object
 9   building     0 non-null      object
 10  craft        0 non-null      object
 11  geological   0 non-null      object
 12  historic     0 non-null      object
 13  land_area    0 non-null      object
 14  landuse      0 non-null      object
 15  leisure      0 non-null      object
 16  man_made     0 non-null      object
 17  military     0 non-null      object
 18  natural      0 non-null      object
 19  office       0 non-null      object
 20  place        0 non-null      object
 21  shop         0 non-null      object
 22  sport        0 non-null      object
 23  tourism      0 non-null      object
 24  other_tags   0 non-null      object
 25  geometry     0 non-null      geometry
dtypes: geometry(1), object(25)
memory usage: 132.0+ bytes
```
Doing the same with `url = "https://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf"` returns a filled dataframe.
Also, when downsizing the file with

```shell
osmium tags-filter baden-wuerttemberg-latest.osm.pbf a/admin_level -o baden-wuerttemberg-admin.osm.pbf
```

the output returns a filled dataframe.
Is this related to the original issue?
Based on a lead found by @brendan-ward in #272, I gave the following snippet a try, and it seems to work, even though it crashes on my laptop without the LIMIT clause because I don't have enough memory to load this entire file:

```python
import pyogrio

url = "http://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf"
pgdf = pyogrio.read_dataframe(url, use_arrow=True, sql="SELECT * FROM multipolygons LIMIT 100")
```
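If memory is the constraint, one could extend the LIMIT idea into chunked reads with LIMIT/OFFSET queries. This is only a sketch: `read_fn` is a hypothetical callable that takes an SQL string and returns a list-like of records, standing in for a call such as `pyogrio.read_dataframe(url, use_arrow=True, sql=...)`:

```python
def iter_chunks(read_fn, layer, chunk_size=100):
    """Yield successive chunks of a layer via LIMIT/OFFSET queries.

    Stops when a query returns no records. Note that with the OSM
    driver each query may itself rescan the file, so this trades
    memory for repeated I/O.
    """
    offset = 0
    while True:
        sql = f"SELECT * FROM {layer} LIMIT {chunk_size} OFFSET {offset}"
        chunk = read_fn(sql)
        if len(chunk) == 0:
            break  # no more records in this layer
        yield chunk
        offset += chunk_size
```

Each yielded chunk can then be processed or written out before the next one is fetched, keeping peak memory bounded by the chunk size.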
@CaptainInler I believe I have a fix for this now in #271; it uses a similar approach to `ogr2ogr` to set the layers to read from the file, which works both with `use_arrow=True` and without. It works properly for the larger dataset you specified above, as well as for other large files I've tested.
However, because it has to do 2 passes over a lot of records in the OSM file, not all of which are in the layer you read, it is slow to read from a remote URL. I added some additional recommendations around this as part of #271, but in short, I highly recommend downloading the file first.
I tried this, and it returned this error:
Since the traceback so kindly asks to open an issue, I could not resist doing so... :smiley: