geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

Prevent tags from ending in "other_tags" for osm.pbf with ``read_dataframe`` #419

Open tfardet opened 3 weeks ago

tfardet commented 3 weeks ago

At the moment, not all tags are imported from an osm.pbf file with read_dataframe. In particular, I'm interested in "building:levels", which currently ends up in "other_tags" rather than getting its own column (maybe because not all lines have this entry?).

Is there a way to tell pyogrio to include this tag as an independent column? I could extract it from the data in the "other_tags" column (e.g. "building:levels"=>"1") but this is probably going to be much slower than if it's done directly on import.

EDIT: in case anyone is interested in a workaround in the meantime

# missing entries will be either NaN or None
building_levels = df.other_tags.str.extract(r'"building:levels"=>"(\d+)"', expand=False)
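
For more than one tag, the same idea can be generalized by splitting each "other_tags" value into a dictionary. The following is only a minimal sketch using the standard library: `parse_other_tags` is a hypothetical helper name, it assumes values follow the `"key"=>"value"` form shown above, and it does not unescape backslash sequences inside keys or values.

```python
import re

# Matches one "key"=>"value" pair in an other_tags string; backslash-escaped
# characters inside the quotes are tolerated but not unescaped.
_PAIR = re.compile(r'"((?:[^"\\]|\\.)*)"=>"((?:[^"\\]|\\.)*)"')


def parse_other_tags(text):
    """Return a dict of tags from one other_tags value (hypothetical helper)."""
    if not isinstance(text, str):  # None or NaN for rows without extra tags
        return {}
    return {key: value for key, value in _PAIR.findall(text)}


tags = parse_other_tags('"building:levels"=>"2","roof:shape"=>"flat"')
levels = tags.get("building:levels")  # value is kept as a string
```

On a dataframe this could be applied with something like `df.other_tags.map(parse_other_tags)`, though it runs in pure Python and is therefore likely slower than the vectorized `str.extract` call above.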

rouault commented 1 week ago

Is there a way to tell pyogrio to include this tag as an independent column?

This is purely an OGR OSM driver topic. You can tune its configuration: see https://gdal.org/drivers/vector/osm.html#configuration

tfardet commented 1 week ago

OK, if I understand correctly, I would need to add "building:levels" to the "attributes" in the osmconf.ini file, right?

However, that means manual changes to a config file, meaning that I cannot rely on that for code that will be distributed to others. Is there a way to achieve this GDAL configuration programmatically via pyogrio or some other python library?

rouault commented 1 week ago

I would need to add "building:levels" to the "attributes" in the osmconf.ini file, right?

yes
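
To make the change concrete, here is a rough sketch of adding a tag to the "attributes" line of one layer section by editing the ini text directly. `add_attribute` is a hypothetical helper, and the `sample` string is an abbreviated stand-in, not the real contents of osmconf.ini; the actual file has more sections and keys.

```python
def add_attribute(ini_text, section, tag):
    """Append `tag` to the attributes= line of `section` (hypothetical helper)."""
    out, current = [], None
    for line in ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Track which layer section we are currently in.
            current = stripped[1:-1]
        elif current == section and stripped.startswith("attributes="):
            values = [v for v in stripped[len("attributes="):].split(",") if v]
            if tag not in values:
                values.append(tag)
            line = "attributes=" + ",".join(values)
        out.append(line)
    return "\n".join(out) + "\n"


# Abbreviated stand-in for osmconf.ini, not the real file contents.
sample = """\
closed_ways_are_polygons=building,landuse

[lines]
attributes=name,highway

[multipolygons]
attributes=name,building
other_tags=yes
"""

print(add_attribute(sample, "multipolygons", "building:levels"))
```

Only the "attributes" line of the requested section is touched; everything else, including the keys that sit above the first section header, is passed through unchanged.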

I'm not super familiar with pyogrio, but looking at https://github.com/geopandas/pyogrio/blob/af292e579572a6a33a22a6403873b8e7b0a9d7f6/docs/source/known_issues.md?plain=1#L94 and given that the config file can be passed as an open option (https://gdal.org/drivers/vector/osm.html#open-options)

I assume you could do something like df = read_dataframe(path, CONFIG_FILE="/path/to/your/osmconf.ini")

with your code creating a temporary file

That could be a "/vsimem/" in-memory file (https://gdal.org/user/virtual_file_systems.html#vsimem-in-memory-files)

If you use GDAL Python bindings, then you can create it with something like

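from osgeo import gdal
from pyogrio import read_dataframe

# Write the custom configuration to an in-memory /vsimem/ file.
f = gdal.VSIFOpenL("/vsimem/osmconf.ini", "wb")
data = b"put here content of osmconf.ini"
gdal.VSIFWriteL(data, 1, len(data), f)
gdal.VSIFCloseL(f)

# path is the .osm.pbf file to read.
df = read_dataframe(path, CONFIG_FILE="/vsimem/osmconf.ini")

# Remove the in-memory file once the data has been read.
gdal.Unlink("/vsimem/osmconf.ini")

tfardet commented 1 week ago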

Thanks, I'll check whether this config file argument works with a custom ini file

ajnisbet commented 1 week ago

I ran into this exact issue today! I didn't try the vsimem approach, but was able to use a regular temporary file.

I wanted to modify a copy of the system ini file programmatically rather than editing it manually. Python's configparser won't eat an ini file without a top-level header, so I had to add one pre-read then remove it post-write.

import configparser
import io
import tempfile

import geopandas as gpd

with tempfile.NamedTemporaryFile("w", suffix=".ini") as f_tmp:

    # Prefix the file with a toplevel header, then pass to configparser.
    dummy_header = "[dummy_toplevel_header]\n"
    config = configparser.ConfigParser()
    with open("/usr/share/gdal/osmconf.ini") as f_config:
        stream = io.StringIO(dummy_header + f_config.read())
        config.read_file(stream)

    # Set the config for the building layer: add levels and related tags, remove other_tags.
    config["multipolygons"]["attributes"] = "name,building:levels,levels,height,min_height,max_height"
    config["multipolygons"]["other_tags"] = "no"

    # Write to temp file.
    config.write(f_tmp, space_around_delimiters=False)
    f_tmp.flush()

    # Remove the first line dummy header.
    with open(f_tmp.name, "r") as f:
        lines = f.readlines()
    with open(f_tmp.name, "w") as f:
        assert lines[0] == dummy_header
        f.writelines(lines[1:])

    # Now you can read in the file.
    gdf = gpd.read_file(osm_pbf_path, layer="multipolygons", CONFIG_FILE=f_tmp.name)

Result:

[image]

rouault commented 1 week ago

Python's configparser won't eat an ini file without a top-level header, so I had to add one pre-read then remove it post-write.

will be fixed per https://github.com/OSGeo/gdal/pull/10293