geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License
260 stars 22 forks source link

Documentation on where and SQL syntax #276

Closed tfardet closed 10 months ago

tfardet commented 10 months ago

I might have missed something, but the documentation link for the where keyword did not get me very far. From what I tested, it seems to work like in GDAL, so this link could be more helpful?

Something that would also be nice is an additional explanation on how to deal with:

theroggy commented 10 months ago

I might have missed something, but the documentation link for the where keyword did not get me very far. From what I tested, it seems to work like in GDAL, so this link could be more helpful?

Indeed, the information in the page currently being linked to isn't up-to-date anymore. It states e.g. that NOT and LIKE are not supported, but a quick test shows that they are.

I tested some specific things mentioned on the OGRSQL page (the page you suggest to link to) and you are right that that seems to contain a better and more and up-to-date description of the currently supported syntax.

The reason that page is being linked to is because this is the page being linked to in the relevant gdal C API documentation as well, so I'll log an issue in the gdal issue tracker to suggest the link being changed there as well.

EDIT: Looking further, it is important to note that the the SQL WHERE syntax to be used depends on the data source being read, the same way this works when you use the sql parameter with sql_dialect=None. However, using the where parameter you apparently cannot control the SQL dialect to be used.

For reference, the script I used to do some validations on the supported syntax:

from pathlib import Path
import warnings
import pyogrio

# Ignore all warnings
warnings.simplefilter("ignore")

url_shp = "https://github.com/theroggy/pysnippets/raw/main/pysnippets/pyogrio/polygon-parcel_31370.zip"
url_gpkg = "https://github.com/geofileops/geofileops/raw/main/tests/data/polygon-parcel.gpkg"
wheres = [
    None,
    "LBLHFDTLT LIKE 'Gras%'",
    "LBLHFDTLT LIKE 'gras%'",
    "LBLHFDTLT ILIKE 'gras%'",
    "LBLHFDTLT NOT LIKE 'Gras%'",
    "LBLHFDTLT != 'Grasklaver'",
    "LBLHFDTLT IN ('Hoofdgebouwen', 'Grasklaver')",
    f"ST_Area({{geometrycolumn}}) > 1000",
]

for where in wheres:
    for url in [url_shp, url_gpkg]:
        for sql_dialect in [None, "OGRSQL", "SQLITE"]:
            where_f = where
            if where is not None:
                geometrycolumn = "geom" if url.endswith(".gpkg") else "geometry"
                where_f = where.format(geometrycolumn=geometrycolumn)
            try:
                df = pyogrio.read_dataframe(url, where=where_f)
                # print(f"\nnb_rows with where: {where}: {len(df)}")
                # print(df["LBLHFDTLT"].unique())
            except Exception as ex:
                name = Path(url).name
                print(f"Error, where={where_f}, sql_dialect={sql_dialect} on {name}")
theroggy commented 10 months ago

Something that would also be nice is an additional explanation on how to deal with:

As a general remark, because we'll be changing the link to point to the OGRSQL page, it will probably be the way to go to add any additional information regarding the syntax to use to that page and or to the SQLITE sql dialect page. Because these pages are part of the gdal documentation, to get changes applied to it an issue will have to be opened in the gdal issue tracker.

  • booleans (should one use 0/1, true/false, True/False, something else?)

Specifically for this remark: I'm not sure if there is a generic answer to your questions/suggestions for clarification. As you probably know, gdal supports a lot of file formats and many file formats don't even support booleans at all. In that case you need to fallback to using an INTEGER field with e.g. 0/1 or a string with 'TRUE' or 'FALSE' or .... I didn't do any testing and I might be wrong, but I think it is likely that the only possible answer will be: it depends on the file format and/or the way it is saved in the file.

So personally I think the best way forward would be if you could do some tests on different file formats regarding booleans. Based on those experiences you could open an issue in the gdal issue tracker to suggest some clarifications, if possible.

  • potential None/Null/NaN differences and how to deal with them

Because OGRSQL is an sql dialect, the syntax that is being described there in the OGRSQL documentation is to use "IS NULL" and "IS NOT NULL". As far as I know, the distinction between None and Nan is not a thing in SQL and they both map to NULL.

theroggy commented 10 months ago

@tfardet the documentation of the where parameter of read_dataframe will be updated in the "stable" documentation once a new version of pyogrio will be released. However, you can already have a preview at the "latest" version with this link: https://pyogrio.readthedocs.io/en/latest/api.html#pyogrio.read_dataframe

I'll close this issue now, but if you think there are still things to be done on the pyogrio side, feel free to reopen it.

tfardet commented 10 months ago

Looks good, thanks for your detailed answers!