geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

Add support for fids filter with use_arrow=True #304

Closed theroggy closed 2 months ago

theroggy commented 9 months ago

Add support for the fids filter keyword.

Resolves #301

Note1: there is another test that can be enabled for use_arrow=True, but it depends on the force_2d support PR (#300) being merged.

Note2: when reading without use_arrow=True, for the test case we use, GDAL seems to return the data in the order the fids were supplied. When using use_arrow=True, the rows are returned in ascending order of the fids, or rather (at least for DB-oriented data sources) probably in the order they are written on disk.

The order in which rows are returned is typically not guaranteed unless you explicitly ask for an ORDER BY, at least for DB-like sources, so in my opinion it is always a risk to depend on it. However, if the idea is that users rely on the order in which data is returned when using the fids filter, or if having the order guaranteed is useful in this case, sorting could be added at the geopandas level or in pyogrio.

theroggy commented 9 months ago

I added some info about the order of the rows returned + how sorting can be applied to the docstring.
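
As an illustration of the kind of post-read sorting referred to (a minimal sketch; "parcels.gpkg" and the fid values are hypothetical):

import pyogrio

fids = [500_000, 1, 100_000]  # deliberately not in ascending order
df = pyogrio.read_dataframe("parcels.gpkg", fids=fids, fid_as_index=True, use_arrow=True)
# Rows may come back in storage order rather than in the order the fids were
# supplied; reindexing on the fid index restores the requested order.
df = df.loc[fids]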

jorisvandenbossche commented 9 months ago

I think we should do some benchmarking here to see how good this approach actually is (I did some benchmarks for the non-arrow use case originally, when working on the support for reading by FIDs; I will try to dig them up).

Because if this is slower, then I am not sure we should provide this feature (if a user wants this, they can always do the where=f"OGC_FID IN ({','.join(fids.astype(str))})" themselves)
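
For reference, such a workaround could look roughly like this (a sketch only; the fid column name is driver-dependent, and "parcels.gpkg" is a hypothetical file):

import numpy as np
import pyogrio

fids = np.array([1, 100_000, 500_000])
# The fid column name depends on the driver/dialect (e.g. OGC_FID, FID or fid).
where = f"OGC_FID IN ({','.join(fids.astype(str))})"
df = pyogrio.read_dataframe("parcels.gpkg", where=where, fid_as_index=True, use_arrow=True)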

theroggy commented 8 months ago

I did some quick tests... and as is often the case, it depends on the drivers involved.

I think there are generally 3 types of data sources available through GDAL:

  1. binary files where offsets can be used to efficiently find a record based on an fid, e.g. shapefile
  2. databases or database-like files, where indexes can be used, e.g. Geopackage
  3. text files, where the entire file needs to be read to find a specific record, e.g. GeoJson

For each individual driver there will still be significant differences depending on the (lack of) optimizations implemented and the inherent efficiency of the format, but I think testing these 3 will give a reasonable idea of what to expect in general.

Results:

  1. binary files: here it depends on how the sql statement is written:
    • if an sql statement with an fid IN (...) construct is used, the performance penalty compared to the non-arrow implementation is large: a factor of 50 or more slower for shapefile.
    • if the sql statement is written as UNION ALLs with LIMIT ... OFFSET, it is 2x as fast as the non-arrow implementation, but it is so fast (0.042 s vs 0.098 s) that my rough test isn't good enough to draw any more conclusions than: it is faster. It must be noted that this is a bit of a hacky way to implement it: it works and it is fast, but it isn't really pretty. At the moment the FIDs returned also aren't correct yet, but this could be solved.
  2. databases or database-like files, where indexes can be used, e.g. Geopackage
    • if an sql statement with an fid IN (...) construct is used, it is 4x faster (but the timings are also too short for big conclusions).
    • if the sql statement is written as UNION ALLs with LIMIT ... OFFSET, it is 10x slower
  3. text files, where the entire file needs to be read to find a specific record, e.g. GeoJson
    • here all implementations are roughly the same: they are all terribly slow.

For the shapefile performance, it seems that the fid IN (...) filter isn't really optimized for shapefiles, but LIMIT ... OFFSET is. As both are roughly functionally equivalent, I think fid IN (...) could be optimized as well, but I'm not sure it would get a high priority.

I used the files on this page for my tests: https://landbouwcijfers.vlaanderen.be/open-geodata-landbouwgebruikspercelen

Test script used:

from datetime import datetime
from pathlib import Path
import pyogrio

shp_path = r"C:\Temp\Landbouwgebruikspercelen_2023_-_Voorlopig_(extractie_26-06-2023)_SHP\Landbouwgebruikspercelen_2023_-_Voorlopig_(extractie_26-06-2023).shp"
gpkg_path = r"C:\Temp\Landbouwgebruikspercelen_2023_-_Voorlopig_(extractie_26-06-2023)_GPKG\Landbouwgebruikspercelen_2023_-_Voorlopig_(extractie_26-06-2023).gpkg"
fgb_path = (
    r"C:\Temp\Landbouwgebruikspercelen_2023_-_Voorlopig_(extractie_26-06-2023).fgb"
)
geojson_path = (
    r"C:\Temp\Landbouwgebruikspercelen_2023_-_Voorlopig_(extractie_26-06-2023).geojson"
)

# Generate the FGB and GeoJSON variants from the GPKG source if they don't exist yet.
if not Path(fgb_path).exists():
    input_df = pyogrio.read_dataframe(gpkg_path)
    pyogrio.write_dataframe(input_df, fgb_path)
if not Path(geojson_path).exists():
    input_df = pyogrio.read_dataframe(gpkg_path)
    pyogrio.write_dataframe(input_df, geojson_path)

# For each format: time a fids= read, an equivalent UNION ALL + LIMIT/OFFSET SQL
# query via use_arrow, and a "fid IN (...)" where filter via use_arrow.
for path in [shp_path, gpkg_path, geojson_path]:
    path = Path(path)
    start = datetime.now()
    df = pyogrio.read_dataframe(path, fids=[1, 100_000, 500_000], fid_as_index=True)
    print(
        f"read {path.suffix} with fids took {datetime.now()-start}, fids={df.index.values}, refids={df.REF_ID.values}"
    )

    if path.suffix == ".gpkg":
        sql = f"""
            SELECT * FROM (SELECT 1 AS FID, * FROM "{Path(path).stem}" LIMIT 1 OFFSET 1)
            UNION ALL
            SELECT * FROM (SELECT 100000, * FROM "{Path(path).stem}" LIMIT 1 OFFSET 100000)
            UNION ALL
            SELECT * FROM (SELECT 500000, * FROM "{Path(path).stem}" LIMIT 1 OFFSET 500000)
        """
    else:
        sql = f"""
            SELECT FID, * FROM "{Path(path).stem}" LIMIT 1 OFFSET 1
            UNION ALL
            SELECT FID, * FROM "{Path(path).stem}" LIMIT 1 OFFSET 100000
            UNION ALL
            SELECT FID, * FROM "{Path(path).stem}" LIMIT 1 OFFSET 500000
        """
    start = datetime.now()
    df = pyogrio.read_dataframe(path, use_arrow=True, sql=sql, fid_as_index=True)
    print(
        f"read {path.suffix}, use_arrow+sql (LIMIT OFFSET) took {datetime.now()-start}, fids={df.index.values}, refids={df.REF_ID.values}"
    )

    """
    where = "fid = 500000"
    start = datetime.now()
    df = pyogrio.read_dataframe(path, where=where, fid_as_index=True)
    print(
        f"read {path.suffix} with sql = 1 fid took {datetime.now()-start}, fids={df.index.values}, refids={df.REF_ID.values}"
    )

    where = "fid IN (1, 100000, 500000)"
    start = datetime.now()
    df = pyogrio.read_dataframe(path, where=where, fid_as_index=True)
    print(
        f"read {path.suffix} with sql took {datetime.now()-start}, fids={df.index.values}, refids={df.REF_ID.values}"
    )
    """

    where = "fid IN (1, 100000, 500000)"
    start = datetime.now()
    df = pyogrio.read_dataframe(path, use_arrow=True, where=where, fid_as_index=True)
    print(
        f"read {path.suffix}, use_arrow+sql (IN) took {datetime.now()-start}, fids={df.index.values}, refids={df.REF_ID.values}"
    )

Results:

read .shp with fids took 0:00:00.098213, fids=[     1 100000 500000], refids=[9.22212948e+08 8.27970778e+08 1.74964043e+09]
read .shp, use_arrow+sql (LIMIT OFFSET) took 0:00:00.042618, fids=[0 1 2], refids=[9.22212948e+08 8.27970778e+08 1.74964043e+09]
read .shp, use_arrow+sql (IN) took 0:00:06.790157, fids=[     0  99999 499999], refids=[4.87348814e+08 2.21252625e+09 2.19654883e+09]
read .gpkg with fids took 0:00:00.046391, fids=[     1 100000 500000], refids=[ 487348814 2212526247 2196548832]
read .gpkg, use_arrow+sql (LIMIT OFFSET) took 0:00:00.357586, fids=[     1 100000 500000], refids=[ 922212948  827970778 1749640429]
read .gpkg, use_arrow+sql (IN) took 0:00:00.012745, fids=[     1 100000 500000], refids=[ 487348814 2212526247 2196548832]
read .geojson with fids took 0:01:45.354370, fids=[     1 100000 500000], refids=[ 922212948  827970778 1749640429]
read .geojson, use_arrow+sql (LIMIT OFFSET) took 0:01:41.914098, fids=[0 1 2], refids=[ 922212948  827970778 1749640429]
read .geojson, use_arrow+sql (IN) took 0:01:45.597443, fids=[     1 100000 500000], refids=[ 922212948  827970778 1749640429]
theroggy commented 8 months ago

Because if this is slower, then I am not sure we should provide this feature (if a user wants this, they can always do the where=f"OGC_FID IN ({','.join(fids.astype(str))})" themselves)

It is a bit of a fuss to write that query yourself... and feature-parity-wise this is also not ideal, so I personally would provide the feature regardless of possible performance differences and document the circumstances in which it is slower, as we have done in other cases.

brendan-ward commented 8 months ago

I think providing the functionality - since it allows parity between arrow and non-arrow modes - and documenting (briefly) that performance may vary widely by driver seems like a reasonable path forward.

theroggy commented 8 months ago

I think providing the functionality - since it allows parity between arrow and non-arrow modes - and documenting (briefly) that performance may vary widely by driver seems like a reasonable path forward.

That the performance is driver-dependent was apparently already mentioned, but I added that the value of use_arrow can also influence performance.

jorisvandenbossche commented 8 months ago

Sorry for the slow follow-up here on the promised benchmarks: the previous time we briefly discussed this was at https://github.com/geopandas/pyogrio/pull/19#issuecomment-947769038 (and the two comments below that).

It's quite similar to what Pieter shows above, performance being very driver dependent. The main summary of those older comments (and from testing shp, gpkg, geojson and fgb):

  • It's faster for GPKG (compared to our current fids= implementation), but a lot slower for the other tested formats (i.e. essentially the database vs binary/text file mentioned above)
  • For formats other than GPKG, the generated SQL where query will fail when querying more than a few thousand rows ("Invalid SQL query ..."). If we keep this workaround, I think we need to catch that error, to provide a more informative error message to the user.

I think a lot also depends on the use case for the user specifying fids=. If it is for inspecting just a few rows like in the benchmarks Pieter showed above, then this is probably all just fine (even 50x slower for non-database files might be acceptable). But when you actually want to read a significant subset of a large file (e.g. 10% of a big file, which can still be many thousands of rows), which was my original use case to implement this, then this kind of performance is really bad (and it will actually also just error). Also for this kind of use case, I think the UNION ALL approach is not really feasible (it would also create a huge sql query).

Of course, we can indeed document this clearly and leave the choice to the users to use this keyword or not (given that we provide a good error message when it fails). But on the other hand, we could also leave this explicitly to the user (or downstream package), by suggesting as a workaround for the missing fids support to generate such a where query themselves, while noting that this support is also very driver-specific.

jorisvandenbossche commented 8 months ago

As a comparison, for Shapefile, it can actually be faster to read the whole file and filter it after reading (and with the arrow reader, this can be done chunk by chunk, so that memory can still be low), compared to using a where filter (but this is still a lot slower than the non-arrow fid-based reader).
Although that is probably only the case in my synthetic benchmark data where the actual reading of the individual features is fast (single column + points); in real world data with more columns, I suspect the actual reading+filtering will become slower.

I re-ran some of the older benchmarks (+ added the arrow read+filter test): https://nbviewer.org/gist/jorisvandenbossche/f47549ec33edc234ac17b05a0bcaea69
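
The read-then-filter idea could look roughly like this using chunked read_dataframe calls instead of the arrow record-batch reader (a sketch only; the helper name and chunk size are made up):

import pandas as pd
import pyogrio

def read_fids_by_scanning(path, fids, chunk_size=100_000):
    # Read the whole file in fixed-size chunks and keep only the requested fids,
    # so peak memory stays bounded by roughly chunk_size rows.
    wanted = set(fids)
    n_features = pyogrio.read_info(path)["features"]
    parts = []
    for offset in range(0, n_features, chunk_size):
        chunk = pyogrio.read_dataframe(
            path, skip_features=offset, max_features=chunk_size, fid_as_index=True
        )
        parts.append(chunk[chunk.index.isin(wanted)])
    return pd.concat(parts)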

theroggy commented 8 months ago

It's quite similar to what Pieter shows above, performance being very driver dependent. The main summary of those older comments (and from testing shp, gpkg, geojson and fgb):

  • It's faster for GPKG (compared to our current fids= implementation), but a lot slower for the other tested formats (i.e. essentially the database vs binary/text file mentioned above)

For text files it is, at least for my test, ~ the same. But performance and large text GIS files are a contradiction anyway, so not super interesting :-).

  • For formats other than GPKG, the generated SQL where query will fail when querying more than a few thousand rows ("Invalid SQL query ..."). If we keep this workaround, I think we need to catch that error, to provide a more informative error message to the user.

I know Oracle has a fixed limit of max. 1000 elements in an IN. I had a look at the limit in SQLite, but it seemed like the limit is beyond any sensible number. I didn't think of OGRSQL, but apparently the limit there is in the same league as the one imposed by Oracle.

I think a lot also depends on the use case for the user specifying fids=. If it is for inspecting just a few rows like in the benchmarks Pieter showed above, then this is probably all just fine (even 50x slower for non-database files might be acceptable).

This is indeed the use case I was thinking about.

But when you actually want to read a significant subset of a large file (e.g. 10% of a big file, which can still be many thousands of rows), which was my original use case to implement this, then this kind of performance is really bad (and it will actually also just error). Also for this kind of use case, I think the UNION ALL approach is not really feasible (it would also create a huge sql query).

Yes, for OGRSQL-dependent data sources the practical limit seems to lie between 4,000 and 5,000 fids. There are several ways this could be stretched (clustering the fids and using BETWEEN clauses or UNION ALLs, reading the fids in batches, inserting them in a temporary file and joining on it, ...), but I don't think it is relevant at this point to implement such things. If anyone asks for it with a sensible use case, further optimizations are possible and could be considered...
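
Purely as an illustration of the clustering idea, and not part of this PR, grouping consecutive fids into BETWEEN ranges could look like this (the fid column name is driver-dependent):

def fids_to_where(fids, fid_column="FID"):
    # Cluster sorted fids into runs of consecutive values and express each run
    # as a BETWEEN clause, which keeps the number of SQL terms low for dense sets.
    fids = sorted(fids)
    ranges = []
    start = prev = fids[0]
    for fid in fids[1:]:
        if fid == prev + 1:
            prev = fid
            continue
        ranges.append((start, prev))
        start = prev = fid
    ranges.append((start, prev))
    clauses = [
        f"{fid_column} = {lo}" if lo == hi else f"{fid_column} BETWEEN {lo} AND {hi}"
        for lo, hi in ranges
    ]
    return " OR ".join(clauses)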

Do I understand correctly that you use the fids filtering in a practical way for thousands of fids? Could you elaborate on the use case?

Of course, we can indeed document this clearly and leave the choice to the users to use this keyword or not (given that we provide a good error message when it fails). But on the other hand, we could also leave this explicitly to the user (or downstream package), by suggesting as a workaround for the missing fids support to generate such a where query themselves, while noting that this support is also very driver-specific.

I personally think more users will benefit from the feature parity than there will be users who run into the limits imposed by the where implementation, so I'd go for feature parity and a clear error when it breaks.

theroggy commented 8 months ago

As a comparison, for Shapefile, it can actually be faster to read the whole file and filter it after reading (and with the arrow reader, this can be done chunk by chunk, so that memory can still be low), compared to using a where filter (but this is still a lot slower than the non-arrow fid-based reader). Although that is probably only the case in my synthetic benchmark data where the actual reading of the individual features is fast (single column + points); in real world data with more columns, I suspect the actual reading+filtering will become slower.

Same as above: if there are use cases where such an optimization would be useful/needed, this could indeed also be a way to optimize this in pyogrio.

brendan-ward commented 8 months ago

If I recall correctly, part of where this came up in dask-geopandas was the idea of using feature bounds (via pyogrio.read_bounds) to develop a spatial partitioning of the data, and then using fids to actually read the data within a spatial partition (assuming FIDs in the data source are numbered incrementally starting at 0/1 depending on the driver). Depending on the strategy, that could still be several thousand records rather than just a small sample for inspection.
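
For illustration, that partition-by-bounds idea could look roughly like this (a sketch only; the file name and partition extent are hypothetical, and it assumes read_bounds returns (fids, bounds) with bounds rows xmin, ymin, xmax, ymax):

import pyogrio

path = "parcels.gpkg"
xmin, ymin, xmax, ymax = 4.0, 50.5, 4.5, 51.0  # one spatial partition

fids, bounds = pyogrio.read_bounds(path)  # bounds has shape (4, n)
intersects = (
    (bounds[0] <= xmax) & (bounds[2] >= xmin) & (bounds[1] <= ymax) & (bounds[3] >= ymin)
)
partition = pyogrio.read_dataframe(path, fids=fids[intersects], fid_as_index=True)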

with the arrow reader, this can be done chunk by chunk, so that memory can still be low

According to the notebook, this approach was ~10x faster than using the SQL IN clause, which is a compelling alternative and doesn't risk running into the same limits on the maximum number of fids in the IN list. Probably a good idea for us to investigate this a bit more.

theroggy commented 8 months ago

Because in theory the sql fid filtering could be fast, I checked on the feasibility of speeding it up in GDAL a few days ago, and apparently it wasn't too hard, because it has already been implemented: https://github.com/OSGeo/gdal/issues/8590

So a loop that reads the fids in batches of e.g. 1000 using the where filter could be a clean solution...
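
Such a batched read could look roughly like this (a sketch only; the helper name and batch size are made up, and the fid column name is driver-dependent, e.g. "fid" for GPKG):

import pandas as pd
import pyogrio

def read_dataframe_by_fids(path, fids, batch_size=1000, **kwargs):
    # Read the requested fids in batches via a where filter, staying under the
    # OGRSQL IN() limit, and concatenate the results afterwards.
    parts = []
    for i in range(0, len(fids), batch_size):
        batch = fids[i : i + batch_size]
        where = f"FID IN ({','.join(str(fid) for fid in batch)})"
        parts.append(
            pyogrio.read_dataframe(path, where=where, fid_as_index=True, **kwargs)
        )
    return pd.concat(parts)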

jorisvandenbossche commented 8 months ago

Thanks for the GDAL issue! ;)

Do I understand correctly that you use the fids filtering in a practical way for thousands of fids? Could you elaborate on the use case?

If I recall correctly, part of where this came up in dask-geopandas

Indeed, dask-geopandas was the reason I was looking into this FID-based reading. I don't know if anyone is using it in practice, but in theory this allows processing a big shapefile in parallel while ensuring each chunk has a specific spatial extent (although in practice it's probably better to first convert the file to something like Parquet, and only then do the spatial repartitioning). And at the moment I also only implemented the consecutive chunks with skip_features/max_features in the read_file function in dask-geopandas itself, so using it with FIDs is manual work for now.


For this PR, while I am a bit skeptical about needing this (I don't think we necessarily need exact keyword equivalency with or without arrow; we could also better document that you can use the where filter to select on FID), I also won't object to merging this. But to get this in, I think this PR at a minimum still needs 1) to raise a proper error message when there are too many FIDs and you get this invalid sql query (the user doesn't necessarily know this is done under the hood like this), and 2) to document the caveats a bit more explicitly.
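
A rough sketch of the kind of guard meant in 1), loosely modelled on the error text in the later benchmark output (the constant, function name and exact wording are hypothetical, not the PR's actual code):

MAX_OGRSQL_IN_FIDS = 4997  # observed limit for the default 'OGRSQL' dialect

def check_fids(fids, sql_dialect="OGRSQL"):
    # Fail early with an informative message instead of GDAL's "Invalid SQL query".
    if sql_dialect == "OGRSQL" and len(fids) > MAX_OGRSQL_IN_FIDS:
        raise ValueError(
            f"error applying filter for {len(fids)} fids; max. number for drivers "
            f"with default SQL dialect 'OGRSQL' is {MAX_OGRSQL_IN_FIDS}"
        )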

brendan-ward commented 8 months ago

Suggestion: since we're not yet in agreement on the API, and there are ongoing changes in GDAL that may impact some of our direction here, let's defer this until after the 0.7 release.

theroggy commented 7 months ago

New performance test with GDAL 3.8, which includes the optimization for fid IN (...) filtering:

The maximum number of FIDs in the IN is 4,997 for "OGRSQL". For GPKG there is no explicit limit. I did some tests around this, and the limit is on the number of elements, not on the number of digits or the length of the where filter.

Timings:

read .shp with fids=... , use_arrow=False took 0.2473, len(fids) = 4900, 19815.07 rows/s
read .shp with fids=... , use_arrow=False took 0.6951, len(fids) = 14900, 21434.88 rows/s
read .shp with fids=... , use_arrow=True took 0.1695, len(fids) = 4900, 28915.79 rows/s
error applying filter for 14900 fids; max. number for drivers with default SQL dialect 'OGRSQL' is 4997
read .gpkg with fids=... , use_arrow=False took 0.3692, len(fids) = 4900, 13271.88 rows/s
read .gpkg with fids=... , use_arrow=False took 1.0754, len(fids) = 14900, 13854.94 rows/s
read .gpkg with fids=... , use_arrow=True took 0.0675, len(fids) = 4900, 72560.98 rows/s
read .gpkg with fids=... , use_arrow=True took 0.1364, len(fids) = 14900, 109218.18 rows/s
read .fgb with fids=... , use_arrow=False took 0.1920, len(fids) = 4900, 25521.29 rows/s
read .fgb with fids=... , use_arrow=False took 0.5869, len(fids) = 14900, 25389.01 rows/s
read .fgb with fids=... , use_arrow=True took 0.1002, len(fids) = 4900, 48906.84 rows/s
error applying filter for 14900 fids; max. number for drivers with default SQL dialect 'OGRSQL' is 4997

Script:

from pathlib import Path
from timeit import timeit
import warnings
import pyogrio

warnings.filterwarnings("ignore")
shp_path = r"C:\Temp\prc2023\prc2023.shp"
gpkg_path = r"C:\Temp\prc2023\prc2023.gpkg"
fgb_path = r"C:\Temp\prc2023\prc2023.fgb"
geojson_path = r"C:\Temp\prc2023\prc2023.geojson"

if not Path(shp_path).exists():
    input_df = pyogrio.read_dataframe(gpkg_path)
    pyogrio.write_dataframe(input_df, shp_path)
if not Path(fgb_path).exists():
    input_df = pyogrio.read_dataframe(gpkg_path)
    pyogrio.write_dataframe(input_df, fgb_path)
if not Path(geojson_path).exists():
    input_df = pyogrio.read_dataframe(gpkg_path)
    pyogrio.write_dataframe(input_df, geojson_path)

nb_tests = 10
# 4,900 fids stays under the OGRSQL IN() limit of 4,997; 14,900 exceeds it.
fids_max_ogr = list(range(100_000, 104_900))
fids_gt_max_ogr = list(range(100_000, 114_900))
for path in [shp_path, gpkg_path, fgb_path]:
    for use_arrow in [False, True]:
        for fids in [fids_max_ogr, fids_gt_max_ogr]:
            path = Path(path)
            try:
                time = timeit(
                    lambda: pyogrio.read_dataframe(
                        path, fids=fids, fid_as_index=True, use_arrow=use_arrow
                    ),
                    number=nb_tests,
                )
                time /= nb_tests
                print(
                    f"read {path.suffix} with fids=... , use_arrow={use_arrow} "
                    f"took {time:.4f}, len(fids) = {len(fids)}, "
                    f"{len(fids)/time:.2f} rows/s"
                )
            except Exception as ex:
                print(ex)
theroggy commented 2 months ago

Thanks for the updates and your patience here @theroggy .

Given the performance implications and changes on GDAL side, I think we should limit support for this in ogr_open_arrow to GDAL >= 3.8.0 and raise a runtime exception otherwise.

@brendan-ward Or a warning, so we avoid breaking without explicit need?

brendan-ward commented 2 months ago

Or a warning, so we avoid breaking without explicit need?

I guess a warning would be fine, though there is the risk of the user not seeing it and then getting frustrated that things are so slow. But as it is, we are still a ways out from making use_arrow the default, which is more likely to be the source of such issues.

theroggy commented 2 months ago

I guess a warning would be fine, though there is the risk of the user not seeing it and then getting frustrated that things are so slow. But as it is, we are still a ways out from making use_arrow the default, which is more likely to be the source of such issues.

Yes, it depends a bit on the situation whether an error being thrown is more frustrating than things being slow... but in general I'm personally more stressed by things not working at all than by slowness.

I added a warning. I excluded GeoPackage and GeoJSON from this, as my tests above showed that use_arrow=True already has the same or better performance for those formats with GDAL < 3.8.
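
A rough sketch of the kind of check described (function name, arguments and message are hypothetical, not the PR's actual code):

import warnings

def _maybe_warn_slow_fids(driver, gdal_version, use_arrow):
    # Warn when fids are combined with use_arrow=True on GDAL < 3.8, except for
    # drivers where the benchmarks above showed no regression.
    if use_arrow and gdal_version < (3, 8, 0) and driver not in ("GPKG", "GeoJSON"):
        warnings.warn(
            f"Reading by fids with use_arrow=True may be slow for driver {driver!r} "
            "with GDAL < 3.8; consider use_arrow=False or upgrading GDAL.",
            stacklevel=2,
        )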