geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

`skip_features` dramatically increases read time #255

Closed: aw-west-defra closed this issue 1 year ago

aw-west-defra commented 1 year ago

I am attempting to read a few rows of a dataset; however, skip_features takes a very long time when reading later features.
I have no recommendation, but I would like to bring this to your attention.

Example

It took 1h to read:
pyogrio.read_dataframe(filepath, skip_features=1_000_000, max_features=100)
But an equivalent where filter took only 20s:
pyogrio.read_dataframe(filepath, where='fid BETWEEN {} AND {}'.format(1_000_000, 1_000_100))

Table of timeit results

skip_features value    skip_features, max_features    where='fid BETWEEN {} AND {}'
0                      23.3 s                         23.1 s
10,000                 48.8 s                         23.4 s
end                    1h 39min 16s                   24.6 s

Notebook Example

import pyogrio

filepath = '<filepath>'
batch = 100
n = pyogrio.read_info(filepath)['features']  # total number of features

%timeit pyogrio.read_dataframe(filepath, skip_features=0, max_features=batch)
%timeit pyogrio.read_dataframe(filepath, where='fid BETWEEN {} AND {}'.format(0, batch))
%timeit pyogrio.read_dataframe(filepath, skip_features=10_000, max_features=batch)
%timeit pyogrio.read_dataframe(filepath, where='fid BETWEEN {} AND {}'.format(10_000, 10_000+batch))
%timeit pyogrio.read_dataframe(filepath, skip_features=n-batch, max_features=batch)
%timeit pyogrio.read_dataframe(filepath, where='fid BETWEEN {} AND {}'.format(n-batch, n))

>>> 23.3 s ± 200 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> 23.1 s ± 164 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> 48.8 s ± 1.32 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> 23.4 s ± 95.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> 1h 39min 16s ± 1min 54s per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> 24.6 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

(Apologies, I should've used an example dataset. Mine is a zipped GPKG with mixed geometry types and 1,539,825 features.)

brendan-ward commented 1 year ago

Thanks for reporting, @aw-west-defra; nearly 2 hours is crazy slow.

Is this zipped GPKG local on your machine or is it on the network? (if the latter, that brings in more moving parts)

It looks like GPKG supports random reads but does not support fast setting of the next index. We use OGR_L_SetNextByIndex internally when you pass skip_features, and its performance varies widely by driver: GDAL warns that if the driver doesn't support fast setting of the next index, it has to scan all records up to that point.

The capabilities listed via read_info() confirm this (same for the zipped GPKG, but with the added overhead of unzipping):

from pyogrio import read_info

read_info('/tmp/test.gpkg')['capabilities']
# {'random_read': 1, 'fast_set_next_by_index': 0, 'fast_spatial_filter': 1}

I am able to reproduce (on MacOS 12.6.5, M1, Python 3.10, GDAL 3.6.4) some of the slowdown with a test dataset of LineStrings that has 2.1M features, but not to the same extreme extent as you saw.

Reading zipped GPKG is much slower than unzipped:

from timeit import timeit

from pyogrio import read_dataframe, read_info

n = read_info('/tmp/test.gpkg')['features']  # total number of features in the test dataset

timeit("tmp = read_dataframe('/tmp/test.gpkg', skip_features=n-100)", number=1, globals=globals())
# 3.27s

timeit("tmp = read_dataframe('/tmp/test.gpkg.zip', skip_features=n-100)", number=1, globals=globals())
# 101.58s

I also found that using where is MUCH faster than skip_features:

timeit(f"tmp = read_dataframe('/tmp/test.gpkg', where='fid BETWEEN {n-100} AND {n}')", number=1, globals=glo
bals())
# 0.029s

timeit(f"tmp = read_dataframe('/tmp/test.gpkg.zip', where='fid BETWEEN {n-100} AND {n}')", number=1, globals
=globals())
# 4.47s

If you know that your FIDs are in incremental order, you can also pass an iterable of FIDs to read instead of using a where, but the performance is the same:

timeit(f"tmp = read_dataframe('/tmp/test.gpkg', fids=range({n-100}, n))", number=1, globals=globals())
# 0.029s

timeit(f"tmp = read_dataframe('/tmp/test.gpkg.zip', fids=range({n-100}, n))", number=1, globals=globals())
# 4.65s

In this case, it seems expected that reading via skip_features will be slow, especially with the zip overhead added on top. Is it possible to unzip the GPKG first before working with it?
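If so, a rough sketch of that workflow could look like this (the paths and the archive member name test.gpkg are assumptions based on the examples above):

import tempfile
import zipfile
from pathlib import Path

from pyogrio import read_dataframe

zipped = '/tmp/test.gpkg.zip'  # hypothetical path, matching the examples above

with tempfile.TemporaryDirectory() as tmpdir:
    # unzip once, up front, instead of paying the zip overhead on every read
    with zipfile.ZipFile(zipped) as zf:
        zf.extractall(tmpdir)

    # assumes the archive contains a single member named test.gpkg
    gpkg = str(Path(tmpdir) / 'test.gpkg')
    df = read_dataframe(gpkg, skip_features=10_000, max_features=100)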

If your FIDs are not sequential, you can read them first and then slice into that list, but I'm finding this to be very slow too because of the zip overhead (it seems at least as bad as unzipping the full file first and then reading it):

from pyogrio.raw import read

fids = read('/tmp/test.gpkg.zip', read_geometry=False, columns=[], return_fids=True)[1]  # FIDs only, no geometry or attributes

It is pretty quick if the GPKG is unzipped first.
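A rough sketch of that slicing approach, reusing the fids array from the snippet above (the slice bounds are just for illustration):

from pyogrio import read_dataframe
from pyogrio.raw import read

# read only the FIDs (no geometry, no attribute columns)
fids = read('/tmp/test.gpkg', read_geometry=False, columns=[], return_fids=True)[1]

# slice out the block of features you want, then read just those by FID
wanted = fids[1_000_000:1_000_100]
df = read_dataframe('/tmp/test.gpkg', fids=wanted)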

I see that we don't have strong warnings in the docs about performance implications of skip_features, so I added #256 to address this.

aw-west-defra commented 1 year ago

My data is on a Databricks filesystem (DBFS); this hasn't been a slowdown for other tasks. I previously compared the speed between zipped and unzipped GPKG and found that zipped often read faster, potentially due to the high compression of my datasets. Is the FID not always in order?

Docs seem like a great solution, and this can be closed whenever you want.

brendan-ward commented 1 year ago

I think with GPKG you can set your own FIDs, so it is possible to write them out of order or at least as a non-incremental series (i.e., with gaps, not starting at a consistent value, etc.). It depends on how those files were originally written. Normally they are incremental and consistent, but it would be worth a check.

Note that some other drivers don't let you write the FID, so for those it is always incremental and predictable.
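A quick way to run that check could be something like this (a sketch reusing the FID-only read shown earlier in the thread):

import numpy as np

from pyogrio.raw import read

# read only the FIDs of the layer
fids = read('/tmp/test.gpkg', read_geometry=False, columns=[], return_fids=True)[1]

# FIDs are incremental and gap-free if each one is exactly 1 more than the previous
incremental = bool(np.all(np.diff(fids) == 1))
print(fids[0], fids[-1], incremental)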

I'm surprised by your finding that zipped is faster for some tasks; do those tasks involve reading the whole file? I'd expect the tradeoff to be between the overhead of unzipping the data (and the potential penalty of having to unzip more of the data in order to do a random read) and file transfer speed: if transfer speed is slow, you wouldn't notice the zip overhead, but if it is fast (e.g., local files) I'd expect the zip overhead to be noticeable. I've never worked with Databricks, so I can't speak to that specifically. Either way, go with what works best in your case, now that you've done the tests to separate the ideal solution from the clearly non-optimal one (1 hour 40 minutes, yikes!).

theroggy commented 9 months ago

For completeness' sake: all of the above is still correct, but skip_features will be a bit faster in GDAL >= 3.8 because of an optimisation implemented there. Nonetheless, filtering on FID will remain much faster and is the recommended approach when possible for database-oriented drivers: https://github.com/OSGeo/gdal/pull/8306
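As an illustration of that recommendation, a batched read over FID ranges (rather than skip_features) might look roughly like this; it assumes incremental FIDs starting at 1, which is the GPKG default but worth verifying as discussed above:

import pyogrio

filepath = '<filepath>'  # placeholder, as in the notebook example above
batch = 100_000
n = pyogrio.read_info(filepath)['features']

for start in range(1, n + 1, batch):
    end = min(start + batch - 1, n)
    # read one FID-range batch at a time instead of using skip_features
    chunk = pyogrio.read_dataframe(filepath, where=f'fid BETWEEN {start} AND {end}')
    # ... process chunk ...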