ENH: Add support for skip_features, max_features for read_arrow

brendan-ward commented 10 months ago

Progress toward feature parity between Arrow and non-Arrow read interfaces.

This adds support for skip_features and max_features to read_arrow, which enables these to be passed through via read_dataframe(path, use_arrow=True, skip_features=10, max_features=2) so that the behavior is the same with and without use_arrow.

I added a note to the introduction to describe the overhead involved: When use_arrow is True, skip_features and max_features will incur additional overhead because all features up to the next batch size above max_features (or size of data layer) will be read prior to slicing out the requested range of features. If max_features is less than the maximum Arrow batch size (65,536 features), only max_features will be read. All features up to skip_features are read from the data source and later discarded because the Arrow interface does not support randomly seeking a starting feature.
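To make the overhead concrete, here is a rough pure-Python sketch of the read-then-slice approach described in that note. It is not the pyogrio implementation: iter_batches and read_range are hypothetical names, and plain lists of feature ids stand in for Arrow record batches.

```python
def iter_batches(n_features, batch_size):
    """Yield consecutive batches (lists of feature ids), standing in for an Arrow stream."""
    for start in range(0, n_features, batch_size):
        yield list(range(start, min(start + batch_size, n_features)))


def read_range(batches, skip_features=0, max_features=None):
    """Read whole batches until the requested range is covered, then slice it out.

    Returns the selected features and the number of batches that had to be read.
    """
    stop = None if max_features is None else skip_features + max_features
    collected = []
    batches_read = 0
    for batch in batches:
        collected.extend(batch)
        batches_read += 1
        # Stop as soon as enough features have been read to cover the range.
        if stop is not None and len(collected) >= stop:
            break
    # Everything before skip_features was read only to be discarded here.
    return collected[skip_features:stop], batches_read


features, batches_read = read_range(iter_batches(10, 4), skip_features=3, max_features=2)
```

With 10 features in batches of 4, asking for 2 features after skipping 3 reads two full batches (8 features) in order to return 2 of them, which is the overhead the note refers to.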

This overhead is relative to reading via Arrow; based on my limited tests so far, it is still generally a lot faster to use Arrow even with these parameters than without Arrow.

This also drops a validation requirement that skip_features be less than the number of features available to read (originally we raised a ValueError). Since using a .slice on a pyarrow Table with a value larger than the size of the original table happily returns an empty table, it made sense to take this approach throughout: if you ask for more features than available, you get back empty arrays / pyarrow Tables / (Geo)DataFrames.
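As an illustration of the behavior change, the sketch below contrasts the two approaches using a plain Python list in place of a pyarrow Table. Both function names and the error message are hypothetical and only mirror the description above.

```python
def select_strict(features, skip_features=0, max_features=None):
    """Earlier behavior as described: reject a skip at or past the end of the data."""
    if skip_features >= len(features):
        # Hypothetical message; the PR only says a ValueError was raised.
        raise ValueError("skip_features must be less than the number of features")
    stop = None if max_features is None else skip_features + max_features
    return features[skip_features:stop]


def select_lenient(features, skip_features=0, max_features=None):
    """New behavior as described: slicing past the end just yields an empty result."""
    stop = None if max_features is None else skip_features + max_features
    return features[skip_features:stop]


data = list(range(5))
result = select_lenient(data, skip_features=10, max_features=2)  # empty list
```

Python list slicing happens to behave like the pyarrow .slice call mentioned above in this respect: an out-of-range start returns an empty result rather than raising.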

theroggy commented 9 months ago

I think this PR adds support for this in gdal 3.8: https://github.com/OSGeo/gdal/pull/8306

jorisvandenbossche commented 9 months ago

Specifically one commit (https://github.com/OSGeo/gdal/pull/8306/commits/248cf600ff40dbd2cccc5f4f74a8a8e82d5f804e): "OGRLayer::GetArrowStream(): do not issue ResetReading() at beginning of iteration, but at end instead, so SetNextByIndex() can be honoured". That's nice! (Although it will of course take some time before we can rely on this in a released version, we can already start using it conditionally, depending on the GDAL version, if we want.)

jorisvandenbossche commented 9 months ago

> This also drops a validation requirement that skip_features be less than the number of features available to read

+1

jorisvandenbossche commented 9 months ago

~Something else: if max_features is small, you can set the MAX_FEATURES_IN_BATCH option to avoid getting too much data from GDAL. We still need the current code to iterate until we have enough data and slice the last batch, because it is not guaranteed that this option is honored (I think it can depend on the driver)~

Forget what I said, we already do that through the batch_size keyword, and you already set that keyword to max_features if it is smaller.
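For reference, the batch-size capping mentioned here could look roughly like the following. This is only a sketch: effective_batch_size is a hypothetical helper, and the 65,536 default is the maximum Arrow batch size quoted in the PR description.

```python
ARROW_MAX_BATCH_SIZE = 65_536  # maximum batch size quoted in the PR description


def effective_batch_size(max_features=None, batch_size=ARROW_MAX_BATCH_SIZE):
    """Shrink the batch size to max_features when fewer features are requested."""
    if max_features is not None and 0 < max_features < batch_size:
        return max_features
    return batch_size


small = effective_batch_size(max_features=2)        # capped to the request
large = effective_batch_size(max_features=100_000)  # left at the default
```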

brendan-ward commented 9 months ago

Latest commit passes skip_features through to GDAL when running GDAL >= 3.8 instead of handling it on our end, which should cut down on some of the overhead.
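The version gating amounts to something like the sketch below. plan_skip is a hypothetical helper, not the actual code in the commit; it only shows how the skip is split between GDAL and our own slicing depending on the GDAL version tuple.

```python
def plan_skip(gdal_version, skip_features):
    """Decide who handles the skip.

    Returns (features for GDAL to skip, features to discard after reading).
    """
    if gdal_version >= (3, 8, 0):
        # GDAL >= 3.8 can start the Arrow stream at the requested feature.
        return skip_features, 0
    # Older GDAL: read from the start and discard the leading features ourselves.
    return 0, skip_features


new_gdal = plan_skip((3, 8, 0), skip_features=10)
old_gdal = plan_skip((3, 7, 2), skip_features=10)
```

On older GDAL versions the read-and-discard overhead described in the PR description still applies.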

geopandas / pyogrio

ENH: Add support for skip_features, max_features for read_arrow #282