I think this PR adds support for this in GDAL 3.8: https://github.com/OSGeo/gdal/pull/8306

Specifically one commit (https://github.com/OSGeo/gdal/pull/8306/commits/248cf600ff40dbd2cccc5f4f74a8a8e82d5f804e): "OGRLayer::GetArrowStream(): do not issue ResetReading() at beginning of iteration, but at end instead, so SetNextByIndex() can be honoured". That's nice! (Although it will of course still take some time before we can use it in a released version, we can already start using it conditionally, depending on the GDAL version, if we want.)
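For reference, a minimal sketch of what that commit enables at the GDAL level, using the osgeo Python bindings (the path is hypothetical, and this assumes GDAL >= 3.8; earlier versions issue ResetReading() when the stream starts, which discards the seek):

```python
from osgeo import ogr

ds = ogr.Open("example.gpkg")
lyr = ds.GetLayer(0)

# seek to feature index 10; with GDAL >= 3.8, GetArrowStream() no longer
# calls ResetReading() up front, so this position is honoured
lyr.SetNextByIndex(10)

stream = lyr.GetArrowStreamAsPyArrow()
for batch in stream:
    # batches now start at feature 10 instead of the beginning of the layer
    print(batch)
```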
> This also drops a validation requirement that `skip_features` be less than the number of features available to read

+1
~~Something else: if `max_features` is small, you can set the `MAX_FEATURES_IN_BATCH` option to avoid getting too much data from GDAL. We would still need the current code to iterate until we have enough data and slice the last batch, because it is not guaranteed that that option is honored (I think it can depend on the driver).~~

Forget what I said: we already do that through the `batch_size` keyword, and you already set that keyword to `max_features` if it is smaller.
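For illustration, a rough sketch of that mapping (variable names are hypothetical; `MAX_FEATURES_IN_BATCH` is the GetArrowStream() option that `batch_size` ultimately feeds into):

```python
# if the caller wants fewer features than one full batch, shrink the batch
# so GDAL does not materialize more rows than we will keep
if max_features is not None and max_features < batch_size:
    batch_size = max_features

# options are passed to GetArrowStream() as "KEY=VALUE" strings; drivers
# are not guaranteed to honor this exactly, hence the slicing of the
# last batch mentioned above
stream_options = [f"MAX_FEATURES_IN_BATCH={batch_size}"]
```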
The latest commit passes `skip_features` to GDAL >= 3.8 instead of handling it on our end, which should cut down on some of the overhead.
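Roughly, the dispatch looks like this (a sketch only, with hypothetical variable names, not the exact code in the commit):

```python
# with GDAL >= 3.8, push the skip down to GDAL via SetNextByIndex(),
# which GetArrowStream() now honours; on older versions, read batches
# and discard the first skip_features rows ourselves
if skip_features and gdal_version >= (3, 8, 0):
    ogr_layer.SetNextByIndex(skip_features)
    skip_features = 0  # nothing left to skip on our end
# any remaining skip_features is applied by slicing the collected batches
```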
Progress toward feature parity between Arrow and non-Arrow read interfaces.
This adds support for `skip_features` and `max_features` to `read_arrow`, which enables these to be passed through via `read_dataframe(path, use_arrow=True, skip_features=10, max_features=2)` so that the behavior is the same with and without `use_arrow`.

I added a note to the introduction to describe the overhead involved:
> If `use_arrow` is `True`, `skip_features` and `max_features` will incur additional overhead because all features up to the next batch size above `max_features` (or the size of the data layer) will be read prior to slicing out the requested range of features. If `max_features` is less than the maximum Arrow batch size (65,536 features), only `max_features` will be read. All features up to `skip_features` are read from the data source and later discarded, because the Arrow interface does not support randomly seeking a starting feature.

This overhead is relative to reading via Arrow; based on my limited tests so far, it is still generally a lot faster to use Arrow even with these parameters than without Arrow.
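As a quick illustration of the parity this gives us (the file path is hypothetical):

```python
import pyogrio

# same slice of features, with and without the Arrow read path
df_arrow = pyogrio.read_dataframe(
    "example.gpkg", use_arrow=True, skip_features=10, max_features=2
)
df_plain = pyogrio.read_dataframe("example.gpkg", skip_features=10, max_features=2)
assert df_arrow.equals(df_plain)
```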
This also drops a validation requirement that `skip_features` be less than the number of features available to read (originally we raised a `ValueError`). Since using `.slice` on a pyarrow Table with a value larger than the size of the original table happily returns an empty table, it made sense to take this approach throughout: if you ask for more features than are available, you get back empty arrays / pyarrow Tables / (Geo)DataFrames.
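For illustration, the pyarrow behavior in question:

```python
import pyarrow as pa

table = pa.table({"a": [1, 2, 3]})

# slicing past the end of the table returns an empty table rather than raising
empty = table.slice(10)
assert empty.num_rows == 0
```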