geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

use arrow by default? #278

Open martinfleis opened 10 months ago

martinfleis commented 10 months ago

I wasn't following the implementation of use_arrow in detail, so I'm wondering: would it make sense to default to True for cases where all of the options are compatible with it? Maybe change the actual default to None and have a smart heuristic decide whether it can be used for a performance boost or must be skipped due to compatibility issues.

I would like to avoid the situation we have now in geopandas, where we have a faster way of reading and writing files than the default, but you have to opt in (engine="pyogrio"). If arrow can be used automatically for at least a subset of use cases, I think it would be beneficial.

Is there anything major blocking that?
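The None-default heuristic Martin describes could look something like this sketch (the helper name and the exact compatibility checks are hypothetical, not existing pyogrio API):

```python
import importlib.util

def resolve_use_arrow(use_arrow, skip_features=0, max_features=None):
    """Resolve a use_arrow=None default to a concrete True/False choice."""
    if use_arrow is not None:
        # an explicit user choice always wins
        return use_arrow
    if importlib.util.find_spec("pyarrow") is None:
        # pyarrow not installed -> fall back to the classic read path
        return False
    # only pick Arrow when the requested options are compatible with it
    # (skip_features/max_features stand in for the real compatibility list)
    return skip_features == 0 and max_features is None
```

A reader asking for incompatible options would then silently get the classic path instead of an error.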

kylebarron commented 10 months ago

I think the GDAL Arrow integration is groundbreaking in its potential but is very underused right now, so I'd love to see it become the default in the future. I do understand, however, that for compatibility reasons some might be hesitant to make it the default.

> compatible with it

If I'm reading it correctly, the cases not supported by the Arrow integration are these: https://github.com/geopandas/pyogrio/blob/94c2bb474af2b5d254369a519b6ca8f27b732640/pyogrio/_io.pyx#L1182-L1191

I suppose these aren't supported by GDAL? For skip_features and max_features, it would seem those are easy to implement on the pyogrio side, though it would need a slight refactor, as those params would be used at a higher level when reading batches from the RecordBatchReader, and not in ogr_open_arrow itself.

One other downside of using Arrow by default is that it makes pyarrow a required dependency, and pyarrow can be quite large. (But this may be moot if/when pandas makes pyarrow a required dependency)

brendan-ward commented 10 months ago

I'm open to the idea of making Arrow automatic if pyarrow is already installed and the parameters are compatible. The only place we've seen performance of Arrow be slower (and it wasn't that much slower) was reading OSM data, which may point to some issues on the GDAL side (haven't had a chance to investigate / report those further); all other cases show that Arrow is faster.

We'd have to check about GDAL support for the parts we're using to implement skip_features / max_features. I'm not sure that we want to implement them after reading from the data source, since part of their intent was to avoid reading unused features at all (and avoid associated memory for such, esp. for larger files). I'm not sure that either of them are all that commonly used in practice though; they're mostly handy for tests...

theroggy commented 10 months ago

I looked at the impact of making the switch in geofileops, and even though my tests there are by no means full coverage of all possible incompatibilities between fiona/pyogrio/pyogrio+arrow, only two (easy to fix) issues/differences turned up: https://github.com/geopandas/pyogrio/issues/263 and https://github.com/geopandas/pyogrio/issues/262.

Maybe a beta version could be packaged with arrow as the default, to make it easy to plug into the CI tests of some projects we are involved in and get better coverage of potential issues?

brendan-ward commented 10 months ago

Unfortunately, it looks like Arrow support is disabled for GDAL in vcpkg (we use vcpkg for building wheels). I'm not familiar enough with vcpkg to know whether it is possible for us to install Arrow ourselves and then override the GDAL vcpkg port to build against it.

brendan-ward commented 10 months ago

GDAL ignores setting the feature index to read (via OGR_L_SetNextByIndex) when using arrow, so that isn't a drop-in fix.

If max_features is set, we can iterate over batches up to skip_features + max_features, then slice out the range we want. We can even set batch_size to max_features if it is less than 65536 and only read out that many features, which should be lower memory. Otherwise, we end up reading a bit too much (up to part of one extra batch) and then tossing out the extra.

Using a skip_features > batch_size is not great though, because we have to incrementally read and discard batches up to that point, but this seems to be inherent in using arrow in general.
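The batch-level slicing described above can be sketched with plain lists standing in for Arrow record batches (a RecordBatchReader would be consumed the same way; this is an illustration, not pyogrio's implementation):

```python
def slice_batches(batches, skip_features=0, max_features=None):
    """Yield rows from an iterable of batches, skipping the first
    `skip_features` rows and stopping after `max_features` rows.

    Batches before the skip point are still pulled from the source
    and discarded, which is the inherent cost mentioned above.
    """
    skipped = 0
    emitted = 0
    for batch in batches:
        for row in batch:
            if skipped < skip_features:
                skipped += 1
                continue
            if max_features is not None and emitted >= max_features:
                return
            yield row
            emitted += 1
```

With real Arrow batches one would slice whole batches rather than iterate rows, but the skip/limit bookkeeping is the same.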

theroggy commented 10 months ago

> GDAL ignores setting the feature index to read (via OGR_L_SetNextByIndex) when using arrow, so that isn't a drop-in fix.
>
> If max_features is set, we can iterate over batches up to skip_features + max_features, then slice out the range we want. We can even set batch_size to max_features if it is less than 65536 and only read out that many features, which should be lower memory. Otherwise, we end up reading a bit too much (up to part of one extra batch) and then tossing out the extra.
>
> Using a skip_features > batch_size is not great though, because we have to incrementally read and discard batches up to that point, but this seems to be inherent in using arrow in general.

Another option to deal with this is via SQL, using e.g. the following statement:

f'SELECT * FROM "{layername}" OFFSET {skip_features} LIMIT {max_features}'

Remains to be tested what the performance is for non-SQLite-based file formats...
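A hedged sketch of that SQL approach, with a hypothetical helper building the statement (note that standard SQLite syntax places LIMIT before OFFSET):

```python
def paging_sql(layername, skip_features, max_features):
    """Build a SQL statement emulating skip_features/max_features.

    `layername` and the parameter names are illustrative; dialect
    support for OFFSET/LIMIT varies by driver, as noted in the thread.
    """
    return (
        f'SELECT * FROM "{layername}" '
        f"LIMIT {max_features} OFFSET {skip_features}"
    )
```

A string like this could then be passed to read_dataframe via its sql parameter, shifting the skipping work to the driver instead of discarding batches in Python.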

jorisvandenbossche commented 9 months ago

> Unfortunately, it looks like Arrow support is disabled for GDAL in vcpkg (we use vcpkg for building wheels).

The use_arrow support using GDAL's OGR_L_GetArrowStream is implemented without any dependency on the Arrow C++ library. GDAL only needs this dependency for reading Parquet format and Arrow/IPC files (https://gdal.org/drivers/vector/arrow.html). So our wheels actually already support using use_arrow=True if the user has pyarrow installed for the conversion on our side.

We could consider doing something similar: support consuming the ArrowStream without actually having a hard dependency on pyarrow. I don't know how important this use case of pyogrio is, but it would address the possible concern about adding pyarrow as a required dependency:

> One other downside of using Arrow by default is that it makes pyarrow a required dependency, and pyarrow can be quite large. (But this may be moot if/when pandas makes pyarrow a required dependency)

One option for this could even be to just re-use the code of the GDAL python bindings, where they have an implementation of GetArrowStreamAsNumPy. And long term I would actually like to see a generic implementation of this in a lightweight package like the python bindings for nanoarrow that makes it easier to work with ArrowStream pointers without the pyarrow dependency (https://github.com/apache/arrow-nanoarrow/issues/53).

jorisvandenbossche commented 9 months ago

And to be explicit: I agree very much with the gist of Martin's top post. Ideally we give users the best default out of the box. For geopandas we now ask users to specify engine="pyogrio"; it would be good to avoid that, after making that the default, we again have to ask users to manually specify use_arrow=True to get the best option.

jorisvandenbossche commented 9 months ago

Something else: it would be a good start for us to test use_arrow=True more extensively ourselves (i.e. parametrize all relevant tests; Brendan's PR at https://github.com/geopandas/pyogrio/pull/286 should be a good start for that). For example, when testing the upcoming support for timezones (https://github.com/geopandas/pyogrio/pull/253) with the ArrowStream reader, I noticed that timezones were also not yet implemented there (https://github.com/OSGeo/gdal/issues/8460, and in the meantime there is already a PR to fix that for GDAL 3.8).

theroggy commented 9 months ago

I think it would be practical to be able to override the default values for those kinds of parameters via an environment variable.

For example, for use_arrow: in read_dataframe(), change the default value to use_arrow=None. If the value is None, check whether the environment variable PYOGRIO_USE_ARROW exists and use its value; if not, use whatever the hardcoded default is at that time.

This makes it easy for anyone to change/test this without making any code changes. If there are issues, you can just as easily roll it back, or if there are issues only in some specific places, you can force one or the other specifically there.

Also in tests it is practical to handle it like that: make a fixture that sets the environment variable first to True, then to False. But obviously it is also not a problem to use a pytest mark/fixture and add use_arrow=use_arrow to every read_dataframe call in the pyogrio tests.
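The environment-variable fallback could be resolved with a small helper like this (PYOGRIO_USE_ARROW is the variable proposed above; the helper itself and its truthy-string parsing are hypothetical):

```python
import os

def use_arrow_default(use_arrow=None, hardcoded_default=False):
    """Resolve use_arrow: explicit argument > environment > hardcoded default."""
    if use_arrow is not None:
        return use_arrow
    env = os.environ.get("PYOGRIO_USE_ARROW")
    if env is not None:
        # accept common truthy spellings from the environment
        return env.strip().lower() in ("1", "true", "yes", "on")
    return hardcoded_default
```

A test fixture could then flip the behavior with monkeypatch.setenv("PYOGRIO_USE_ARROW", "1") without touching any call sites.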

In geopandas, for the engine keyword I think this might be practical as well.

martinfleis commented 9 months ago

Thorough testing is a good idea for sure.

I agree that having pyarrow as a hard dependency is a no-go, but I'd suggest setting the default to None and using arrow by default if pyarrow is available, with standard reading if it is not. That feels like an optimal balance between defaulting to the fastest path and depending on arrow.

It is a bit similar to what a lot of packages do with numba. It works without it, but it is much better if numba is installed.
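The numba-style pattern usually amounts to detecting the optional dependency once at import time and branching on a flag; in pyogrio's case that might look like this (HAS_PYARROW and pick_use_arrow are illustrative names, not existing API):

```python
# Detect the optional dependency once, at import time.
try:
    import pyarrow  # noqa: F401
    HAS_PYARROW = True
except ImportError:
    HAS_PYARROW = False

def pick_use_arrow(use_arrow=None):
    """Opt into the fast Arrow path only when pyarrow is importable."""
    if use_arrow is None:
        return HAS_PYARROW
    return use_arrow
```

Users who install pyarrow get the fast path automatically; everyone else keeps working, just slower.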