Toblerity / Fiona

Fiona reads and writes geographic data files
https://fiona.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Make columnar-based access possible in addition to record-based model? #469

Closed jorisvandenbossche closed 1 year ago

jorisvandenbossche commented 7 years ago

Related to https://github.com/geopandas/geopandas/issues/491 (exploration of ways to make data ingestion faster in geopandas)

Currently fiona exposes a model where you access the data by records (e.g. by iterating over a collection, accessing one record at a time). When the goal is to load the full dataset (e.g. to put all records in a geopandas GeoDataFrame), this record-based access can introduce some performance overhead.

Therefore, I am wondering to what extent fiona would welcome additions that also make columnar-based access possible. By columnar-based access I mean that you could get the values of all records (so of the full collection) at once, as one array per property and one for the geometry.

snorfalorpagus commented 6 years ago

This is certainly possible from a technical perspective. I'd guess the biggest speed gains would come from copying the data from GDAL directly into a NumPy array, avoiding any intermediate Python objects. Fiona doesn't currently depend on NumPy for anything, so I think this would be best in an optional module (similar to what we've done with Shapely). This module could provide methods that wrap Session and a variant of FeatureBuilder.build to return arrays. For the geometry column it's probably easiest to return an array of WKB/WKT data, which could then be quickly parsed with Shapely. Obviously this wouldn't touch the existing behavior - just provide additional features for projects like geopandas.
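
As a rough illustration of the output shape such an optional module might produce, here is a pure-Python sketch (read_arrays is a hypothetical name, not an existing or proposed Fiona API; the real speedup would come from doing this loop in C rather than in Python as shown, and shapely.from_wkb requires Shapely >= 2.0):

import numpy as np
import shapely  # Shapely >= 2.0 provides the vectorized from_wkb
from shapely.geometry import shape

def read_arrays(collection, field_names):
    # Accumulate values per field while iterating the records once.
    columns = {name: [] for name in field_names}
    wkb = []
    for record in collection:  # still record-by-record at this level
        for name in field_names:
            columns[name].append(record["properties"][name])
        wkb.append(shape(record["geometry"]).wkb)
    # One array per property, plus geometries parsed from WKB in bulk.
    geometries = shapely.from_wkb(np.array(wkb, dtype=object))
    return {name: np.asarray(values) for name, values in columns.items()}, geometries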

sgillies commented 6 years ago

@snorfalorpagus @jorisvandenbossche I think that you may be overlooking how deeply record-based vector formats are. Except for the ones backed by relational databases, there is no efficient SELECT foo_column FROM layer; there's a requirement to iterate over the records in the data and build up a list of values.

Since we must iterate over the records of a shapefile (using OGR_L_GetNextFeature), I suggest we look at 3 smaller optimizations.

  1. Add an option to fetch geometry as a WKB blob instead of GeoJSON. In Fiona the conversion is done in C and is fast, but skipping it would be even faster.
  2. Add an option to ignore fields, including even the geometry field. @snorfalorpagus has largely completed this.
  3. Add a collection iterator variant that returns a flatter tuple of fields instead of GeoJSON.

With these optimizations, GeoPandas could make numpy structured arrays from the new fiona collection iterator, yes? Or implement something like the read_csv in Pandas?
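
For concreteness, a minimal sketch of what that could look like on the GeoPandas side: a NumPy structured array built from flat tuples, where the rows list stands in for the output of the proposed flat-tuple iterator and the WKB bytes are placeholders:

import numpy as np

# dtype mirrors a layer schema; "wkb" would hold raw geometry bytes (option 1)
dtype = np.dtype([("name", "U64"), ("area", "f8"), ("wkb", "O")])
rows = [("Fiji", 18272.0, b"\x01..."), ("Tanzania", 945087.0, b"\x01...")]
arr = np.array(rows, dtype=dtype)
names = arr["name"]  # columnar access to a single field
areas = arr["area"]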

jorisvandenbossche commented 6 years ago

> there's a requirement to iterate over the records in the data and build up a list of values.

Yes, but I think what @snorfalorpagus does is, while doing this iteration, directly filling in numpy arrays on the C level (so you don't need to build up a 'list' (in the python sense) of values).

> I suggest we look at 3 smaller optimizations.

To be really sure about the need, I think we should do some benchmarking of the proof of concept @snorfalorpagus has now made, compared to the options you outline above.

> With these optimizations, GeoPandas could make numpy structured arrays from the new fiona collection iterator, yes?

Yes, that would be possible, and it would certainly also be faster than the current implementation, but how much faster is difficult to say.

I would somewhat assume that the overhead of creating intermediate Python objects (whether the full feature dict or a flatter tuple) is the main bottleneck, but it might also be that the actual conversion from WKT to shapely/geopandas geometries takes most of the time of the current geopandas.read_file, and in that case option 1 could already give a lot of speed-up. But as I said, I would need to do some timings to actually assess this.

brendan-ward commented 4 years ago

I'm running an experiment with this idea over in pyogrio. The idea is to use the bare essentials from fiona to create a numpy-oriented API to vector data sources.

Right now, it borrows heavily from the internals of fiona and takes inspiration from #540. I think based on what we learn there, it could certainly help inform how to approach this in fiona. (pyogrio doesn't necessarily need to be a long-lived project, it's more of an experiment than anything else at the moment)

It's still very early; I haven't even added write capability yet. But I thought it would be good to share some of the early results.

The current API is intended as a direct read-to-memory function similar to read_file in geopandas, instead of a more typical open file then read approach. This lets us sidestep the need for a session or other intermediate classes. We read straight into typed numpy arrays for WKB geometries and for each of the attributes. At the top level, we return fields as a list of ndarrays instead of using structured arrays.
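
To make the described contract concrete, here is a toy, self-contained stand-in (names, values, and return shape are illustrative only, not the actual pyogrio signature; the real implementation loops over OGR features in Cython):

import numpy as np

def read(path):
    # Toy stand-in for a direct read-to-memory call.
    meta = {"fields": ["iso_a3", "pop_est"], "crs": "EPSG:4326"}
    geometry = np.array([b"\x01...", b"\x01..."], dtype=object)  # WKB bytes per feature
    field_data = [
        np.array(["FJI", "TZA"]),          # one typed ndarray per field
        np.array([889953.0, 58005463.0]),
    ]
    return meta, geometry, field_data

meta, geometry, field_data = read("ne_110m_admin_0_countries.shp")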

I created quick initial benchmarks with Natural Earth data (Admin 0 and 1 at 10m and 110m) and recent versions of fiona, geopandas, and pygeos.

Compared to fiona:

Compared to geopandas.read_file (into shapely objects) versus converting WKB here to pygeos objects:

(note: this is not an apples-to-apples comparison, many things are conflated here including time to create shapely objects, etc)

These results suggest that while there are some speedups to be had in fiona using a vectorized approach, the major speedups here are at the cross-library level from avoiding intermediate representations (other than WKB). Using WKB as an intermediate format in Python between two C / Cython backends seems to be what really helps us here.
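
The WKB hand-off is cheap to demonstrate: pygeos (since absorbed into Shapely 2.x) parses a whole array of WKB in one vectorized call, so no per-record Python objects are created between the two C backends. A minimal example using the Shapely 2.x spelling:

import numpy as np
import shapely
from shapely.geometry import Point

# An object array of WKB bytes, as a columnar reader might return it.
wkb = np.array([Point(0, 0).wkb, Point(1, 1).wkb], dtype=object)
geoms = shapely.from_wkb(wkb)  # one C-level pass, no intermediate Python dicts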

/cc @jorisvandenbossche @sgillies @snorfalorpagus

caspervdw commented 3 years ago

I have been reading up on this issue. Columnar access (reading) would be a major performance increase for applications of pygeos, and specifically for some applications I will be working on in the near future. Are you still open to including this in Fiona? I will have time to work on this.

There seem to be two parts to this issue:

Maybe some of you are willing to briefly discuss ideas in a video session? Or shall I write up some API plan so that you can shoot at it?

@sgillies @snorfalorpagus @brendan-ward @jorisvandenbossche

brendan-ward commented 3 years ago

Apologies for not following up on this sooner. I'm definitely very interested in discussions around this in whatever form they take. I am also very interested in finding out whether there is interest in integrating this into Fiona or managing it in a stand-alone project.

Since my post above, I've added write capability to pyogrio, better integration with GeoPandas, and have been using it for all my projects. Updated performance benchmarks here but in short, it is substantially faster than Fiona + GeoPandas. Some of the internals borrow a bit from Fiona, but the overall API is quite different.

I'd kindly suggest pyogrio as a basis for considering how to approach this in Fiona or a stand-alone project, because it provides a minimally-complete implementation of the vectorized approach for I/O using GDAL/OGR. (note: pyogrio hasn't yet been cleaned up to the point where I'm ready to promote it as a stand-alone project, including proper credit for the parts borrowed from Fiona. Please view it as a functional experiment to help inform our direction here)

If this isn't of interest to integrate into Fiona, there have been some discussions and interest around migrating it to a GeoPandas organization project on GitHub and integrating it as an optional dependency in GeoPandas (i.e., I'm not possessive of it, I just need the performance benefits). It does need a bit of work on the packaging side to make it more broadly accessible.

There are significant performance wins regardless of where or how this lives on. I think the speedups are largely attributable to using a vectorized approach, and avoiding unnecessary geometry representation changes (esp. true for Fiona + GeoPandas).

I don't want to fracture the community, but I also see a few tradeoffs to where this might live that I'll try to outline below. In particular, I'm keenly aware that trying to merge in a vectorized approach to Fiona could be a substantial undertaking and I most definitely do not want to impose on the goodwill of the Fiona maintainers.

Overall approach:

Some challenges / tradeoffs to integrating this into Fiona:

Maintenance considerations

I'd really like to know what Fiona maintainers think about how we might approach this going forward: separate project or integrated directly into Fiona?

sgillies commented 3 years ago

@brendan-ward I'm inclined to close this issue. Few GIS formats support column-based access well enough, and it seems like pyogrio has a great start on this and might be able to solve the core problem if it doesn't have to complicate itself with Fiona's weird GIS concerns.

rbuffat commented 3 years ago

As @sgillies pointed out in https://github.com/Toblerity/Fiona/issues/469#issuecomment-359489182, the biggest bottleneck in the fiona + geopandas use case is probably the conversion of geometries to Python dictionaries and then again to binary geometries. The second biggest bottleneck is probably that more fields are converted to Python datatypes than are actually needed. The actual performance impact is hard to estimate without proper benchmarking, though. I assume that for most data processing use cases the data I/O is a minor part of the processing time, so improvements in speed will probably only be noticeable in the real world for very large datasets. GDAL supports a broad variety of GIS formats, but I suspect only a really small number of formats are really suitable for large datasets. As Fiona, as well as GDAL, is designed for row-based access to the data, I was thinking it might be worthwhile to directly use libraries optimized for such formats, e.g. SpatiaLite, to squeeze out the best performance.

brendan-ward commented 3 years ago

I've found file I/O to be a major bottleneck for some of my projects, compared to other data processing operations.

I think it is important to note that in pyogrio, we are using the same common formats as here (shapefile, geopackage, etc), and at the OGR level, we are using the same OGR operations. There is nothing there using optimized access for columnar formats; it uses a row-based inner loop. What is different is that each column is stored into its own array while reading, so that the return value from a read operation is a set of arrays.

I'm realizing now that by stating "vectorized all the way down" I may have contributed to some confusion about this. Maybe a better way of saying this is that from a Python perspective, it is vectorized; the loops in Cython account for the row-oriented structure of GIS data.
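
In sketch form (plain Python standing in for the Cython inner loop; names are illustrative), the pattern is a row-oriented loop filling preallocated per-field arrays, so the caller only ever sees columns:

import numpy as np

def read_fields(features, n_features, field_names):
    # Preallocate one array per field, then fill them row by row.
    cols = {name: np.empty(n_features, dtype=object) for name in field_names}
    for i, feature in enumerate(features):  # OGR_L_GetNextFeature-style loop
        for name in field_names:
            cols[name][i] = feature[name]
    return cols

cols = read_fields([{"name": "Fiji"}, {"name": "Chad"}], 2, ["name"])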

Here are some benchmarks that may be helpful here:

Natural Earth Countries (Admin 0) - 1:100M (~1MB shapefile)

Read:

Write to shapefile:

Natural Earth Countries (Admin 0) - 1:10M (~9MB shapefile)

Read:

Write:

I think the differences are noticeable even for small datasets.

Among other things, I think this hints at some possible optimizations in Fiona even without adopting a vectorized approach, especially for writing data. We'd need to do a bit more profiling to see where the hotspots are, but one difference may be the level of data validation Fiona performs before writing, whereas pyogrio does none.

mwtoews commented 2 years ago

xref https://github.com/OSGeo/gdal/pull/5830 adopted as RFC 86: Column-oriented read API for vector layers with target GDAL 3.6
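
For reference, a hedged sketch of consuming RFC 86 from GDAL's own Python bindings (assuming GDAL >= 3.6 built with Arrow support and pyarrow installed; method names are taken from the RFC and GDAL docs and worth verifying against your GDAL version):

from osgeo import ogr

ds = ogr.Open("example.shp")
layer = ds.GetLayer(0)
stream = layer.GetArrowStreamAsPyArrow()  # ArrowArrayStream over the layer
for batch in stream:
    # Each batch exposes the fields (and geometry, as WKB) as Arrow arrays.
    print(batch)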

sgillies commented 2 years ago

Anyone have an idea what a nice Python API for this would look like? Would the usage be something like this?

with fiona.open("example.shp") as collection:
    df = pandas.DataFrame.from_records(collection.column(fields=["name", "area"]))

kylebarron commented 2 years ago

Judging from the RFC, and especially the bench_ogr_batch.cpp example, it looks like GDAL will expose a stream of Arrow record batches?

Then to get an iterator over pandas DataFrames you could imagine something like:

import pyarrow as pa

with fiona.open("example.shp") as collection:
    for record_batch in collection.iter_batches(fields=["name", "area"]):
        df = pa.Table.from_batches([record_batch]).to_pandas()

sgillies commented 2 years ago

@kylebarron thanks for the suggestion! That makes a lot of sense and is consistent with In [10] under https://arrow.apache.org/docs/python/ipc.html#using-streams. I'm going to add this to the 1.9 milestone and start digging into GDAL's new API.
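
For readers unfamiliar with that pattern, here is a minimal self-contained pyarrow IPC stream round-trip (pure pyarrow, independent of Fiona/GDAL), writing record batches to a stream and reading them back batch by batch:

import pyarrow as pa

schema = pa.schema([("name", pa.string()), ("area", pa.float64())])
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, schema) as writer:
    writer.write_batch(pa.record_batch([["Fiji"], [18272.0]], schema=schema))
with pa.ipc.open_stream(sink.getvalue()) as reader:
    for batch in reader:  # consume the stream one record batch at a time
        df = batch.to_pandas()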

sgillies commented 2 years ago

In #1121 I'm able to use OGR's new API with Cython, but I haven't made much progress towards a nice Python API yet.

sgillies commented 1 year ago

I've removed this from the 1.9 milestone. I'm thinking that Fiona 1.9 and 2.0 should stick to rows and let some other package take care of column-based vector data.

jorisvandenbossche commented 1 year ago

And for future readers (who haven't read all of the above): one "other package" that we are developing for geopandas and that focuses on columnar-based IO is pyogrio: https://github.com/geopandas/pyogrio/ (and this also exposes the new RFC 86 column-oriented read API of GDAL)