Toblerity / Fiona

Fiona reads and writes geographic data files
https://fiona.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Make columnar-based access possible in addition to record-based model? #469

Closed jorisvandenbossche closed 1 year ago

jorisvandenbossche commented 7 years ago

Related to https://github.com/geopandas/geopandas/issues/491 (exploration of ways to make data ingestion faster in geopandas)

Currently fiona exposes a model where you access the data by records (e.g. by iterating over a collection, accessing one record at a time). When the goal is to load the full dataset (e.g. to put all records in a geopandas GeoDataFrame), this record-based access can introduce some performance overhead.

Therefore, I am wondering to what extent fiona would welcome additions that also make columnar-based access possible. By columnar-based access I mean that you could get the values of all records (so of the full collection) at once, as one array per property and one for the geometry.

snorfalorpagus commented 6 years ago

This is certainly possible from a technical perspective. I'd guess the biggest speed gains would come from copying the data from GDAL directly into a NumPy array, avoiding any intermediate Python objects. Fiona doesn't currently depend on NumPy for anything, so I think this would be best in an optional module (similar to what we've done with Shapely). This module could provide methods that wrap Session and a variant of FeatureBuilder.build to return arrays. For the geometry column it's probably easiest to return an array of WKB/WKT data, which could then be quickly parsed with Shapely. Obviously this wouldn't touch the existing behavior - just provide additional features for projects like geopandas.
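
As a rough illustration of the output shape such an optional module might produce, here is a pure-Python sketch (read_arrays is a hypothetical name, not an existing or proposed Fiona API; the real speedup would come from doing this loop in C rather than in Python as shown, and shapely.from_wkb requires Shapely >= 2.0):

import numpy as np
import shapely  # Shapely >= 2.0 provides the vectorized from_wkb
from shapely.geometry import shape

def read_arrays(collection, field_names):
    # Accumulate values per field while iterating the records once.
    columns = {name: [] for name in field_names}
    wkb = []
    for record in collection:  # still record-by-record at this level
        for name in field_names:
            columns[name].append(record["properties"][name])
        wkb.append(shape(record["geometry"]).wkb)
    # One array per property, plus geometries parsed from WKB in bulk.
    geometries = shapely.from_wkb(np.array(wkb, dtype=object))
    return {name: np.asarray(values) for name, values in columns.items()}, geometries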

sgillies commented 6 years ago

@snorfalorpagus @jorisvandenbossche I think that you may be overlooking how deeply record-based vector formats are. Except for the ones backed by relational databases, there is no efficient SELECT foo_column FROM layer; there's a requirement to iterate over the records in the data and build up a list of values.

Since we must iterate over the records of a shapefile (using OGR_L_GetNextFeature), I suggest we look at 3 smaller optimizations.

  1. Add an option to fetch geometry as a WKB blob instead of GeoJSON. In Fiona the conversion is done in C and is fast, but skipping it would be even faster.
  2. Add an option to ignore fields, including even the geometry field. @snorfalorpagus has largely completed this.
  3. Add a collection iterator variant that returns a flatter tuple of fields instead of GeoJSON.

With these optimizations, GeoPandas could make numpy structured arrays from the new fiona collection iterator, yes? Or implement something like the read_csv in Pandas?
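
For concreteness, a minimal sketch of what that could look like on the GeoPandas side: a NumPy structured array built from flat tuples, where the rows list stands in for the output of the proposed flat-tuple iterator and the WKB bytes are placeholders:

import numpy as np

# dtype mirrors a layer schema; "wkb" would hold raw geometry bytes (option 1)
dtype = np.dtype([("name", "U64"), ("area", "f8"), ("wkb", "O")])
rows = [("Fiji", 18272.0, b"\x01..."), ("Tanzania", 945087.0, b"\x01...")]
arr = np.array(rows, dtype=dtype)
names = arr["name"]  # columnar access to a single field
areas = arr["area"]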

jorisvandenbossche commented 6 years ago

> there's a requirement to iterate over the records in the data and build up a list of values.

Yes, but I think what @snorfalorpagus does is, while doing this iteration, directly filling in numpy arrays on the C level (so you don't need to build up a 'list' (in the python sense) of values).

> I suggest we look at 3 smaller optimizations.

To be really sure about the need, I think we should do some benchmarking of the proof of concept @snorfalorpagus has now made, compared to the options you outline above.

> With these optimizations, GeoPandas could make numpy structured arrays from the new fiona collection iterator, yes?

Yes, that would be possible, and it would certainly also be faster than the current implementation, but how much faster is difficult to say.

I would somewhat assume that the overhead of creating intermediate Python objects (whether the full feature dict or a flatter tuple) is the main bottleneck, but it might also be that the actual conversion from WKT to shapely/geopandas geometries takes most of the time of the current geopandas.read_file, and in that case option 1 could already give a lot of speed-up. But as I said, I would need to do some timings to actually assess this.

brendan-ward commented 4 years ago

I'm running an experiment with this idea over in pyogrio. The idea is to use the bare essentials from fiona to create a numpy-oriented API to vector data sources.

Right now, it borrows heavily from the internals of fiona and takes inspiration from #540. I think based on what we learn there, it could certainly help inform how to approach this in fiona. (pyogrio doesn't necessarily need to be a long-lived project, it's more of an experiment than anything else at the moment)

It's still very early; I haven't even added write capability yet. But I thought it would be good to share some of the early results.

The current API is intended as a direct read-to-memory function similar to read_file in geopandas, instead of a more typical open file then read approach. This lets us sidestep the need for a session or other intermediate classes. We read straight into typed numpy arrays for WKB geometries and for each of the attributes. At the top level, we return fields as a list of ndarrays instead of using structured arrays.
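
To make the described contract concrete, here is a toy, self-contained stand-in (names, values, and return shape are illustrative only, not the actual pyogrio signature; the real implementation loops over OGR features in Cython):

import numpy as np

def read(path):
    # Toy stand-in for a direct read-to-memory call.
    meta = {"fields": ["iso_a3", "pop_est"], "crs": "EPSG:4326"}
    geometry = np.array([b"\x01...", b"\x01..."], dtype=object)  # WKB bytes per feature
    field_data = [
        np.array(["FJI", "TZA"]),          # one typed ndarray per field
        np.array([889953.0, 58005463.0]),
    ]
    return meta, geometry, field_data

meta, geometry, field_data = read("ne_110m_admin_0_countries.shp")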

I created quick initial benchmarks with Natural Earth data (Admin 0 and 1 at 10m and 110m) and recent versions of fiona, geopandas, and pygeos.

Compared to fiona:

Compared to geopandas.read_file (into shapely objects) versus converting WKB here to pygeos objects:

(note: this is not an apples-to-apples comparison, many things are conflated here including time to create shapely objects, etc)

These results suggest that while there are some speedups to be had in fiona using a vectorized approach, the major speedups here are at the cross-library level from avoiding intermediate representations (other than WKB). Using WKB as an intermediate format in Python between two C / Cython backends seems to be what really helps us here.
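
The WKB hand-off is cheap to demonstrate: pygeos (since absorbed into Shapely 2.x) parses a whole array of WKB in one vectorized call, so no per-record Python objects are created between the two C backends. A minimal example using the Shapely 2.x spelling:

import numpy as np
import shapely
from shapely.geometry import Point

# An object array of WKB bytes, as a columnar reader might return it.
wkb = np.array([Point(0, 0).wkb, Point(1, 1).wkb], dtype=object)
geoms = shapely.from_wkb(wkb)  # one C-level pass, no intermediate Python dicts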

/cc @jorisvandenbossche @sgillies @snorfalorpagus

caspervdw commented 3 years ago

I have been reading up on this issue. Columnar access (reading) would be a major performance increase for applications of pygeos, and specifically for some applications I will be working on in the near future. Are you still open to including this in Fiona? I will have time to work on this.

There seem to be two parts to this issue:

Maybe some of you are willing to briefly discuss ideas in a video session? Or shall I write up some API plan so that you can shoot at it?

@sgillies @snorfalorpagus @brendan-ward @jorisvandenbossche

brendan-ward commented 3 years ago

Apologies for not following up on this sooner. I'm definitely very interested in discussions around this in whatever form they take. I am also very interested in finding out whether there is interest in integrating this into Fiona or managing it in a stand-alone project.

Since my post above, I've added write capability to pyogrio, better integration with GeoPandas, and have been using it for all my projects. Updated performance benchmarks here but in short, it is substantially faster than Fiona + GeoPandas. Some of the internals borrow a bit from Fiona, but the overall API is quite different.

I'd kindly suggest pyogrio as a basis for considering how to approach this in Fiona or a stand-alone project, because it provides a minimally-complete implementation of the vectorized approach for I/O using GDAL/OGR. (note: pyogrio hasn't yet been cleaned up to the point where I'm ready to promote it as a stand-alone project, including proper credit for the parts borrowed from Fiona. Please view it as a functional experiment to help inform our direction here)

If this isn't of interest to integrate into Fiona, there have been some discussions and interest around migrating it to a GeoPandas organization project on GitHub and integrating it as an optional dependency in GeoPandas (i.e., I'm not possessive of it, I just need the performance benefits). It does need a bit of work on the packaging side to make it more broadly accessible.

There are significant performance wins regardless of where or how this lives on. I think the speedups are largely attributable to using a vectorized approach, and avoiding unnecessary geometry representation changes (esp. true for Fiona + GeoPandas).

I don't want to fracture the community, but I also see a few tradeoffs to where this might live that I'll try to outline below. In particular, I'm keenly aware that trying to merge in a vectorized approach to Fiona could be a substantial undertaking and I most definitely do not want to impose on the goodwill of the Fiona maintainers.

Overall approach:

Some challenges / tradeoffs to integrating this into Fiona:

Maintenance considerations

I'd really like to know what Fiona maintainers think about how we might approach this going forward: separate project or integrated directly into Fiona?

sgillies commented 3 years ago

@brendan-ward I'm inclined to close this issue. Few GIS formats support column-based access well enough, and it seems like pyogrio has a great start on this and might be able to solve the core problem if it doesn't have to complicate itself with Fiona's weird GIS concerns.

rbuffat commented 3 years ago

As @sgillies pointed out in https://github.com/Toblerity/Fiona/issues/469#issuecomment-359489182, the biggest bottleneck in the fiona + geopandas use case is probably the conversion of geometries to Python dictionaries and then again to binary geometries. The second biggest bottleneck is probably that more fields are converted to Python datatypes than are actually needed. The actual performance impact is hard to estimate without proper benchmarking, though. I assume that for most data processing use cases the data I/O is a minor part of the processing time, so improvements in speed will probably only be noticeable in the real world for very large datasets. GDAL supports a broad variety of GIS formats, but I suspect only a really small number of formats are really suitable for large datasets. As Fiona, as well as GDAL, is designed for row-based access to the data, I was thinking it might be worthwhile to directly use libraries optimized for such formats, e.g. SpatiaLite, to squeeze out the best performance.

brendan-ward commented 3 years ago

I've found file I/O to be a major bottleneck for some of my projects, compared to other data processing operations.

I think it is important to note that in pyogrio, we are using the same common formats as here (shapefile, geopackage, etc), and at the OGR level, we are using the same OGR operations. There is nothing there using optimized access for columnar formats; it uses a row-based inner loop. What is different is that each column is stored into its own array while reading, so that the return value from a read operation is a set of arrays.

I'm realizing now that by stating "vectorized all the way down" I may have contributed to some confusion about this. Maybe a better way of saying this is that from a Python perspective, it is vectorized; the loops in Cython account for the row-oriented structure of GIS data.
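
In sketch form (plain Python standing in for the Cython inner loop; names are illustrative), the pattern is a row-oriented loop filling preallocated per-field arrays, so the caller only ever sees columns:

import numpy as np

def read_fields(features, n_features, field_names):
    # Preallocate one array per field, then fill them row by row.
    cols = {name: np.empty(n_features, dtype=object) for name in field_names}
    for i, feature in enumerate(features):  # OGR_L_GetNextFeature-style loop
        for name in field_names:
            cols[name][i] = feature[name]
    return cols

cols = read_fields([{"name": "Fiji"}, {"name": "Chad"}], 2, ["name"])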

Here are some benchmarks that may be helpful here:

Natural Earth Countries (Admin 0) - 1:100M (~1MB shapefile)

Read:

Write to shapefile:

Natural Earth Countries (Admin 0) - 1:10M (~9MB shapefile)

Read:

Write:

I think the differences are noticeable even for small datasets.

Among other things, I think this hints at some possible optimizations in Fiona even without adopting a vectorized approach, especially for writing data. We'd need to do a bit more profiling to see where the hotspots are, but one difference may be the level of data validation Fiona performs before writing, whereas pyogrio does none.

mwtoews commented 2 years ago

xref https://github.com/OSGeo/gdal/pull/5830 adopted as RFC 86: Column-oriented read API for vector layers with target GDAL 3.6
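
For reference, a hedged sketch of consuming RFC 86 from GDAL's own Python bindings (assuming GDAL >= 3.6 built with Arrow support and pyarrow installed; method names are taken from the RFC and GDAL docs and worth verifying against your GDAL version):

from osgeo import ogr

ds = ogr.Open("example.shp")
layer = ds.GetLayer(0)
stream = layer.GetArrowStreamAsPyArrow()  # ArrowArrayStream over the layer
for batch in stream:
    # Each batch exposes the fields (and geometry, as WKB) as Arrow arrays.
    print(batch)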

sgillies commented 2 years ago

Anyone have an idea what a nice Python API for this would look like? Would the usage be something like this?

with fiona.open("example.shp") as collection:
    df = pandas.DataFrame.from_records(collection.column(fields=["name", "area"]))

kylebarron commented 2 years ago

Judging from the RFC, and especially the bench_ogr_batch.cpp example, it looks like GDAL will expose a stream of Arrow record batches?

Then to get an iterator over pandas DataFrames you could imagine something like:

import pyarrow as pa

with fiona.open("example.shp") as collection:
    for record_batch in collection.iter_batches(fields=["name", "area"]):
        df = pa.Table.from_batches([record_batch]).to_pandas()

sgillies commented 2 years ago

@kylebarron thanks for the suggestion! That makes a lot of sense and is consistent with In [10] under https://arrow.apache.org/docs/python/ipc.html#using-streams. I'm going to add this to the 1.9 milestone and start digging into GDAL's new API.
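
For readers unfamiliar with that pattern, here is a minimal self-contained pyarrow IPC stream round-trip (pure pyarrow, independent of Fiona/GDAL), writing record batches to a stream and reading them back batch by batch:

import pyarrow as pa

schema = pa.schema([("name", pa.string()), ("area", pa.float64())])
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, schema) as writer:
    writer.write_batch(pa.record_batch([["Fiji"], [18272.0]], schema=schema))
with pa.ipc.open_stream(sink.getvalue()) as reader:
    for batch in reader:  # consume the stream one record batch at a time
        df = batch.to_pandas()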

sgillies commented 2 years ago

In #1121 I'm able to use OGR's new API with Cython, but I haven't made much progress towards a nice Python API yet.

sgillies commented 1 year ago

I've removed this from the 1.9 milestone. I'm thinking that Fiona 1.9 and 2.0 should stick to rows and let some other package take care of column-based vector data.

jorisvandenbossche commented 1 year ago

And for future readers (who haven't read all of the above): one "other package" that we are developing for geopandas and that focuses on columnar-based IO is pyogrio: https://github.com/geopandas/pyogrio/ (and this also exposes the new RFC 86 column-oriented read API of GDAL)