Toblerity / Fiona

Fiona reads and writes geographic data files
https://fiona.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Writing performance #685

Closed jorisvandenbossche closed 2 years ago

jorisvandenbossche commented 5 years ago

There has been a previous issue about the slow writing of GeoPackage files: https://github.com/Toblerity/Fiona/issues/476. But triggered by https://gis.stackexchange.com/questions/302811/how-to-get-fast-writing-with-geopandas-fiona, I looked into it further with the latest versions of GeoPandas and Fiona, and writing still seems relatively slow.

Performance already improved a lot compared with previous versions of both GeoPandas and Fiona. Of the remaining time, GeoPandas takes the largest share (which I will try to fix, cf. https://github.com/geopandas/geopandas/issues/863). But even then, writing a file with 100k rows and 5 attribute columns takes ca. 10 s with Fiona.

Sample set-up:

import numpy as np
import pandas as pd
import geopandas
import fiona
import shapely.geometry

import random
import string

N = 100000

df = geopandas.GeoDataFrame(
    {'a': np.random.randn(N), 'b': np.random.randn(N),
     'c': np.random.randn(N), 'd': np.random.randint(100, size=N),
     'e': [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(5)) for _ in range(N)],
     'geometry': [shapely.geometry.Point(random.random(), random.random()) for _ in range(N)]})

records = list(df.iterfeatures())
schema = geopandas.io.file.infer_schema(df)

with fiona.Env():
    with fiona.open("test_geopackage.gpkg", 'w', driver="GPKG", schema=schema) as colxn:
        colxn.writerecords(records)

Timing only the Fiona writing part (Fiona 1.8.1 and GDAL 2.3, Python 3.6 on Ubuntu 16.04, installed from conda-forge):

In [37]: %%timeit
    ...: with fiona.Env():
    ...:     with fiona.open("test_geopackage_profile2.gpkg", 'w', driver="GPKG", schema=schema) as colxn:
    ...:         colxn.writerecords(records)
11.4 s ± 284 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Full example notebook exploring the performance of writing that dataframe to GPKG: http://nbviewer.jupyter.org/gist/jorisvandenbossche/c8590a3617698befad527e66eefb7f5b

For comparison, writing the same dataset with QGIS takes only a couple of seconds, and a round trip with ogr2ogr also takes less time (around 5 s for reading and writing combined).

So the main question I am wondering about: are there still ways to improve this in Fiona?

I suppose part of it is inherent to the design, i.e. the fact that we have the data in Python objects and need to convert them to OGR objects. Possibilities I was thinking about exploring further: would (optionally) turning off some validation steps make a difference? Would using WKB instead of the GeoJSON-like mapping as the intermediate geometry representation make a difference? But maybe you already know the answers to those questions, or see other possibilities.

culebron commented 5 years ago

I have seen people posting tricks for writing faster with OGR from Python, and their suggestion was to avoid locking the database. I'm not sure what the bottleneck is here (we would probably need to write to in-memory files to compare).

sgillies commented 5 years ago

@jorisvandenbossche the first step is to use a Python profiler on the code you have under %%timeit. This will point us to the functions that we spend most of our time in and we can begin to optimize those. Fiona's OGRFeatureBuilder is written in C (it's a Cython cdef class) but I would not be surprised if there are marginal gains to be had.
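
For reference, a minimal sketch of such a profiling run, reusing the records and schema objects from the setup above (the output file names here are arbitrary):

import cProfile
import pstats

import fiona

def write_gpkg():
    # the same write as in the %%timeit cell, wrapped so cProfile can call it
    with fiona.Env():
        with fiona.open("test_geopackage_profile.gpkg", "w", driver="GPKG", schema=schema) as colxn:
            colxn.writerecords(records)

cProfile.run("write_gpkg()", "write_gpkg.prof")
pstats.Stats("write_gpkg.prof").sort_stats("cumulative").print_stats(20)

This only shows the Python-level frames; seeing into the Cython and C internals takes extra work.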

jorisvandenbossche commented 5 years ago

I will need to recompile Fiona locally with profile=True to get more insight (to also include Cython-level profiling), but one thing the Python profiler already shows: 70% of the time is spent in WritingSession.writerecs, and 30% in closing the collection (which calls WritingSession.sync). For comparison, when writing GeoJSON, 100% of the time is spent in the actual writing and the closing step does not appear prominently in the timing.
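
For completeness, the standard Cython way of doing that is to rebuild the extension modules with the profile directive enabled, e.g. via the compiler directives passed to cythonize (a sketch of the generic Cython mechanism, not Fiona's actual setup.py):

from setuptools import setup
from Cython.Build import cythonize

# with the "profile" directive on, functions in the compiled extension
# show up in cProfile output instead of being invisible C calls
setup(
    ext_modules=cythonize(
        ["fiona/ogrext.pyx"],
        compiler_directives={"profile": True},
    )
)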

sgillies commented 5 years ago

@jorisvandenbossche I took advice from https://julien.danjou.info/guide-to-python-profiling-cprofile-concrete-case-carbonara/ to install pyprof2calltree and kcachegrind. These are quite handy. I've used them to profile and make a call graph of the following script.


from itertools import chain, repeat
import fiona

with fiona.Env():

    with fiona.open("tests/data/coutwildrnp.shp") as collection:
        features = chain.from_iterable(repeat(list(collection), 2000))

        with fiona.open("/tmp/out.gpkg", "w", schema=collection.schema, crs=collection.crs, driver="GPKG") as dst:
            dst.writerecords(features)
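
For reference, the profiling workflow was roughly the one from that blog post (file names here are placeholders): dump a cProfile trace and convert it for kcachegrind, something like

python -m cProfile -o fiona_gpkg.prof write_gpkg_script.py
pyprof2calltree -k -i fiona_gpkg.prof    # convert and open in kcachegrind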

The script opens the test file, reads its features, multiplies them 2000x, and writes them to a new GeoPackage file. Since our transaction size is 20,000, there are 7 transactions involved. According to my analysis, committing the transactions costs us very little: we spend less than 1% of the time committing transactions.

(screenshot: gpkg_commit call graph)

Apologies for the poorly cropped image.

Here's a look near writerecords.

(screenshot: gpkg_write call graph, zoomed in near writerecords)

We spend 85% of the time writing features out. 32% is spent constructing OGR geometry objects and some 11% of the time in there is somewhat wasted on debug log messages. The geometry builder is one place to look for improvements.

If the feature source provided OGR objects and we skipped GeoJSON deserialization (hypothetical!) the geometry builder would not be needed and ~20% could be eliminated.

There's another 50% of the overall cost in writerecs that is more opaque. This is likely the best code in which to look for improvements. A different tool might be needed because we can't see into OGR from cProfile. Wrapping OGR_L_CreateFeature (https://github.com/Toblerity/Fiona/blob/master/fiona/ogrext.pyx#L1179) up in a Cython function will give us a little more data, too.
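
A rough sketch of what such a wrapper could look like (not Fiona's actual code; the handle typedefs mirror the opaque pointers from ogr_api.h):

cdef extern from "ogr_api.h":
    ctypedef void *OGRLayerH
    ctypedef void *OGRFeatureH
    int OGR_L_CreateFeature(OGRLayerH layer, OGRFeatureH feature)

cdef int create_feature(OGRLayerH layer, OGRFeatureH feature):
    # trivial forwarding wrapper: when the module is built with the Cython
    # profile directive, time spent in OGR_L_CreateFeature gets its own frame
    return OGR_L_CreateFeature(layer, feature)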

sgillies commented 5 years ago

Here's my quick assessment of available speedups for Fiona 1.8.x.

  1. We can delete code in the geometry and feature building classes that calls the logger's debug method more than once per feature.
  2. We can delete code in those classes that logs even once per feature, for an additional, smaller speedup. We'll give up insight into our data processing pipeline, so we should have a little discussion about whether we want to keep some debug logging.
  3. We can skip or greatly reduce validation of features and geometries. Checking each input feature's properties and geometry against the destination's schema is expensive and duplicates functionality of OGR (though I think Fiona's validation is a bit more complete). Being more specific: instead of validating the schema before we attempt to create an OGR Feature, we could catch errors that arise in creating the feature and then do validation after to give users a super useful exception (see the sketch after this list).
  4. We can improve dispatch of feature properties to OGR field setters. The current approach is very naive and doesn't involve the session object, which could probably remember which setter should be used for each property.
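
A sketch of the idea in item 3, at the Python level only (the names here are hypothetical stand-ins, not Fiona's internals):

class SchemaError(ValueError):
    # hypothetical error type, used only for this sketch
    pass

def validate_properties(record, schema):
    # minimal check: report property keys the schema does not know about
    expected = set(schema["properties"])
    return [key for key in record["properties"] if key not in expected]

def write_record(record, schema, build_ogr_feature):
    # try the cheap conversion first; only run the (expensive) validation when
    # it fails, so the common, valid case skips per-feature schema checks
    try:
        return build_ogr_feature(record)
    except Exception as exc:
        problems = validate_properties(record, schema)
        raise SchemaError("record does not match schema: %r" % problems) from exc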

I did a quick experiment and found that implementing (crudely) 1-3 above produced a 5x speedup in my script (above).

Thoughts @jorisvandenbossche @snorfalorpagus @culebron ? Are any of you available to work on such things in the next few months?

culebron commented 5 years ago

@sgillies I want to help, but I need to see how to use the testing tools first.

jorisvandenbossche commented 5 years ago

@sgillies thanks a lot for the exploration!

We'll give up insight into our data processing pipeline, so we should have a little discussion about whether we want to keep some debug logging.

I don't know a lot about the logging / Cython / C combination, so this might be a very naive suggestion. But would there be a way to opt in to debug messages? An if debug: logger.debug(..) check might already give a good speed-up (and avoids the Python interaction with the logging module inside the loop), although this defeats the purpose of the logging levels a bit.
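
Something along these lines, checking the logger's effective level once per batch instead of once per feature (names are illustrative, not Fiona's actual code):

import logging

log = logging.getLogger(__name__)

def write_features(features, create_feature):
    # hoist the level check out of the hot loop; log.debug() inside the loop
    # would otherwise repeat this check (plus argument handling) per feature
    debug_enabled = log.isEnabledFor(logging.DEBUG)
    for i, feature in enumerate(features):
        if debug_enabled:
            log.debug("building feature %d", i)
        create_feature(feature)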

We can skip or greatly reduce validation of features and geometries. Checking each input feature's properties and geometry against the destination's schema is expensive and duplicates functionality of OGR (though I think Fiona's validation is a bit more complete).

Or a way to turn off the validation might be helpful (although I don't know whether such an option is worth the complexity). At least for data coming from GeoPandas, where we determined the schema from the data and know it should be correct, some of this validation might not be needed.

instead of validating the schema before we attempt to create an OGR Feature, we could catch errors that arise in creating the feature and then do validation after to give users a super useful exception.

That seems like a nice approach.

A 5x speedup is already impressive!

The other possibility I thought of, besides the validation, is the creation of the OGR Features from a mapping vs from WKB. @sgillies do you think this might be faster / worth investigating?

Are any of you available to work on such things in the next few months?

I won't be able to put a lot of time on it in the short term. But I am very much interested in seeing this happen, and I also hope to be able to spend somewhat more time on the geo-tools next year.

sgillies commented 5 years ago

@jorisvandenbossche I do think that OGR's WKB serialization and deserialization would be faster than Fiona's dict-based approach. However, it won't cover feature properties, and it wouldn't improve performance significantly when the number of geometry vertices is small. I wonder if the GDAL project would be interested in working on a standard for serializing features that has the properties of GeoJSON (self-describing, language-independent) but is faster. Maybe msgpack?
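
To make the two geometry representations concrete (a small illustration using Shapely and the GDAL Python bindings, not a proposal for Fiona's API):

from osgeo import ogr
from shapely.geometry import Point, mapping

point = Point(4.4, 51.2)

# the current intermediate form: a GeoJSON-like mapping of Python objects
geom_dict = mapping(point)   # {'type': 'Point', 'coordinates': (4.4, 51.2)}

# the WKB alternative: a compact bytes blob that OGR parses in a single call
geom_wkb = point.wkb
ogr_geom = ogr.CreateGeometryFromWkb(geom_wkb)

Either way this only covers the geometry; the feature properties would still need their own path.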

jorisvandenbossche commented 5 years ago

Indeed, the serialization of geometries is only part of the data. Are you thinking about a standard for serializing a single record? In that case something like msgpack might indeed be appropriate, as it is basically a binary version of JSON. If it were about batches of records / full tables, something like Arrow might also be an interesting path.

sgillies commented 2 years ago

I'm going to close this one. I'm seeing a ~2x speedup between 1.8.0 and 1.9a2. Further speedups should be part of the 2.0.0 work.