Shapefile writer performance improvements?

airbreather commented 6 years ago

From @xivk on February 28, 2017 9:32

I'm working on a tool for a client of mine that converts OSM data into a shapefile that can easily be consumed for routing and analytics.

Basically I have 2 steps:

1. Building a routerdb with Itinero.
2. Writing the result as a shapefile.

Now Itinero can handle the planet OSM file and build a database for the entire world. I don't think it's even possible to attempt writing this to a shapefile, but I am already having performance issues at a smaller scale: for a country like Germany, it takes about 10 hours (!) to write a shapefile with all German roads. The Itinero-only step takes about 30 mins.

I have checked and the bottleneck clearly is the shapefile writer.

So I'm wondering, is it possible to improve the performance of writing a shapefile? Can anyone give me some pointers on where to get started? For example, why does it iterate twice over the source? Any way to compress the result? Germany now gives me a shapefile of 70GB (!).

I'm more than willing to contribute the improvements to NTS as usual, but any help getting started would be appreciated.

Copied from original issue: NetTopologySuite/NetTopologySuite#154

airbreather commented 6 years ago

From @DGuidi on February 28, 2017 9:44

Any way to compress the result? Germany now gives me a shapefile of 70GB (!).

My 2 cents: shapefile is a binary format with a well-defined standard, so the size of the file is directly related to the size of the data. Maybe you can (I think):

simplify your geometries
create your own "ShapefileZipWriter" that directly generates a zipped archive of a shapefile (if possible)
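
The second suggestion could look roughly like the sketch below, written in Python rather than C# for brevity. It is not an NTS API: `write_zipped_shapefile` and its `parts` argument are hypothetical, and the byte chunks stand in for whatever a real shapefile writer would produce.

```python
import io
import zipfile
from typing import Dict, Iterable


def write_zipped_shapefile(target, base_name: str,
                           parts: Dict[str, Iterable[bytes]]) -> None:
    """Stream each shapefile part (.shp, .dbf, ...) straight into a ZIP archive.

    `parts` maps an extension such as ".shp" to an iterable of byte chunks.
    Hypothetical helper: the chunks stand in for real writer output.
    """
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for extension, chunks in parts.items():
            # ZipFile.open(..., "w") returns a writable stream, so a part is
            # compressed chunk by chunk without being held in memory as a whole.
            with archive.open(base_name + extension, "w") as member:
                for chunk in chunks:
                    member.write(chunk)


buffer = io.BytesIO()
write_zipped_shapefile(buffer, "roads", {
    ".shp": [b"\x00" * 1024, b"\x00" * 1024],
    ".dbf": [b"placeholder attribute rows"],
})
```

Two caveats with this approach in Python's `zipfile`: only one member can be open for writing at a time, so the parts are written one after another rather than in a single scan of the features, and members that may exceed 2 GiB need `force_zip64=True` passed to `archive.open`.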

airbreather commented 6 years ago

Oh... does BigEndianBinaryWriter need the same performance improvement I did in c6d2ccd? I see some calls to that class's naive WriteIntBE method. Maybe the corresponding reader too...
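
To make the idea concrete, here is a minimal Python sketch of the general technique: reusing one scratch buffer for every big-endian write instead of creating a new byte array per call. It does not reproduce the actual change in c6d2ccd or NTS's `BigEndianBinaryWriter`; the class and method names below are made up for illustration.

```python
import io
import struct


class BigEndianWriterSketch:
    """Hypothetical writer that reuses one 4-byte scratch buffer."""

    def __init__(self, stream):
        self._stream = stream
        self._scratch = bytearray(4)  # allocated once, reused for every write

    def write_int32_be(self, value: int) -> None:
        # Pack into the existing buffer rather than building a new bytes object.
        struct.pack_into(">i", self._scratch, 0, value)
        self._stream.write(self._scratch)


out = io.BytesIO()
writer = BigEndianWriterSketch(out)
writer.write_int32_be(9994)  # 9994 is the shapefile file code
writer.write_int32_be(-1)
```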

airbreather commented 6 years ago

If that doesn't help, could you please provide sample code and maybe a sample serialized RouterDb file so we can look at the same thing? I've got an old serialized routerdb file, but it dates back to the times when Itinero was part of OsmSharp, and I'm guessing you've changed stuff since then, and I really don't want to dedicate CPU time to rerun odp assuming it's still as slow as it was back then.

If I can just get that little bit of help, I'd love to spend time agonizing over this one.

airbreather commented 6 years ago

From @xivk on February 28, 2017 14:59

I'll try and build a sample application and do some profiling too, stay tuned. :-)

airbreather commented 6 years ago

I've done some exploration in my branch perf-exploration:

e2aba74 stops us from allocating extra on every big-endian write
578a508 stops us from flushing the output writer after every feature when we don't have to (i.e., when we're not writing a .shx file that needs to know)
aa6953d moves a heap allocation out of a loop
294ec76 handles writing the .shp, .dbf, and .shx (if any) files all in a single scan of the input features (also gets rid of this method's version of the heap allocation I moved out of that other loop)
9e8e1c1 lets us actually skip writing the .shx file if we don't want to (lots of the code already seems to assume that skipping it is possible).
baeb13f combines what used to be separate .Min() and .Max() loops into one, for both Z and M ordinates.
40a6b0d gets rid of another place where we would flush the stream before and after writing each feature; this time, we would just do it in order to throw an exception if there's a bug in our own code, which seems unnecessary (maybe we can bring it back with a property on the writer or something).
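
Several of these commits share one idea: visit each feature once and do all the bookkeeping during that visit. The Python sketch below illustrates it under simplifying assumptions: file headers are omitted, each record is an opaque payload of even length with a Z value attached, and the function name and tuple layout are hypothetical rather than NTS code. The record header and index entry layout (big-endian integers, with offsets and lengths counted in 16-bit words) follows the ESRI shapefile description.

```python
import io
import struct
from typing import Iterable, Tuple

HEADER_WORDS = 50  # the 100-byte file header, measured in 16-bit words


def write_records_single_pass(features: Iterable[Tuple[bytes, float]],
                              shp, shx=None) -> Tuple[float, float]:
    """Write record bodies (and index entries, if wanted) in one scan.

    Returns the (min, max) Z seen, tracked in the same loop.
    Payloads are assumed to have an even number of bytes.
    """
    offset_words = HEADER_WORDS
    z_min, z_max = float("inf"), float("-inf")
    for number, (payload, z) in enumerate(features, start=1):
        length_words = len(payload) // 2
        shp.write(struct.pack(">ii", number, length_words))
        shp.write(payload)
        if shx is not None:  # the index is optional in this sketch
            shx.write(struct.pack(">ii", offset_words, length_words))
        offset_words += 4 + length_words  # 8-byte record header = 4 words
        # Track both bounds here instead of separate min() and max() scans.
        z_min = min(z_min, z)
        z_max = max(z_max, z)
    # Flush once at the end rather than after every feature.
    shp.flush()
    if shx is not None:
        shx.flush()
    return z_min, z_max


shp_out, shx_out = io.BytesIO(), io.BytesIO()
features = ((b"\x00" * 20, float(i)) for i in range(3))  # a one-shot generator
z_range = write_records_single_pass(features, shp_out, shx_out)
```

Because the input is consumed exactly once, a generator works as the source, which would not be the case if the writer scanned the features a second time.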

xivk commented 5 years ago

Related to this issue: if we can get rid of the required count parameter in the header, we can use IEnumerables as input:

https://github.com/NetTopologySuite/NetTopologySuite.IO.ShapeFile/blob/master/NetTopologySuite.IO.GeoTools/ShapefileDataWriter.cs#L25

Maybe it's possible to write this count after all features have been enumerated?

I did notice a huge improvement already because we now seem to enumerate the collection only once instead of twice! :+1: :100:

airbreather commented 5 years ago

write this count after all features have been enumerated

A problem with this is that we would have to seek, which means that we'd start only supporting seekable streams (at least for callers that don't have a count to pass in, though branching like that adds maintenance cost), which may turn out to be really awkward.
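
The trade-off can be shown with a small Python sketch. The layout here is hypothetical (a 4-byte little-endian count followed by the records), not the real .dbf or .shp header, and `write_with_deferred_count` is an illustrative name. The point is only that back-patching the count needs `seek`, so a non-seekable stream such as a pipe or socket has to be rejected.

```python
import io
import struct
from typing import Iterable


def write_with_deferred_count(stream, records: Iterable[bytes]) -> int:
    """Write a placeholder count, stream the records, then patch the count."""
    if not stream.seekable():
        # Pipes, sockets, and some compression streams end up here.
        raise ValueError("deferred count requires a seekable stream")
    start = stream.tell()
    stream.write(struct.pack("<I", 0))  # placeholder, fixed up below
    count = 0
    for record in records:  # the input is enumerated exactly once
        stream.write(record)
        count += 1
    end = stream.tell()
    stream.seek(start)
    stream.write(struct.pack("<I", count))
    stream.seek(end)  # leave the position at the end of the data
    return count


out = io.BytesIO()
written = write_with_deferred_count(out, (b"row%d" % i for i in range(5)))
```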

xivk commented 5 years ago

Yes, I also considered that, but it's a pretty big show stopper for some use cases not to be able to 'stream' the data into the writer.

Maybe we should just stop using shapefiles ;-)

Anyway, for now, I count the features in another way before writing, so I'm not blocked on this.

KubaSzostak commented 1 month ago

Resolved by #48
