NetTopologySuite / NetTopologySuite.IO.ShapeFile

The ShapeFile IO module for NTS.

Shapefile writer performance improvements? #2

Closed airbreather closed 1 month ago

airbreather commented 6 years ago

From @xivk on February 28, 2017 9:32

I'm working on a tool for a client of mine that converts OSM data into a shapefile that can easily be consumed for routing and analytics.

Basically I have 2 steps:

  1. Building a routerdb with Itinero.
  2. Writing the result as a shapefile.

Now Itinero can handle the planet OSM file and build a database for the entire world. I don't think it's even possible to attempt writing that to a shapefile, but I am already having performance issues at a smaller scale: for a country like Germany, it takes about 10 hours (!) to write a shapefile with all German roads. The Itinero-only step takes about 30 minutes.

I have checked and the bottleneck clearly is the shapefile writer.

So I'm wondering, is it possible to improve the performance of writing a shapefile? Can anyone give me some pointers on where to get started? For example, why does it iterate twice over the source? Any way to compress the result? Germany now gives me a shapefile of 70GB (!).

I'm more than willing to contribute the improvements to NTS as usual, but any help getting started would be appreciated.

Copied from original issue: NetTopologySuite/NetTopologySuite#154

airbreather commented 6 years ago

From @DGuidi on February 28, 2017 9:44

> Any way to compress the result? Germany now gives me a shapefile of 70GB (!).

My two cents: the shapefile is a binary format with a well-defined standard, so the size of the file is directly related to the size of the data. Maybe you can (I think):

  1. simplify your geometries
  2. create your own "ShapefileZipWriter" that directly generates a zipped archive of a shapefile (if possible)
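
The second suggestion can be sketched as follows. This is a minimal illustration in Python rather than the library's own .NET code, and it is not an existing NTS API: the helper name `write_zipped`, the member names, and the placeholder bytes are all assumptions. It only shows the idea of streaming the parts of a shapefile straight into a compressed archive instead of writing them uncompressed first.

```python
import io
import zipfile


def write_zipped(archive, members):
    """Stream each member's chunks straight into a ZIP archive.

    `members` maps a file name (hypothetical, e.g. "roads.shp") to an
    iterable of byte chunks, so nothing is buffered uncompressed.
    """
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, chunks in members.items():
            # ZipFile.open(..., "w") returns a writable file-like object,
            # so records can be emitted incrementally. force_zip64 allows
            # members larger than 4 GiB.
            with zf.open(name, "w", force_zip64=True) as entry:
                for chunk in chunks:
                    entry.write(chunk)


# Placeholder bytes stand in for real shapefile parts.
buffer = io.BytesIO()
write_zipped(buffer, {
    "roads.shp": (b"\x00" * 1024 for _ in range(64)),
    "roads.dbf": (b" " * 512 for _ in range(64)),
})
```

How much this saves depends entirely on how compressible the geometry and attribute data are, and consumers would still need to unzip the archive before reading it.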

airbreather commented 6 years ago

Oh... does BigEndianBinaryWriter need the same performance improvement I did in c6d2ccd? I see some calls to that class's naive WriteIntBE method. Maybe the corresponding reader too...
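
To illustrate the kind of difference being discussed, here is a rough Python sketch. It is not the code from c6d2ccd and not the actual `WriteIntBE` implementation (neither is shown in this thread); it assumes that "naive" means one small write per byte, and both function names are hypothetical.

```python
import io
import struct


def write_int_be_naive(stream, value):
    # Assumed naive pattern: one write call per byte,
    # so four small writes for every 32-bit integer.
    for shift in (24, 16, 8, 0):
        stream.write(bytes([(value >> shift) & 0xFF]))


def write_ints_be_batched(stream, values):
    # Pack all integers into one big-endian buffer and write it once.
    values = list(values)
    stream.write(struct.pack(f">{len(values)}i", *values))


naive, batched = io.BytesIO(), io.BytesIO()
sample = [1, 9994, -2]
for v in sample:
    write_int_be_naive(naive, v)
write_ints_be_batched(batched, sample)
```

Both paths produce the same bytes; the batched version simply makes fewer calls into the underlying stream.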

airbreather commented 6 years ago

If that doesn't help, could you please provide sample code and maybe a sample serialized RouterDb file so we can look at the same thing? I've got an old serialized routerdb file, but it dates back to when Itinero was part of OsmSharp, and I'm guessing you've changed stuff since then. I really don't want to dedicate CPU time to rerunning odp, assuming it's still as slow as it was back then.

If I can just get that little bit of help, I'd love to spend time agonizing over this one.

airbreather commented 6 years ago

From @xivk on February 28, 2017 14:59

I'll try and build a sample application and do some profiling too, stay tuned. :-)

airbreather commented 6 years ago

I've done some exploration in my branch perf-exploration:

xivk commented 5 years ago

Related to this issue: if we can get rid of the required count parameter for the header, we can use IEnumerables as input:

https://github.com/NetTopologySuite/NetTopologySuite.IO.ShapeFile/blob/master/NetTopologySuite.IO.GeoTools/ShapefileDataWriter.cs#L25

Maybe it's possible to write this count after all features have been enumerated?

I did notice a huge improvement already, because we now seem to enumerate the collection only once instead of twice! :+1: :100:

airbreather commented 5 years ago

> write this count after all features have been enumerated

A problem with this is that we would have to seek, which means that we'd start only supporting seekable streams (at least for callers that don't have a count to pass in, though branching like that adds maintenance cost), which may turn out to be really awkward.
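
The trade-off can be sketched in Python with a toy layout (a 4-byte little-endian record count followed by the records). This is an assumption made for illustration, not the real shapefile or dBASE header layout and not NTS code, and `write_with_deferred_count` is a hypothetical name.

```python
import io
import struct


def write_with_deferred_count(stream, records):
    """Write a toy file: a 4-byte count header, then the records."""
    if not stream.seekable():
        # This is the limitation: pipes and network streams cannot be patched.
        raise ValueError("deferred count needs a seekable stream")
    start = stream.tell()
    stream.write(struct.pack("<I", 0))      # placeholder count
    count = 0
    for record in records:                  # single pass; any iterable works
        stream.write(record)
        count += 1
    end = stream.tell()
    stream.seek(start)
    stream.write(struct.pack("<I", count))  # patch in the real count
    stream.seek(end)
    return count


out = io.BytesIO()
written = write_with_deferred_count(out, (b"rec%d" % i for i in range(3)))
```

The input can be a lazy generator, but the output has to support seeking, which is exactly the restriction described above.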

xivk commented 5 years ago

Yes, I also considered that, but not being able to 'stream' the data into the writer is a pretty big show stopper for some use cases.

Maybe we should just stop using shapefiles ;-)

Anyway, for now I count the features another way before writing, so I'm not blocked on this.

KubaSzostak commented 1 month ago

Resolved by #48