Closed by airbreather 3 months ago
From @DGuidi on February 28, 2017 9:44
Any way to compress the result? Germany now gives me a shapefile of 70GB (!).
My 2 cents: the shapefile format is a binary format with a well-defined standard, so the size of the file is directly related to the size of the data. Maybe you can (I think):
Oh... does BigEndianBinaryWriter need the same performance improvement I did in c6d2ccd? I see some calls to that class's naive WriteIntBE method. Maybe the corresponding reader too...
If that doesn't help, could you please provide sample code and maybe a sample serialized RouterDb file so we can look at the same thing? I've got an old serialized routerdb file, but it dates back to the times when Itinero was part of OsmSharp, and I'm guessing you've changed stuff since then, and I really don't want to dedicate CPU time to rerun odp assuming it's still as slow as it was back then.
If I can just get that little bit of help, I'd love to spend time agonizing over this one.
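For illustration, here's a minimal sketch of the kind of change I mean (not the actual BigEndianBinaryWriter code, and the "naive baseline allocates a temporary byte[] per value" part is an assumption): shift the bytes out through one reusable buffer instead of allocating on every write.

```csharp
using System;
using System.IO;

// Sketch only, not the real NTS class: big-endian writes through a single
// reusable buffer, so no temporary byte[] is allocated per value.
public sealed class BigEndianWriterSketch
{
    private readonly Stream _stream;
    private readonly byte[] _buffer = new byte[8];

    public BigEndianWriterSketch(Stream stream) => _stream = stream;

    public void WriteIntBE(int value)
    {
        // Most significant byte first (big-endian).
        _buffer[0] = (byte)(value >> 24);
        _buffer[1] = (byte)(value >> 16);
        _buffer[2] = (byte)(value >> 8);
        _buffer[3] = (byte)value;
        _stream.Write(_buffer, 0, 4);
    }

    public void WriteDoubleBE(double value)
    {
        long bits = BitConverter.DoubleToInt64Bits(value);
        for (int i = 0; i < 8; i++)
            _buffer[i] = (byte)(bits >> (56 - 8 * i));
        _stream.Write(_buffer, 0, 8);
    }
}
```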
From @xivk on February 28, 2017 14:59
I'll try and build a sample application and do some profiling too, stay tuned. :-)
I've done some exploration in my branch perf-exploration:

- Only require the feature count up-front when something actually needs it (it's only the .shx file that needs to know).
- Write the .shp, .dbf, and .shx (if any) files all in a single scan of the input features (also gets rid of this method's version of the heap allocation I moved out of that other loop).
- Don't write the .shx file if we don't want to (lots of stuff in the code seems to think that it might ever be possible for us not to write this).
- Combine the .Min() and .Max() loops into one, for both Z and M ordinates (see the sketch below).

Related to this issue, if we can get rid of the required count parameter in the header, we can use IEnumerables.
Maybe it's possible to write this count after all features have been enumerated?
I did notice a huge improvement already because we now seem to enumerate the collection only once instead of twice! :+1: :100:
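To make the last bullet in the exploration list above concrete, here is a minimal sketch (not the NTS implementation) of folding the separate .Min() and .Max() passes into a single loop; the same helper would cover both the Z and the M ordinates.

```csharp
using System.Collections.Generic;

static class OrdinateRange
{
    // Minimal sketch, not the NTS code: compute the minimum and maximum of
    // an ordinate sequence in one pass instead of enumerating it twice via
    // .Min() and .Max().
    public static (double Min, double Max) GetRange(IEnumerable<double> ordinates)
    {
        double min = double.PositiveInfinity;
        double max = double.NegativeInfinity;
        foreach (double value in ordinates)
        {
            if (value < min) min = value;
            if (value > max) max = value;
        }
        return (min, max);
    }
}

// Usage (hypothetical variable names): one pass per ordinate range instead of two.
// var (zMin, zMax) = OrdinateRange.GetRange(zValues);
// var (mMin, mMax) = OrdinateRange.GetRange(mValues);
```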
> write this count after all features have been enumerated
A problem with this is that we would have to seek, which means that we'd start only supporting seekable streams (at least for callers that don't have a count to pass in, though branching like that adds maintenance cost), which may turn out to be really awkward.
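For what it's worth, a minimal sketch of the seek-back idea (hypothetical layout, not the real shapefile header or writer): reserve space for the count, stream the records out while counting them, then seek back and patch the count. The stream.CanSeek check is exactly where the non-seekable-stream concern bites.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class DeferredCountSketch
{
    // Hypothetical sketch: write a placeholder where the count belongs,
    // count while writing the records, then seek back and patch the value.
    // Only works when the stream is seekable.
    public static void WriteWithDeferredCount(Stream stream, IEnumerable<byte[]> records)
    {
        if (!stream.CanSeek)
            throw new NotSupportedException("Deferring the count needs a seekable stream.");

        var writer = new BinaryWriter(stream);
        long countPosition = stream.Position;
        writer.Write(0); // placeholder for the record count

        int count = 0;
        foreach (byte[] record in records)
        {
            writer.Write(record);
            count++;
        }

        writer.Flush();
        long endPosition = stream.Position;
        stream.Position = countPosition;
        writer.Write(count); // patch the real count into the reserved slot
        stream.Position = endPosition;
    }
}
```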
Yes, I also considered that, but it's a pretty big show stopper for some use cases not to be able to 'stream' the data into the writer.
Maybe we should just stop using shapefiles ;-)
Anyway, for now, I count the features in another way before writing, so I'm not blocked on this.
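A hypothetical illustration of that workaround (the names and the writer callback are made up, not Itinero or NTS API): take the count from something cheap, such as the edge id collection itself, and keep the expensive feature building as a lazy IEnumerable so the writer still enumerates it only once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ExportSketch
{
    // Hypothetical workaround sketch: the count comes from a cheap source
    // (the edge id collection), while the expensive feature construction
    // stays lazy and is enumerated exactly once by the writer.
    public static void Export<TFeature>(
        IReadOnlyCollection<long> edgeIds,
        Func<long, TFeature> buildFeature,
        Action<IEnumerable<TFeature>, int> writeShapefile) // hypothetical writer
    {
        int count = edgeIds.Count;                                     // no features built here
        IEnumerable<TFeature> features = edgeIds.Select(buildFeature); // lazy, single pass
        writeShapefile(features, count);
    }
}
```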
Resolved by #48
From @xivk on February 28, 2017 9:32
I'm working on a tool for a client of mine that converts OSM data into a shapefile that can easily be consumed for routing and analytics.
Basically I have 2 steps:

1. Convert the OSM data into an Itinero RouterDb.
2. Write the road network in that RouterDb to a shapefile.
Now Itinero can handle the entire world, the planet OSM file, and build a database for the entire world. I don't think it's even feasible to write that to a shapefile, but I'm having performance issues at smaller scales too: for a country like Germany it takes about 10 hours (!) to write a shapefile with all German roads. Doing this with Itinero alone takes about 30 mins.
I have checked and the bottleneck clearly is the shapefile writer.
So I'm wondering, is it possible to improve the performance of writing a shapefile? Can anyone give me some pointers on where to get started? For example, why does it iterate twice over the source? Any way to compress the result? Germany now gives me a shapefile of 70GB (!).
I'm more than willing to contribute the improvements to NTS as usual, but any help getting started would be appreciated.
Copied from original issue: NetTopologySuite/NetTopologySuite#154