This is Part 2 of a series of issues documenting the process of creating a world's worth of VectorTiles from OSM Planet data. Please use these issues to discuss solutions.
A 55GB XML file (or 33GB Protobuf file) can't be trivially loaded into RAM for parsing.
Questions:
Is it at all performant to do a serial streaming read of the raw data, writing out each parsed `Element` one at a time, so that the results could then be reread as an `RDD[OSMElement]`?
If not, is it possible to do a streaming scan of the file, identify logical "split" points, split the file, and then leverage Spark to parse each block separately?
If so, is it possible to do a similar scan over the Protobuf? What would the "logical split points" be?
How much custom parsing code would need to be written for each option?
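On the first question: a serial streaming pass over the XML needs very little custom code if an event-based parser is used. Here is a minimal Python sketch (the JSON-lines output format and function name are illustrative, not a proposal for the final pipeline) that walks the file element by element without ever holding the full tree in RAM; the resulting newline-delimited records are exactly the kind of thing Spark can later re-read in parallel as an `RDD`:

```python
import json
import xml.etree.ElementTree as ET

def stream_osm_elements(xml_source, out):
    """Serially stream an OSM XML file, writing one JSON record per
    parsed element (node/way/relation). Memory stays bounded because
    each subtree is cleared as soon as it has been written out."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag in ("node", "way", "relation"):
            record = {
                "type": elem.tag,
                "attrs": dict(elem.attrib),
                "tags": {t.get("k"): t.get("v") for t in elem.findall("tag")},
            }
            out.write(json.dumps(record) + "\n")
            elem.clear()  # drop the already-serialized subtree
```

The trade-off is that this pass is inherently single-threaded and I/O-bound, which is exactly why the "find split points and parse blocks in parallel" alternative in the second question is worth considering.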
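On the Protobuf question: the `.osm.pbf` format is a sequence of length-prefixed `(BlobHeader, Blob)` pairs, and each Blob's payload is compressed and parseable independently, so Blob boundaries are the natural "logical split points." A rough Python sketch of scanning those boundaries follows (it hand-decodes only the `datasize` field of `BlobHeader`, assuming the standard `fileformat.proto` schema; the helper names are mine):

```python
import struct

def pbf_blob_offsets(f):
    """Scan a .osm.pbf stream, yielding (offset, length) for each Blob.
    Each Blob can be handed to a separate worker and parsed in parallel."""
    while True:
        raw = f.read(4)
        if len(raw) < 4:
            break
        header_len = struct.unpack(">i", raw)[0]  # big-endian int32 prefix
        datasize = _blobheader_datasize(f.read(header_len))
        offset = f.tell()
        f.seek(datasize, 1)  # skip over the Blob body itself
        yield offset, datasize

def _blobheader_datasize(buf):
    """Minimal hand-rolled protobuf decode of BlobHeader, extracting
    field 3 (datasize); fields 1 (type) and 2 (indexdata) are skipped."""
    i, datasize = 0, None
    while i < len(buf):
        key, i = _varint(buf, i)
        field, wire = key >> 3, key & 7
        if wire == 0:  # varint-encoded scalar
            val, i = _varint(buf, i)
            if field == 3:
                datasize = val
        elif wire == 2:  # length-delimited: skip the payload
            length, i = _varint(buf, i)
            i += length
        else:
            raise ValueError("unexpected wire type %d" % wire)
    return datasize

def _varint(buf, i):
    """Decode one base-128 varint from buf starting at index i."""
    result = shift = 0
    while True:
        b = buf[i]
        i += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, i
        shift += 7
```

Because the scan only reads the small headers and seeks past the payloads, indexing all split points in a 33 GB file should be fast; the heavy decompression and entity decoding can then happen per-Blob inside Spark tasks.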