geotrellis / vectorpipe

Convert Vector data to VectorTiles with GeoTrellis.
https://geotrellis.github.io/vectorpipe/

OSM => VectorTiles :: (2) Parsing raw OSM data into RDD[OSMElement] #2

Closed: fosskers closed this issue 7 years ago

fosskers commented 8 years ago

This is Part 2 of a series of issues documenting the process of creating a world's worth of VectorTiles from OSM Planet data. Please use these issues to discuss solutions.

A 55GB XML file (or 33GB Protobuf file) can't be trivially loaded into RAM for parsing.

Questions:

  1. Is it at all performant to do a serial streaming read of the raw data, writing out each parsed Element one at a time, which could then be reread as some RDD[OSMElement]?
  2. If not, is it possible to do a streaming scan of the file, identify logical "split" points, split the file, and then leverage Spark to parse each block separately?
  3. If so, is it possible to do a similar scan over the Protobuf? What would the "logical split points" be?
  4. How much custom parsing code would need to be written for each option?
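To make question 1 concrete, here is a minimal sketch of a serial streaming read of OSM XML using the JDK's StAX pull parser, which holds only the current element in memory. The `Elem`, `NodeElem`, `WayElem` and `StreamingOsm` names are hypothetical stand-ins for illustration and are not vectorpipe's actual `OSMElement` model. Tags, relations and metadata are ignored to keep the sketch short.

```scala
import java.io.InputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

// Hypothetical minimal element model (assumption, not vectorpipe's OSMElement).
sealed trait Elem
final case class NodeElem(id: Long, lat: Double, lon: Double) extends Elem
final case class WayElem(id: Long, refs: Vector[Long]) extends Elem

object StreamingOsm {
  /** Pull-parse OSM XML from `in`, handing each element to `emit` as soon as it is complete. */
  def parse(in: InputStream)(emit: Elem => Unit): Unit = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(in)
    var wayId = -1L
    var refs = Vector.empty[Long]
    try {
      while (reader.hasNext) {
        val event = reader.next()
        if (event == XMLStreamConstants.START_ELEMENT) {
          reader.getLocalName match {
            case "node" =>
              // Emitted at the opening tag because child <tag> elements are ignored here.
              emit(NodeElem(
                reader.getAttributeValue(null, "id").toLong,
                reader.getAttributeValue(null, "lat").toDouble,
                reader.getAttributeValue(null, "lon").toDouble))
            case "way" =>
              wayId = reader.getAttributeValue(null, "id").toLong
              refs = Vector.empty
            case "nd" =>
              refs = refs :+ reader.getAttributeValue(null, "ref").toLong
            case _ => () // relations, tags, bounds: skipped in this sketch
          }
        } else if (event == XMLStreamConstants.END_ELEMENT && reader.getLocalName == "way") {
          // A way is only complete once all of its <nd> children have been seen.
          emit(WayElem(wayId, refs))
        }
      }
    } finally reader.close()
  }
}
```

The `emit` callback is where each element would be written out (for example one record per line) so that Spark could later reread the output as an RDD. Whether a single-threaded pass like this is fast enough over the full Planet file is exactly the open question.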

fosskers commented 8 years ago

Consider for a serial ingest of Protobuf: https://github.com/topobyte/osm4j-pbf/blob/master/core/src/main/java/de/topobyte/osm4j/pbf/seq/PbfIterator.java#L47
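On question 3, one candidate for "logical split points" in the Protobuf file is the blob boundary. As far as I understand the OSM PBF layout, the file is a sequence of records, each made of a 4-byte big-endian `BlobHeader` length, the `BlobHeader` message (`type` = field 1, `datasize` = field 3), and then `datasize` bytes of `Blob`. The sketch below is an assumption-laden illustration, not osm4j's or vectorpipe's code: it walks those records without decompressing any payload and records the byte offset of each one. `PbfBlobScan` and `BlobRef` are hypothetical names, and the header is decoded by hand to avoid a protobuf dependency.

```scala
import java.io.{DataInputStream, EOFException, InputStream}

object PbfBlobScan {
  // Hypothetical record describing where one blob starts and how large its payload is.
  final case class BlobRef(offset: Long, blobType: String, dataSize: Int)

  // Decode one protobuf base-128 varint at `pos`; returns (value, position after it).
  private def varint(buf: Array[Byte], pos: Int): (Long, Int) = {
    var result = 0L
    var shift = 0
    var i = pos
    var more = true
    while (more) {
      val b = buf(i) & 0xff
      result |= (b & 0x7f).toLong << shift
      shift += 7
      i += 1
      more = (b & 0x80) != 0
    }
    (result, i)
  }

  // Extract `type` (field 1) and `datasize` (field 3) from a serialized BlobHeader.
  private def parseHeader(buf: Array[Byte]): (String, Int) = {
    var pos = 0
    var tpe = ""
    var size = -1
    while (pos < buf.length) {
      val (key, afterKey) = varint(buf, pos)
      val field = (key >>> 3).toInt
      (key & 7).toInt match {
        case 0 => // varint field
          val (v, next) = varint(buf, afterKey)
          if (field == 3) size = v.toInt
          pos = next
        case 2 => // length-delimited field (string or bytes)
          val (len, start) = varint(buf, afterKey)
          if (field == 1) tpe = new String(buf, start, len.toInt, "UTF-8")
          pos = start + len.toInt
        case other =>
          throw new IllegalStateException(s"unexpected wire type $other in BlobHeader")
      }
    }
    if (size < 0) throw new IllegalStateException("BlobHeader without datasize")
    (tpe, size)
  }

  /** Serially scan the stream and list every blob boundary, skipping the payloads. */
  def scan(in: InputStream): Vector[BlobRef] = {
    val data = new DataInputStream(in)
    val out = Vector.newBuilder[BlobRef]
    var offset = 0L
    var done = false
    while (!done) {
      val headerLen = try data.readInt() catch { case _: EOFException => -1 }
      if (headerLen < 0) done = true
      else {
        val header = new Array[Byte](headerLen)
        data.readFully(header)
        val (tpe, size) = parseHeader(header)
        out += BlobRef(offset, tpe, size)
        var remaining = size
        while (remaining > 0) {
          val n = data.skipBytes(remaining)
          if (n <= 0) throw new EOFException("truncated blob")
          remaining -= n
        }
        offset += 4L + headerLen + size
      }
    }
    out.result()
  }
}
```

If this reading of the format is right, the resulting offsets could be distributed to Spark tasks, each of which opens the file, seeks to its offset, and decodes its blob independently. The scan itself stays serial but only touches the small headers.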

fosskers commented 7 years ago

#6 has addressed a lot of this.