This is Part 2 of a series of issues documenting the process of creating a world's worth of VectorTiles from OSM Planet data. Please use these issues to discuss solutions.
A 55GB XML file (or 33GB Protobuf file) can't be trivially loaded into RAM for parsing.
Questions:
Is it at all performant to do a serial streaming read of the raw data, writing out each parsed `Element` one at a time, so that the results could then be reread as an `RDD[OSMElement]`?
If not, is it possible to do a streaming scan of the file, identify logical "split" points, split the file, and then leverage Spark to parse each block separately?
If so, is it possible to do a similar scan over the Protobuf? What would the "logical split points" be?
How much custom parsing code would need to be written for each option?
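On the first question: a serial streaming pass over the XML needs very little custom code if an event-based parser is used. Here is a minimal Python sketch (the JSON-lines output format and function name are illustrative, not a proposal for the final pipeline) that walks the file element by element without ever holding the full tree in RAM; the resulting newline-delimited records are exactly the kind of thing Spark can later re-read in parallel as an `RDD`:

```python
import json
import xml.etree.ElementTree as ET

def stream_osm_elements(xml_source, out):
    """Serially stream an OSM XML file, writing one JSON record per
    parsed element (node/way/relation). Memory stays bounded because
    each subtree is cleared as soon as it has been written out."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag in ("node", "way", "relation"):
            record = {
                "type": elem.tag,
                "attrs": dict(elem.attrib),
                "tags": {t.get("k"): t.get("v") for t in elem.findall("tag")},
            }
            out.write(json.dumps(record) + "\n")
            elem.clear()  # drop the already-serialized subtree
```

The trade-off is that this pass is inherently single-threaded and I/O-bound, which is exactly why the "find split points and parse blocks in parallel" alternative in the second question is worth considering.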
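On the Protobuf question: the `.osm.pbf` format is a sequence of length-prefixed `(BlobHeader, Blob)` pairs, and each Blob's payload is compressed and parseable independently, so Blob boundaries are the natural "logical split points." A rough Python sketch of scanning those boundaries follows (it hand-decodes only the `datasize` field of `BlobHeader`, assuming the standard `fileformat.proto` schema; the helper names are mine):

```python
import struct

def pbf_blob_offsets(f):
    """Scan a .osm.pbf stream, yielding (offset, length) for each Blob.
    Each Blob can be handed to a separate worker and parsed in parallel."""
    while True:
        raw = f.read(4)
        if len(raw) < 4:
            break
        header_len = struct.unpack(">i", raw)[0]  # big-endian int32 prefix
        datasize = _blobheader_datasize(f.read(header_len))
        offset = f.tell()
        f.seek(datasize, 1)  # skip over the Blob body itself
        yield offset, datasize

def _blobheader_datasize(buf):
    """Minimal hand-rolled protobuf decode of BlobHeader, extracting
    field 3 (datasize); fields 1 (type) and 2 (indexdata) are skipped."""
    i, datasize = 0, None
    while i < len(buf):
        key, i = _varint(buf, i)
        field, wire = key >> 3, key & 7
        if wire == 0:  # varint-encoded scalar
            val, i = _varint(buf, i)
            if field == 3:
                datasize = val
        elif wire == 2:  # length-delimited: skip the payload
            length, i = _varint(buf, i)
            i += length
        else:
            raise ValueError("unexpected wire type %d" % wire)
    return datasize

def _varint(buf, i):
    """Decode one base-128 varint from buf starting at index i."""
    result = shift = 0
    while True:
        b = buf[i]
        i += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, i
        shift += 7
```

Because the scan only reads the small headers and seeks past the payloads, indexing all split points in a 33 GB file should be fast; the heavy decompression and entity decoding can then happen per-Blob inside Spark tasks.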