spark.conf.set("spark.sql.orc.filterPushdown", true)
val df = spark.read.format("orc").load("planet.orc")
df.select("id", "tags", "lat", "lon").limit(25).show()
It's also possible to register tables in the Hive metastore, but I don't have examples for that handy.
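For completeness, a rough sketch of what that registration could look like, assuming the same `spark` session as above, write access to a metastore, and a hypothetical table name "planet":

val df = spark.read.format("orc").load("planet.orc")

// Persist the DataFrame as a metastore-backed ORC table...
df.write.format("orc").saveAsTable("planet")

// ...after which it can be queried by name with plain SQL.
spark.sql("SELECT id, tags, lat, lon FROM planet LIMIT 25").show()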
We'll probably avoid Hive; I've never used it and am not sure what role it would play. Either way, we'd like the simplest possible architecture such that the following function is possible (assuming an ORCResult return type for each row returned from select):
theWholePlanet: String => RDD[ORCResult]
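A rough sketch of how that signature might be satisfied, assuming a hypothetical ORCResult case class mirroring the four columns used below, and the HiveContext from the minimal example:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.hive.HiveContext

// Hypothetical result type; the fields mirror the selected columns.
case class ORCResult(id: Long, tags: Map[String, String], lat: Double, lon: Double)

// Sketch only: read the ORC file, keep the four columns, and map each
// Row into the case class via the Dataset's underlying RDD.
def theWholePlanet(path: String)(implicit hc: HiveContext): RDD[ORCResult] =
  hc.read.format("orc").load(path)
    .select("id", "tags", "lat", "lon")
    .rdd
    .map { row =>
      ORCResult(
        row.getAs[Long]("id"),
        row.getAs[scala.collection.Map[String, String]]("tags").toMap,
        row.getAs[Double]("lat"),
        row.getAs[Double]("lon")
      )
    }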
Minimal example.
package playground
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.hive._
// --- //
object Orc {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("orc")
implicit val sc: SparkContext = new SparkContext(conf)
implicit val hc: HiveContext = new HiveContext(sc)
/* Necessary for "predicate push-down" */
hc.setConf("spark.sql.orc.filterPushdown", "true")
/* Read the ORC file into a DataFrame */
val df = hc.read.format("orc").load("JP.orc")
val res: Dataset[Row] = df.select("id", "tags", "lat", "lon").limit(25)
/* `println(res)` only prints the schema summary, not the rows */
res.show()
/* Safely stop the Spark Context */
sc.stop()
}
}
which prints
+--------+-----+----------+-----------+
| id| tags| lat| lon|
+--------+-----+----------+-----------+
|31236558|Map()|35.6351506|139.7678432|
|31236561|Map()|35.6350419|139.7678590|
|31236562|Map()|35.6379133|139.7591729|
|31236563|Map()|35.6381829|139.7585998|
|31236564|Map()|35.6385161|139.7582349|
|31236565|Map()|35.6387672|139.7580599|
|31236566|Map()|35.6390562|139.7579335|
|31236567|Map()|35.6391582|139.7579045|
|31236568|Map()|35.6393004|139.7578741|
|31236569|Map()|35.6378794|139.7589461|
|31236570|Map()|35.6379975|139.7586995|
|31236571|Map()|35.6382569|139.7583386|
|31236572|Map()|35.6384817|139.7581248|
|31236573|Map()|35.6389807|139.7578382|
|31236574|Map()|35.6393364|139.7577578|
|31236575|Map()|35.6395142|139.7577450|
|31236576|Map()|35.6347773|139.7699528|
|31236579|Map()|35.6349131|139.7705506|
|31236580|Map()|35.6350156|139.7709312|
|31236581|Map()|35.6356548|139.7727102|
+--------+-----+----------+-----------+
ORC files lend themselves to being read in parallel, and via AWS Athena they can apparently be queried with SQL. This gives us a window into the data via Spark, letting us parse the entire Earth in parallel, which wasn't possible with our XML approach.
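As a rough check of that parallelism, the partition count of the DataFrame's underlying RDD can be inspected (reusing df from the example above):

// ORC stripes become input splits, so a large file should load as
// many partitions, each processed in parallel by Spark.
println(df.rdd.getNumPartitions)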
Questions:
Can the results of a select be converted to an RDD easily?
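If it helps: Dataset exposes an .rdd method, so something like the following (untested against this data) yields an RDD[Row] to map over:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Reusing `res` from the minimal example above.
val rows: RDD[Row] = res.rdd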