geotrellis / vectorpipe

Convert Vector data to VectorTiles with GeoTrellis.
https://geotrellis.github.io/vectorpipe/

OSM => VectorTiles :: (2) Reading `RDD[Element]` from an ORC file #8

Closed fosskers closed 7 years ago

fosskers commented 7 years ago

ORC files lend themselves to being read in parallel, and via AWS Athena they can apparently be queried with SQL. This gives us a window into the data via Spark, letting us parse the entire Earth in parallel, which wasn't possible with our XML approach.

Questions:

mojodna commented 7 years ago

 /* Enable ORC predicate push-down so filters run inside the reader */
 spark.conf.set("spark.sql.orc.filterPushdown", true)

 val df = spark.read.format("orc").load("planet.orc")

 df.select("id", "tags", "lat", "lon").limit(25).show()

It's also possible to register tables in the Hive metastore, but I don't have examples for that handy.
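
A rough sketch of what that registration might look like, assuming Spark 2.2+ built with Hive support; the `planet` table name and S3 path here are hypothetical:

 import org.apache.spark.sql.SparkSession

 /* Sketch only: a session whose catalog is backed by the Hive metastore */
 val spark = SparkSession.builder()
   .appName("orc-hive")
   .enableHiveSupport()
   .getOrCreate()

 /* Register an external table over the ORC file. The schema is inferred
  * from the ORC metadata, and the entry persists across sessions. */
 spark.sql("CREATE TABLE IF NOT EXISTS planet USING orc LOCATION 's3://some-bucket/planet.orc'")

 spark.sql("SELECT id, tags, lat, lon FROM planet LIMIT 25").show()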

fosskers commented 7 years ago

We'll probably avoid Hive. I've never used it and am not sure what role it would play. Either way, we'd like the simplest possible architecture, one in which the following function is possible (assuming an `ORCResult` return type for each row returned from the select):

theWholePlanet: String => RDD[ORCResult]
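
Something along these lines might satisfy that signature (a sketch assuming Spark 2.x; `ORCResult` here is a hypothetical case class matching the selected columns, with `lat`/`lon` as `BigDecimal` since the OSM ORC schema stores them as decimals):

 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.SparkSession

 /* Hypothetical row type for the four columns selected below */
 case class ORCResult(id: Long, tags: Map[String, String], lat: BigDecimal, lon: BigDecimal)

 /* Sketch: ORC path in, RDD of typed rows out */
 def theWholePlanet(path: String)(implicit spark: SparkSession): RDD[ORCResult] = {
   import spark.implicits._

   spark.read.format("orc").load(path)
     .select("id", "tags", "lat", "lon")
     .as[ORCResult]  /* typed via the case class encoder */
     .rdd
 }
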
fosskers commented 7 years ago

Minimal example.

 package playground

 import org.apache.spark._
 import org.apache.spark.sql._
 import org.apache.spark.sql.hive._

 // --- //                                                                                                

 object Orc {
   def main(args: Array[String]): Unit = {

     val conf = new SparkConf()
       .setMaster("local[*]")
       .setAppName("orc")

     implicit val sc: SparkContext = new SparkContext(conf)
     implicit val hc: HiveContext = new HiveContext(sc)

     /* Necessary for "predicate push-down" */
     hc.setConf("spark.sql.orc.filterPushdown", "true")

     /* Read the ORC file into a DataFrame */
     val df = hc.read.format("orc").load("JP.orc")

     val res: Dataset[Row] = df.select("id", "tags", "lat", "lon").limit(25)

     /* `println(res)` only prints a schema summary; `show()` renders the rows */
     res.show()

     /* Safely stop the Spark Context */
     sc.stop()
   }
 }

which prints

+--------+-----+----------+-----------+
|      id| tags|       lat|        lon|
+--------+-----+----------+-----------+
|31236558|Map()|35.6351506|139.7678432|
|31236561|Map()|35.6350419|139.7678590|
|31236562|Map()|35.6379133|139.7591729|
|31236563|Map()|35.6381829|139.7585998|
|31236564|Map()|35.6385161|139.7582349|
|31236565|Map()|35.6387672|139.7580599|
|31236566|Map()|35.6390562|139.7579335|
|31236567|Map()|35.6391582|139.7579045|
|31236568|Map()|35.6393004|139.7578741|
|31236569|Map()|35.6378794|139.7589461|
|31236570|Map()|35.6379975|139.7586995|
|31236571|Map()|35.6382569|139.7583386|
|31236572|Map()|35.6384817|139.7581248|
|31236573|Map()|35.6389807|139.7578382|
|31236574|Map()|35.6393364|139.7577578|
|31236575|Map()|35.6395142|139.7577450|
|31236576|Map()|35.6347773|139.7699528|
|31236579|Map()|35.6349131|139.7705506|
|31236580|Map()|35.6350156|139.7709312|
|31236581|Map()|35.6356548|139.7727102|
+--------+-----+----------+-----------+
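
The remaining step for this ticket would then be mapping those rows onto OSM elements. Continuing from the `df` above, a rough sketch using a simplified, hypothetical stand-in for the `Element` type (the real hierarchy also carries version, changeset, timestamp, etc.):

 import org.apache.spark.rdd.RDD

 /* Hypothetical, simplified stand-in for an OSM node element */
 case class Node(id: Long, lat: Double, lon: Double, tags: Map[String, String])

 val nodes: RDD[Node] = df.select("id", "tags", "lat", "lon").rdd.map { row =>
   Node(
     id   = row.getAs[Long]("id"),
     /* lat/lon are stored as decimals in the ORC schema */
     lat  = row.getAs[java.math.BigDecimal]("lat").doubleValue,
     lon  = row.getAs[java.math.BigDecimal]("lon").doubleValue,
     tags = row.getAs[Map[String, String]]("tags")
   )
 }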