spark.conf.set("spark.sql.orc.filterPushdown", true)
val df = spark.read.format("orc").load("planet.orc")
df.select("id", "tags", "lat", "lon").limit(25).show()
It's also possible to register tables in the Hive metastore, but I don't have examples for that handy.
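For completeness, a rough sketch of what that registration could look like, assuming the same `spark` session as above, write access to a metastore, and a hypothetical table name "planet":

val df = spark.read.format("orc").load("planet.orc")

// Persist the DataFrame as a metastore-backed ORC table...
df.write.format("orc").saveAsTable("planet")

// ...after which it can be queried by name with plain SQL.
spark.sql("SELECT id, tags, lat, lon FROM planet LIMIT 25").show()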
We'll probably avoid Hive; I've never used it and am not sure what role it would play. Either way, we'd like the simplest possible architecture such that the following function is possible (assuming an ORCResult return type for each row returned from select):
theWholePlanet: String => RDD[ORCResult]
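A rough sketch of how that signature might be satisfied, assuming a hypothetical ORCResult case class mirroring the four columns used below, and the HiveContext from the minimal example:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.hive.HiveContext

// Hypothetical result type; the fields mirror the selected columns.
case class ORCResult(id: Long, tags: Map[String, String], lat: Double, lon: Double)

// Sketch only: read the ORC file, keep the four columns, and map each
// Row into the case class via the Dataset's underlying RDD.
def theWholePlanet(path: String)(implicit hc: HiveContext): RDD[ORCResult] =
  hc.read.format("orc").load(path)
    .select("id", "tags", "lat", "lon")
    .rdd
    .map { row =>
      ORCResult(
        row.getAs[Long]("id"),
        row.getAs[scala.collection.Map[String, String]]("tags").toMap,
        row.getAs[Double]("lat"),
        row.getAs[Double]("lon")
      )
    }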
Minimal example.
package playground
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.hive._
// --- //
object Orc {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("orc")
implicit val sc: SparkContext = new SparkContext(conf)
implicit val hc: HiveContext = new HiveContext(sc)
/* Necessary for "predicate push-down" */
hc.setConf("spark.sql.orc.filterPushdown", "true")
/* Read the ORC file into a DataFrame */
val df = hc.read.format("orc").load("JP.orc")
val res: Dataset[Row] = df.select("id", "tags", "lat", "lon").limit(25)
/* `println(res)` only prints the schema summary, not the rows */
res.show()
/* Safely stop the Spark Context */
sc.stop()
}
}
which prints
+--------+-----+----------+-----------+
| id| tags| lat| lon|
+--------+-----+----------+-----------+
|31236558|Map()|35.6351506|139.7678432|
|31236561|Map()|35.6350419|139.7678590|
|31236562|Map()|35.6379133|139.7591729|
|31236563|Map()|35.6381829|139.7585998|
|31236564|Map()|35.6385161|139.7582349|
|31236565|Map()|35.6387672|139.7580599|
|31236566|Map()|35.6390562|139.7579335|
|31236567|Map()|35.6391582|139.7579045|
|31236568|Map()|35.6393004|139.7578741|
|31236569|Map()|35.6378794|139.7589461|
|31236570|Map()|35.6379975|139.7586995|
|31236571|Map()|35.6382569|139.7583386|
|31236572|Map()|35.6384817|139.7581248|
|31236573|Map()|35.6389807|139.7578382|
|31236574|Map()|35.6393364|139.7577578|
|31236575|Map()|35.6395142|139.7577450|
|31236576|Map()|35.6347773|139.7699528|
|31236579|Map()|35.6349131|139.7705506|
|31236580|Map()|35.6350156|139.7709312|
|31236581|Map()|35.6356548|139.7727102|
+--------+-----+----------+-----------+
ORC files lend themselves to being read in parallel, and via AWS Athena they can apparently be queried with SQL. This gives us a window into the data via Spark, letting us parse the entire Earth in parallel, which wasn't possible with our XML approach.
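As a rough check of that parallelism, the partition count of the DataFrame's underlying RDD can be inspected (reusing df from the example above):

// ORC stripes become input splits, so a large file should load as
// many partitions, each processed in parallel by Spark.
println(df.rdd.getNumPartitions)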
Questions:
Can the results of a select be converted to an RDD easily?
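If it helps: Dataset exposes an .rdd method, so something like the following (untested against this data) yields an RDD[Row] to map over:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Reusing `res` from the minimal example above.
val rows: RDD[Row] = res.rdd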