Open velvia opened 9 years ago
Link expired?
@velvia still expired
Lol. Still expired ...
I finally found a live link - though not sure how much longer this will be up too. Download the PDF while you can. https://code.google.com/p/supersonic/downloads/list
So, Supersonic is C++. There is also Apache Drill, though that one is JVM-based.
I think in the short term, playing with Spark's Catalyst optimizer to get columnar, or at least vector-wise, execution is the best bet. Here is a video:
http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/
Some thoughts:

- `RDD[Segment]`
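The `RDD[Segment]` idea can be sketched as a toy (the `Segment` type here is made up, not FiloDB's actual class): treat each segment as a chunk of primitive column vectors, and execute an aggregate one vector at a time instead of one row at a time.

```scala
// Toy sketch of vector-wise execution (hypothetical Segment type):
// a Segment holds columnar chunks; an aggregate runs a tight loop over
// each primitive array instead of materializing rows.
case class Segment(columns: Map[String, Array[Int]])

def sumColumn(segments: Seq[Segment], col: String): Long =
  segments.iterator.map { seg =>
    val vec = seg.columns(col)  // one whole column vector per segment
    var acc = 0L
    var i = 0
    while (i < vec.length) { acc += vec(i); i += 1 }
    acc
  }.sum
```

The per-segment `while` loop over a primitive array is what columnar/vector-wise execution buys you: no per-row object allocation or virtual dispatch in the hot loop.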
More notes on where in the Spark codebase to look for SQL optimizer stages (Spark 1.5.x):

- A `LogicalPlan` is tied to each new `DataFrame` instance (see `LogicalPlan.scala` and `DataFrame.logicalPlan`)
- `org.apache.spark.sql.catalyst.analysis.Analyzer` goes over the `LogicalPlan`, resolves references, and produces another `LogicalPlan`
- `org.apache.spark.sql.catalyst.optimizer.Optimizer` optimizes the `LogicalPlan`
- `SparkPlanner` uses various `SparkStrategies` to convert the `LogicalPlan` into a `SparkPlan`. See the `org.apache.spark.sql.execution` package and `SparkStrategies.{LeftSemiJoin, CanBroadcast, EquiJoinSelection}`
- See `DataSourceStrategy` for how pushdown predicates are implemented
- The `SparkPlan`'s `execute()` method is called, which returns an `RDD[InternalRow]`
- Custom execution strategies can be inserted -- see the `SQLContext.experimental` variable
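The stages above can be modeled with a tiny hand-rolled sketch (none of these are Spark's actual classes): a `LogicalPlan` tree, one optimizer rule that pushes a `Filter` below a `Project` (the kind of rewrite Catalyst's `Optimizer` batches apply), and a strategy-like function that converts the optimized logical plan into a physical plan description.

```scala
// Toy model of the Catalyst pipeline: logical plan -> optimizer rule ->
// physical plan. All names here are made up for illustration.
sealed trait LogicalPlan
case class Scan(table: String) extends LogicalPlan
case class Filter(pred: String, child: LogicalPlan) extends LogicalPlan
case class Project(cols: Seq[String], child: LogicalPlan) extends LogicalPlan

// One optimizer rule, applied recursively: move Filter below Project,
// analogous to a Catalyst Rule[LogicalPlan] in an Optimizer batch.
def pushDownFilter(plan: LogicalPlan): LogicalPlan = plan match {
  case Filter(p, Project(cols, child)) => Project(cols, pushDownFilter(Filter(p, child)))
  case Filter(p, child)                => Filter(p, pushDownFilter(child))
  case Project(cols, child)            => Project(cols, pushDownFilter(child))
  case other                           => other
}

// A SparkStrategy analogue: LogicalPlan => physical plan (a string here,
// standing in for a SparkPlan tree).
def planPhysical(plan: LogicalPlan): String = plan match {
  case Scan(t)            => s"FullScan($t)"
  case Filter(p, child)   => s"FilterExec($p, ${planPhysical(child)})"
  case Project(cs, child) => s"ProjectExec(${cs.mkString(",")}, ${planPhysical(child)})"
}
```

For example, `Filter("age > 21", Project(Seq("name", "age"), Scan("users")))` optimizes to a plan where the filter sits directly on the scan, which is the shape `DataSourceStrategy`-style predicate pushdown then exploits.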
Changing the optimizer steps might require a custom optimizer and a custom SQLContext/QueryExecution class.
A current Spark ticket for pushing down aggregations into DataSources:
https://issues.apache.org/jira/browse/SPARK-12449
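As a rough illustration of what pushing an aggregation into a data source means (hypothetical traits, not the API proposed in SPARK-12449): a source that advertises an aggregation capability gets the aggregate shipped to it, otherwise the rows are scanned and aggregated on the Spark side.

```scala
// Hypothetical sketch of aggregate pushdown. The trait names are invented;
// the real proposal lives in the SPARK-12449 discussion.
trait DataSource { def scan(): Seq[Int] }
trait AggregatePushdown { def sum(): Long }  // capability marker

def executeSum(source: DataSource): Long = source match {
  case agg: AggregatePushdown => agg.sum()         // aggregate server-side
  case plain => plain.scan().foldLeft(0L)(_ + _)   // fall back: aggregate in Spark
}
```

The point is the capability check: the planner only rewrites the plan when the source can actually do the work, which is how Druid, Magellan, and the HBase connector approach it.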
See Santiago's comment right above mine for links to how the Druid, Magellan, HBase, and other folks are modifying Spark Catalyst plans to get aggregation done on the server side.
https://slack-files.com/files-pri-safe/T03BMF0R2-F0A3LCQ3C/api-presentation_1_.pdf?c=1441299236-4641d956f1354dd200dd184c1f1fc76fc59b9d2c