Open velvia opened 9 years ago
Link expired?
@velvia still expired
Lol. Still expired ...
I finally found a live link - though not sure how much longer this will be up too. Download the PDF while you can. https://code.google.com/p/supersonic/downloads/list
So, Supersonic is C++. There is also Apache Drill, though that one is JVM-based.
I think in the short term, playing with Spark's Catalyst optimizer to get columnar, or at least vector-wise, execution is the best bet. Here is a video:
http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/
Some thoughts:

- `RDD[Segment]`
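The `RDD[Segment]` idea can be sketched as a toy (the `Segment` type here is made up, not FiloDB's actual class): treat each segment as a chunk of primitive column vectors, and execute an aggregate one vector at a time instead of one row at a time.

```scala
// Toy sketch of vector-wise execution (hypothetical Segment type):
// a Segment holds columnar chunks; an aggregate runs a tight loop over
// each primitive array instead of materializing rows.
case class Segment(columns: Map[String, Array[Int]])

def sumColumn(segments: Seq[Segment], col: String): Long =
  segments.iterator.map { seg =>
    val vec = seg.columns(col)  // one whole column vector per segment
    var acc = 0L
    var i = 0
    while (i < vec.length) { acc += vec(i); i += 1 }
    acc
  }.sum
```

The per-segment `while` loop over a primitive array is what columnar/vector-wise execution buys you: no per-row object allocation or virtual dispatch in the hot loop.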
More notes on where in the Spark codebase to look for SQL optimizer stages (Spark 1.5.x):

- A `LogicalPlan` is tied to each new `DataFrame` instance (see `LogicalPlan.scala` and `DataFrame.logicalPlan`)
- `org.apache.spark.sql.catalyst.analysis.Analyzer` goes over the `LogicalPlan`, resolves references, and produces another `LogicalPlan`
- `org.apache.spark.sql.catalyst.optimizer.Optimizer` optimizes the `LogicalPlan`
- `SparkPlanner` uses various `SparkStrategies` to convert the `LogicalPlan` into a `SparkPlan`. See the `org.apache.spark.sql.execution` package and `SparkStrategies.{LeftSemiJoin, CanBroadcast, EquiJoinSelection}`
- See `DataSourceStrategy` for how pushdown predicates are implemented
- The `SparkPlan`'s `execute()` method is called, which returns an `RDD[InternalRow]`
- Custom execution strategies can be inserted -- see the `SQLContext.experimental` variable
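The stages above can be modeled with a tiny hand-rolled sketch (none of these are Spark's actual classes): a `LogicalPlan` tree, one optimizer rule that pushes a `Filter` below a `Project` (the kind of rewrite Catalyst's `Optimizer` batches apply), and a strategy-like function that converts the optimized logical plan into a physical plan description.

```scala
// Toy model of the Catalyst pipeline: logical plan -> optimizer rule ->
// physical plan. All names here are made up for illustration.
sealed trait LogicalPlan
case class Scan(table: String) extends LogicalPlan
case class Filter(pred: String, child: LogicalPlan) extends LogicalPlan
case class Project(cols: Seq[String], child: LogicalPlan) extends LogicalPlan

// One optimizer rule, applied recursively: move Filter below Project,
// analogous to a Catalyst Rule[LogicalPlan] in an Optimizer batch.
def pushDownFilter(plan: LogicalPlan): LogicalPlan = plan match {
  case Filter(p, Project(cols, child)) => Project(cols, pushDownFilter(Filter(p, child)))
  case Filter(p, child)                => Filter(p, pushDownFilter(child))
  case Project(cols, child)            => Project(cols, pushDownFilter(child))
  case other                           => other
}

// A SparkStrategy analogue: LogicalPlan => physical plan (a string here,
// standing in for a SparkPlan tree).
def planPhysical(plan: LogicalPlan): String = plan match {
  case Scan(t)            => s"FullScan($t)"
  case Filter(p, child)   => s"FilterExec($p, ${planPhysical(child)})"
  case Project(cs, child) => s"ProjectExec(${cs.mkString(",")}, ${planPhysical(child)})"
}
```

For example, `Filter("age > 21", Project(Seq("name", "age"), Scan("users")))` optimizes to a plan where the filter sits directly on the scan, which is the shape `DataSourceStrategy`-style predicate pushdown then exploits.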
Changing the optimizer steps might require a custom optimizer and a custom SQLContext/QueryExecution class.
A current Spark ticket for pushing down aggregations into DataSources:
https://issues.apache.org/jira/browse/SPARK-12449
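As a rough illustration of what pushing an aggregation into a data source means (hypothetical traits, not the API proposed in SPARK-12449): a source that advertises an aggregation capability gets the aggregate shipped to it, otherwise the rows are scanned and aggregated on the Spark side.

```scala
// Hypothetical sketch of aggregate pushdown. The trait names are invented;
// the real proposal lives in the SPARK-12449 discussion.
trait DataSource { def scan(): Seq[Int] }
trait AggregatePushdown { def sum(): Long }  // capability marker

def executeSum(source: DataSource): Long = source match {
  case agg: AggregatePushdown => agg.sum()         // aggregate server-side
  case plain => plain.scan().foldLeft(0L)(_ + _)   // fall back: aggregate in Spark
}
```

The point is the capability check: the planner only rewrites the plan when the source can actually do the work, which is how Druid, Magellan, and the HBase connector approach it.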
See Santiago's comment right above mine for links to how the Druid, Magellan, HBase, and other folks are modifying Spark Catalyst plans to get aggregation done on the server side.
https://slack-files.com/files-pri-safe/T03BMF0R2-F0A3LCQ3C/api-presentation_1_.pdf?c=1441299236-4641d956f1354dd200dd184c1f1fc76fc59b9d2c