We also need a method to output the Srdd before the groupBy step.
As an equivalent to

```scala
srdd.smvSingleCDSGroupBy('k)(TimeInLastN('t, 3))((Sum('v) as 'nv1), (Count('v) as 'nv2))
```

the client code could also be written as

```scala
srdd.smvApplyCDS('k)(TimeInLastN('t, 3)).groupBy('k, 't)(
  Sum('v) as 'nv1,
  Count('v) as 'nv2)
```
The benefit of doing so is that we can chain multiple CDSs together, as in the following:

```scala
srdd.
  smvApplyCDS('k)(TimeInLastN('t, 3)).
  smvApplyCDS('k, 't)(PivotCDS(Seq('p), Seq('v), Seq("pv1", "pv2"))).
  groupBy('k, 't)(
    Sum('v_pv1) as 'v_pv1,
    CountDistinct('v_pv2) as 'dist_cnt_v_pv2)
```
where `PivotCDS` is a CDS object for the pivot operation. It needs to be implemented as a separate issue.
Implemented a complete new design of CDS.
Requirement
Generalizing from running sums and similar requirements, we identified the need for CDS. Using the running sum as an example, the client code looks like the sketch below.
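This is a minimal sketch, reusing the `TimeInLastN` CDS and the aliases from the comment above; for each record it computes the sum and count of 'v over the records in the last 3 time units of 't within the same 'k group:

```scala
// Running sum and running count of 'v over the last 3 time units of 't,
// computed independently within each group keyed by 'k.
srdd.smvSingleCDSGroupBy('k)(TimeInLastN('t, 3))(
  Sum('v) as 'nv1,
  Count('v) as 'nv2)
```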
where `smvSingleCDSGroupBy` is declared as follows.
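A plausible shape of the declaration (a sketch only; the parameter names `keys` and `cds` are assumptions, and the types are those of the Spark 1.x SchemaRDD era):

```scala
import org.apache.spark.sql.catalyst.expressions.NamedExpression

// Sketch: group by `keys`, apply the CDS within each group, then aggregate.
def smvSingleCDSGroupBy(keys: Symbol*)(cds: SmvCDS)(aggrExprs: NamedExpression*): SchemaRDD
```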
The `aggrExprs` argument is a list of `AggregationExpression`s with aliases (the same aliasing that `groupBy` supports).

We will also implement custom Catalyst expressions to integrate `SmvCDS` into expressions; the client code could then look like the sketch below.
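One possible shape of that client code; the `from` operator here is hypothetical, shown only to illustrate attaching an `SmvCDS` to individual aggregate expressions so that different aggregates can select different record subsets in one pass:

```scala
// Hypothetical syntax: each aggregate is qualified by its own CDS.
srdd.groupBy('k)(
  (Sum('v) from TimeInLastN('t, 3)) as 'nv1,
  (Count('v) from TimeInLastN('t, 6)) as 'nv2)
```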
Implementation

From a pure relational algebra angle, a CDS defines a self-join: each record is paired with the subset of records the CDS selects for it (for `TimeInLastN('t, 3)`, the records whose 't falls within the 3 time units ending at that record's 't), and the aggregations then run over each record's selected subset.
Since a CDS is always used in a grouped (by a set of keys) environment, the self-join can be optimized into a local operation. There are 2 approaches to optimizing the self-join:

- use `groupByKey` to force the data partitioning, create a local iterator to represent each group's data, and implement the self-join with Scala collection operations (see the sketch after this list);
- express the CDS as an actual self-join and rely on SparkSQL's own join optimization.

Since it is possible that the SparkSQL team will implement the join optimization in the future, we will implement the first approach in the short term and migrate to the 2nd approach in the future.
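A minimal sketch of the first approach on a plain Scala collection, assuming `TimeInLastN('t, n)` selects the records whose 't lies in the n time units ending at the anchor record's 't; the names `Rec`, `timeInLastN`, and `runningAgg` are made up for illustration, and the real implementation would operate on Catalyst rows from each group's iterator:

```scala
case class Rec(t: Int, v: Double)

// The CDS as a local operation: for an anchor record, return the records of
// the same group that fall in the last `n` time units ending at anchor.t.
def timeInLastN(n: Int)(anchor: Rec, group: Seq[Rec]): Seq[Rec] =
  group.filter(r => r.t > anchor.t - n && r.t <= anchor.t)

// The "self-join + aggregate" done with Scala collection operations:
// each record is paired with its selected subset, which is then aggregated.
def runningAgg(group: Seq[Rec]): Seq[(Int, Double, Int)] =
  group.map { anchor =>
    val selected = timeInLastN(3)(anchor, group)
    (anchor.t, selected.map(_.v).sum, selected.size) // ('t, Sum('v), Count('v))
  }

// After groupByKey on 'k, each group's local iterator would be processed like:
val oneGroup = Seq(Rec(1, 10.0), Rec(2, 5.0), Rec(4, 2.0))
runningAgg(oneGroup).foreach(println)
// prints (1,10.0,1), (2,15.0,2), (4,7.0,2)
```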