emmalanguage / emma

A quotation-based Scala DSL for scalable data analysis.
http://emma-language.org
Apache License 2.0
63 stars 19 forks source link

Extended IR #147

Closed aalexandrov closed 7 years ago

aalexandrov commented 8 years ago

This is a meta-issue that should track the planned changes in the IR.

Throughout the discussion in this issue, we will use the following schema based on the MostConsistentClassifier workflow:

// product types
case class Profile(id: Long, name: String, surname: String)
case class Email(id: Long, ip: Int, content: String)
case class Server(ip: IP, isBlacklisted: Boolean)
case class Classifier(model: Long, weights: SparseVector[Double])

Motivation

Most of the transformation-based optimizations in Emma can only be applied if certain conditions apply. Deciding whether the conditions are satisfied typically involves some form of dataflow analysis on top of lifted code.

Example: Fold-Group Fusion

Consider the following code that produces two aggregates over the emails grouped by their ip value:

for {
  g <- emails.groupBy(_.ip)
} yield {
  val agg1 = g.values.sum()
  val agg2 = g.values.map(_.content.size()).sum()
  (g.key, agg1, agg2)
}

The code is eligible to fold-group fusion (FGF) as all g.values uses appear within the context of chain of (zero or more) map applications ending in a fold.

While (FGF) of the variation above works at the moment, the following variations will not be considered eligible, as at the moment there is no direct and simple way to consider constant propagation in the Scala expressions referenced in comprehended terms:

// structural typing on the argument
for {
  (key, values) <- emails.groupBy(_.ip)
} yield {
  val agg1 = values.sum()
  val agg2 = values.map(_.content.size()).sum()
  (key, agg1, agg2)
}

// projection aliases
for {
  g <- emails.groupBy(_.ip)
} yield {
  val values = g.values
  val agg1 = values.sum()
  val agg2 = values.map(_.content.size()).sum()
  (g.key, agg1, agg2)
}

Example: Join Ordering

Another scenario where a better IR might be beneficial is join ordering. Consider the following code

for {
  e <- emails
  s <- people
  r <- people
  if from(e.content) == s.email
  if to(e.content) == s.email
} yield {
  val x1 = e1 // depends on e and s
  val x2 = e2 // depends on x1 and r
  f(x2)
}

In order to reduce the amount of communication cost when this expression is compiled to a join cascade, we might want to apply x1 right after we join e and s, prune all extra fields, and only after that join with r.

In effect, this means that the head expressions is split into multiple parts and parts of it might be applied on top of a rhs of aggregator during the combinator rewrite phase.

Approach

In order to make the compiler oblivious to such code modifications, I propose to extend the current comprhension IR in the following strategy.

The refactoring of the code should be based on the following refactoring plan. Each item will be tracked by a separate issue:

joroKr21 commented 8 years ago

Actually we could use a somewhat hacky solution to bring the separate hierarchies together. Since Scala's Constant(Literal(_)) is polymorphic we could wrap the comprehension nodes in constants and then use a modified form of traversal that works on union types (or possibly do 2 passes - one for Scala nodes and one for Emma nodes).

joroKr21 commented 7 years ago

I think we can close this now

aalexandrov commented 7 years ago

Alright. I've linked the issue on the wiki for future reference.