emmalanguage / emma

A quotation-based Scala DSL for scalable data analysis.
http://emma-language.org
Apache License 2.0

Add exists() unnesting rule. #4

Open aalexandrov opened 9 years ago

aalexandrov commented 9 years ago

The exists unnesting rule should be added to the normalization engine.

joroKr21 commented 8 years ago

This seems like an old issue that is no longer related. Can we close it?

aalexandrov commented 8 years ago

Yes.

joroKr21 commented 8 years ago

Obsolete due to c3130222e33cc7ea24805d44a6c07dd3f958cac4.

aalexandrov commented 8 years ago

@joroKr21 just to clarify, this is still on the TODO list and did not become obsolete due to c313022, it's just not so easy to add it at the moment as we need to integrate some notion of keys and identity in order to make the transformation safe while keeping it within the Bag (and not the Set) monad.

I hope this gets resolved in the future as part of a different line of work.

joroKr21 commented 8 years ago

Ok, I'm confused, I thought we were talking about exists as a fold.

aalexandrov commented 8 years ago

No, we're talking about rewriting expressions like

val dataEngineers = for {
  s <- students
  if studentCourses.withFilter(_.sid == s.id).exists(_.major == "DataScience")
} yield s

into equivalent expressions (which can be translated to joins) of the form

val dataEngineers = (for {
  s <- students
  c <- studentCourses
  if c.sid == s.id
  if c.major == "DataScience"
} yield s).distinct()

This transformation is sound only if the original outer comprehension produces no duplicates (here, only if students is duplicate-free), since distinct() would otherwise collapse them.

TPC-H Q4 gives a good example of that transformation in SQL (see p. 34 in the TPC-H specification).
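
To illustrate the soundness condition, here is a minimal sketch using plain Scala collections rather than Emma's DataBag. The Student and StudentCourse case classes, the ExistsUnnesting object, and the sample data are assumptions made for this illustration. Seq stands in for a bag because it keeps duplicates, and filter replaces withFilter because the standard library's withFilter result has no exists method.

```scala
// Hypothetical record types; the field names follow the comprehensions above.
case class Student(id: Int, name: String)
case class StudentCourse(sid: Int, major: String)

object ExistsUnnesting {
  // Nested form: the exists() predicate is evaluated once per student.
  def nested(students: Seq[Student], studentCourses: Seq[StudentCourse]): Seq[Student] =
    for {
      s <- students
      if studentCourses.filter(_.sid == s.id).exists(_.major == "DataScience")
    } yield s

  // Unnested form: a join followed by duplicate elimination.
  def unnested(students: Seq[Student], studentCourses: Seq[StudentCourse]): Seq[Student] =
    (for {
      s <- students
      c <- studentCourses
      if c.sid == s.id
      if c.major == "DataScience"
    } yield s).distinct

  // Sample data (made up). Student 1 has two matching course rows,
  // so the join alone would emit that student twice.
  val courses = Seq(
    StudentCourse(1, "DataScience"),
    StudentCourse(1, "DataScience"),
    StudentCourse(2, "Math")
  )
  val noDups   = Seq(Student(1, "Ann"), Student(2, "Bob"))
  val withDups = Seq(Student(1, "Ann"), Student(1, "Ann"), Student(2, "Bob"))
}
```

With noDups as the outer input, both forms return the same bag. With withDups, the nested form keeps both copies of the duplicated student while the unnested form returns only one, so the rewrite changes the result.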

joroKr21 commented 8 years ago

In this case we should reopen.

ggevay commented 6 years ago

> it's just not so easy to add it at the moment as we need to integrate some notion of keys and identity in order to make the transformation safe while keeping it within the bag (and not set) monad.

In the meantime, we have the field annotation pk, which is exactly this, if I understand correctly, right?

aalexandrov commented 6 years ago

Yes, the next step is to develop an analysis pass over the Emma Core representation that infers key constraints for intermediate results from their inputs. This is actually the main part of the work needed to close this issue, as we need this information in order to decide whether the rewrite is sound. Implementing the actual rewrite can probably be done in a day or two.
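
As a rough sketch of what such an analysis might look like, the snippet below propagates a "known to be duplicate-free" property bottom-up through a toy operator tree. Everything here is hypothetical: the Plan type and its cases are not the Emma Core representation, and the hasPk flag is only a stand-in for the pk annotation mentioned above. The analysis is deliberately conservative, answering false whenever uniqueness cannot be established.

```scala
// Hypothetical toy IR for illustration only (not Emma Core).
sealed trait Plan
case class Source(name: String, hasPk: Boolean) extends Plan // hasPk: elements carry a primary key
case class Filter(in: Plan) extends Plan                     // predicate elided
case class Project(in: Plan, keepsKey: Boolean) extends Plan // keepsKey: the key field survives
case class Distinct(in: Plan) extends Plan
case class Union(left: Plan, right: Plan) extends Plan       // bag union

object KeyInference {
  // True only if the result is known to contain no duplicates.
  def duplicateFree(p: Plan): Boolean = p match {
    case Source(_, hasPk)      => hasPk                        // a key implies unique elements
    case Filter(in)            => duplicateFree(in)            // filtering cannot add duplicates
    case Project(in, keepsKey) => keepsKey && duplicateFree(in)
    case Distinct(_)           => true
    case Union(_, _)           => false                        // conservative: inputs may overlap
  }

  // The exists() unnesting would be applied only when the outer input is duplicate-free.
  def canUnnestExists(outer: Plan): Boolean = duplicateFree(outer)
}
```

For example, KeyInference.canUnnestExists(Filter(Source("students", hasPk = true))) holds, while the same check on a Union of two sources does not, so the rewrite would be skipped there.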