NICTA / scoobi

A Scala productivity framework for Hadoop.
http://nicta.github.com/scoobi/

Mismatch between in-memory mode and local mode #273

Closed. blever closed this issue 11 years ago

blever commented 11 years ago

Haven't been able to minimise much yet, but the following spec fails:

  "Failure" >> { implicit sc: SC =>

    // dlists(i) holds 1 to 100 shifted by 20 * i, with every element repeated i + 1 times
    val dlists = (0 to 3) map { i => DList(1 to 100).map(_ + (20 * i)).mapFlatten(x => Seq.fill(i + 1)(x)) }
    val (a, b, c, d) = (dlists(0), dlists(1), dlists(2), dlists(3))

    val au = a.distinct
    val bu = b.distinct
    val cu = c.distinct
    val du = d.distinct

    val as = au.size join DList("a")
    val bs = bu.size join DList("b")
    val cs = cu.size join DList("c")
    val ds = du.size join DList("d")

    val x = ((au.map((_, true))) join (bu.map((_, false)))).size join DList("x")
    val y = ((cu.map((_, true))) join (du.map((_, false)))).size join DList("y")

    val acd = au.diff(cu).size join DList("acd")
    val cad = cu.diff(au).size join DList("cad")
    val bdd = bu.diff(du).size join DList("bdd")
    val dbd = du.diff(bu).size join DList("dbd")

    val res = run((as ++ bs ++ cs ++ ds ++ x ++ y ++ acd ++ cad ++ bdd ++ dbd).materialise)

    val expected =
      Map(
        "a"   -> 100,
        "b"   -> 100,
        "c"   -> 100,
        "d"   -> 100,
        "x"   -> 80,
        "y"   -> 80,
        "acd" -> 40,
        "cad" -> 40,
        "bdd" -> 40,
        "dbd" -> 40)

    res.map(_.swap).toMap must_== expected
  }

The expected values are what in-memory mode returns. In local mode the spec fails because the result is different.
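For reference, the expected counts can be re-derived with plain Scala collections, independently of Scoobi. This is only a sketch: the object name `ExpectedCounts` is made up for illustration, and it assumes that a join between two distinct lists keeps exactly the keys present on both sides (a set intersection) and that `diff` behaves like a set difference.

```scala
object ExpectedCounts {
  // Plain collections standing in for the four DLists of the spec:
  // list i holds 1 to 100 shifted by 20 * i, each element repeated i + 1 times.
  val lists: Seq[Seq[Int]] =
    (0 to 3).map(i => (1 to 100).map(_ + 20 * i).flatMap(x => Seq.fill(i + 1)(x)))

  // Equivalent of a.distinct, b.distinct, c.distinct, d.distinct
  val sets: Seq[Set[Int]] = lists.map(_.toSet)
  val au: Set[Int] = sets(0)
  val bu: Set[Int] = sets(1)
  val cu: Set[Int] = sets(2)
  val du: Set[Int] = sets(3)

  // Assumption: joining two distinct lists on their elements yields one row
  // per shared element, so the join size is the size of the intersection.
  val expected: Map[String, Int] = Map(
    "a"   -> au.size,
    "b"   -> bu.size,
    "c"   -> cu.size,
    "d"   -> du.size,
    "x"   -> (au intersect bu).size,
    "y"   -> (cu intersect du).size,
    "acd" -> (au diff cu).size,
    "cad" -> (cu diff au).size,
    "bdd" -> (bu diff du).size,
    "dbd" -> (du diff bu).size
  )

  def main(args: Array[String]): Unit =
    expected.toSeq.sortBy(_._1).foreach(println)
}
```

Under those assumptions the computed map matches the `expected` map in the spec, which suggests the in-memory values are the correct ones.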

I don't think in-memory mode is at fault: removing lines from this example and running on a cluster also gives different results.

I'll see whether the example can be minimised further while still demonstrating a difference between in-memory and local mode.

Cheers!

blever commented 11 years ago

This is off master at 1025c9e2a16d12351bb56892853f61315a04d7ab.

etorreborre commented 11 years ago

I reduced your case to a similar bug with:

def dlist = DList(("a", 1))
val (a, b, c, d) = (dlist, dlist, dlist, dlist)

val x  = (a join b).size
val y  = (c join d).size
val ac = a diff c

persist(x, y, ac)

Running this produces an exception in the logs.