amplab / graphx

Former GraphX development repository. GraphX has been merged into Apache Spark; please submit pull requests there.
https://github.com/apache/spark
Apache License 2.0
357 stars 103 forks source link

Lazy evaluation of join and map operations #32

Open jegonzal opened 10 years ago

jegonzal commented 10 years ago

The VertexSetRDD[VD] stores the vertex attributes as an IndexedSeq[VD]. When a VertexSetRDD is first constructed from an RDD[(Vid,VD)] the attributes are stored in an Array[VD]. When mapValues is in invoked on a VertexSetRDD[VD] a new array is created and populated with the result of the map operation.

https://github.com/amplab/graphx/blob/master/graph/src/main/scala/org/apache/spark/graph/VertexSetRDD.scala#L129

However when leftJoin is invoked an IndexedSeqView is created:

https://github.com/amplab/graphx/blob/master/graph/src/main/scala/org/apache/spark/graph/VertexSetRDD.scala#L192

Should both be implemented using views or should both be implemented using actual storage. The tradeoffs are the following:

  1. Using views means that long chains of computation might be invoked repeatedly.
  2. Using Arrays could lead to many long-lived allocations.

I suspect all the operations should be implemented using the view but I am not sure what the implications are for caching.

jegonzal commented 10 years ago

The current justification for two separate strategies is that the join operation is "light weight" and so recomputing it would not be costly. Alternatively, the mapValues operation could be arbitrarily expensive.