The current justification for having two separate strategies (a materialized array for `mapValues`, a lazy view for `leftJoin`) is that the join is lightweight, so recomputing it on each access would not be costly. The `mapValues` function, by contrast, could be arbitrarily expensive.
The `VertexSetRDD[VD]` stores the vertex attributes as an `IndexedSeq[VD]`. When a `VertexSetRDD` is first constructed from an `RDD[(Vid,VD)]`, the attributes are stored in an `Array[VD]`. When `mapValues` is invoked on a `VertexSetRDD[VD]`, a new array is created and populated with the result of the map operation: https://github.com/amplab/graphx/blob/master/graph/src/main/scala/org/apache/spark/graph/VertexSetRDD.scala#L129
However, when `leftJoin` is invoked, an `IndexedSeqView` is created: https://github.com/amplab/graphx/blob/master/graph/src/main/scala/org/apache/spark/graph/VertexSetRDD.scala#L192
Should both be implemented using views, or should both be implemented using actual storage? The tradeoffs are the following:
I suspect all the operations should be implemented using views, but I am not sure what the implications are for caching.
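
To make the difference concrete, here is a minimal sketch using plain Scala collections rather than the actual `VertexSetRDD` code. The names `ViewVsStorage`, `expensive`, `eager`, and `viaView` are hypothetical and exist only for this illustration. It counts how often the mapped function runs under each strategy.

```scala
object ViewVsStorage {
  // Counts invocations of the (hypothetical) per-vertex function.
  var calls = 0
  def expensive(x: Int): Int = { calls += 1; x * 10 }

  // Eager strategy, as in mapValues: build a new array up front.
  // Returns (calls after construction, calls after two reads, sum of the reads).
  def eager(attrs: Array[Int]): (Int, Int, Int) = {
    calls = 0
    val out = attrs.map(expensive)   // function runs once per element here
    val built = calls
    val twice = out(0) + out(0)      // plain array reads, no recomputation
    (built, calls, twice)
  }

  // Lazy strategy, as in leftJoin: wrap the array in a mapped view.
  // Returns the same triple as `eager`.
  def viaView(attrs: Array[Int]): (Int, Int, Int) = {
    calls = 0
    val out = attrs.view.map(expensive)  // nothing is evaluated yet
    val built = calls
    val twice = out(0) + out(0)          // each read re-runs the function
    (built, calls, twice)
  }

  def main(args: Array[String]): Unit = {
    println(eager(Array(1, 2, 3)))
    println(viaView(Array(1, 2, 3)))
  }
}
```

With the eager array the function runs once per element at construction time and never again. With the view, construction is free but every read pays for the function again. This bears on the caching question: if the cached object is still a view, each access would presumably recompute unless the view is materialized first (for example with `toArray`).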