One of the most common operations on a CIM class RDD is to generate a pairRDD for join operations with:
XXX.keyBy (_.id)
It may be advantageous to formalize this use case by storing pre-keyed pairRDDs in the persistent RDD cache pool instead of plain CIM object RDDs, since the id (CIM rdf:ID = mRID) is the unique identifier for each CIM object.
Unfortunately, this has pervasive downstream consequences. Each operation to "get" an RDD by name, which is used extensively in CIMScala and in dependent code like CIMApplication, would need to be modified to take advantage of this, or to work around it where the keyBy (_.id) is not required.
For example:
val elements = get ("Elements").asInstanceOf[RDD[Element]].keyBy (_.id).join (...
becomes
val elements = get ("Elements").asInstanceOf[RDD[(String, Element)]].join (...
and
val terms = get ("Terminal").asInstanceOf[RDD[Terminal]].keyBy (_.ConductingEquipment).join (...
becomes
val terms = get ("Terminal").asInstanceOf[RDD[(String, Terminal)]].values.keyBy (_.ConductingEquipment).join (...
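This also affects partitioning. I believe that the hash code of the pair's first element (the key) is what the default partition function (HashPartitioner) uses for a pairRDD. keyBy on its own does not set a partitioner, so the shuffle would come either from an explicit partitionBy before caching or from the first join; in both cases objects are coalesced into the partition (and hence the machine) that "owns" their key. If the cached pairRDD keeps that partitioner, later joins on id should not need to shuffle that side again.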
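To make the partitioning point concrete, here is a minimal sketch in plain Scala (no Spark dependency) of how a hash partitioner maps a key to a partition: the key's hashCode taken modulo the partition count and made non-negative, which is what Spark's HashPartitioner does. The Element case class, the ids, and the partition count are made up for illustration and are not the CIMScala classes.

```scala
// Sketch only: mimics hash partitioning by key on a local collection.
object HashPartitionSketch {
  // Hypothetical stand-in for a CIM object; only the id matters here.
  final case class Element(id: String)

  // Partition index for a key: hashCode modulo the partition count,
  // shifted to be non-negative because hashCode may be negative.
  // A null key goes to partition 0.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _ =>
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 4 // assumed partition count
    // Made-up ids, standing in for rdf:ID / mRID values.
    val elements = Seq(Element("LINE_1"), Element("SWITCH_7"), Element("TX_42"))
    // Local equivalent of keyBy (_.id).
    val keyed = elements.map(e => (e.id, e))
    keyed.foreach { case (id, _) =>
      println(s"$id -> partition ${partitionFor(id, numPartitions)}")
    }
  }
}
```

Because the partition depends only on the key, two pairRDDs partitioned this way with the same partition count place equal ids in the same partition index, which is why a pre-partitioned cached pairRDD could let a join on id avoid moving that side.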
Benchmarks should be performed before and after this change to determine if there is an actual speed improvement with typical use-case scenarios.