derrickoswald / CIMSpark

Spark access to Common Information Model (CIM) files
MIT License

cache CIM classes as pairRDD #2

Open derrickoswald opened 7 years ago


One of the most common operations on a CIM class RDD is to generate a pairRDD for join operations with:

XXX.keyBy (_.id)

It may be advantageous to formalize this use case by storing a pre-keyed pairRDD in the persistent RDD cache pool instead of just the CIM object RDD, since the id (CIM rdf:ID = mRID) is the unique identifier for each CIM object.
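A minimal sketch of what storing and retrieving a pre-keyed RDD by name might look like. The names `Element` (a one-field stand-in for the real CIM class), `KeyedCacheSketch`, `putKeyed` and `getKeyed` are hypothetical and are not part of CIMScala; the lookup through `getPersistentRDDs` is an assumption about how a name-based "get" could work, not a description of the existing code.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical stand-in for a CIM class; the real classes carry many more fields.
final case class Element (id: String)

object KeyedCacheSketch
{
    // Key the RDD by id, name it, and mark it for caching as a pair RDD.
    def putKeyed (rdd: RDD[Element], name: String): RDD[(String, Element)] =
    {
        val keyed = rdd.keyBy (_.id)
        keyed.setName (name)
        keyed.persist (StorageLevel.MEMORY_ONLY)
        keyed
    }

    // Find a persisted RDD by name and view it as a pair RDD keyed by id.
    // The cast is unchecked: the caller must know what was stored under the name.
    def getKeyed (spark: SparkSession, name: String): Option[RDD[(String, Element)]] =
        spark.sparkContext.getPersistentRDDs.values
            .find (_.name == name)
            .map (_.asInstanceOf[RDD[(String, Element)]])
}
```

With something like this in place, a caller that needs the id key can join directly on the result of `getKeyed`, while a caller that only needs the objects can take `.values`.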

Unfortunately, this has pervasive downstream consequences. Each operation to "get" an RDD by name, which is used extensively in CIMScala and dependent code like CIMApplication, would need to be modified to take advantage of this, or to work around it if the keyBy (_.id) is not required.

For example:

val elements = get ("Elements").asInstanceOf[RDD[Element]].keyBy (_.id).join (...

becomes

val elements = get ("Elements").asInstanceOf[RDD[(String, Element)]].join (...

and

val terms = get ("Terminal").asInstanceOf[RDD[Terminal]].keyBy (_.ConductingEquipment).join (...

becomes

val terms = get ("Terminal").asInstanceOf[RDD[(String, Terminal)]].values.keyBy (_.ConductingEquipment).join (...
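To make the two cases concrete, here is a small sketch of a join between terminals and their equipment when both RDDs are cached pre-keyed by id. `Element` and `Terminal` are heavily reduced, hypothetical stand-ins for the CIM classes, and `RekeySketch` and `terminalsWithEquipment` are illustrative names only.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical, heavily reduced stand-ins for the CIM classes.
final case class Element (id: String)
final case class Terminal (id: String, ConductingEquipment: String)

object RekeySketch
{
    // Join each terminal to the element it is attached to,
    // assuming both inputs are already keyed by their own id.
    def terminalsWithEquipment (
        elements: RDD[(String, Element)],
        terminals: RDD[(String, Terminal)]): RDD[(Terminal, Element)] =
    {
        // The element side is already keyed by id, so no keyBy is needed.
        // The terminal side is keyed by the terminal's id, which is the wrong key
        // for this join: drop it and re-key by the equipment reference.
        val byEquipment = terminals.values.keyBy (_.ConductingEquipment)
        byEquipment.join (elements).values
    }
}
```

The element side gets simpler, while the terminal side pays for an extra `.values` before it can be re-keyed, which is the work-around case described above.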

This also affects partitioning. I believe a pairRDD is hash-partitioned by default on the hash code of the pair's first element (the key), and hence caching a pairRDD partitioned this way would trigger a shuffle as objects are moved to the machine that "owns" them.
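The partitioning question can be checked directly. In the sketch below (with a hypothetical one-field `Element` and illustrative names `PartitionSketch`, `keyedOnly`, `keyedAndPartitioned`), `keyBy` on its own is a plain map and leaves the RDD without a partitioner; the shuffle only happens when a partitioner is applied, either explicitly with `partitionBy` or implicitly by the first join.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for a CIM class.
final case class Element (id: String)

object PartitionSketch
{
    // keyBy only wraps each object in a (key, object) pair.
    // The result has no partitioner and nothing is shuffled.
    def keyedOnly (elements: RDD[Element]): RDD[(String, Element)] =
        elements.keyBy (_.id)

    // partitionBy moves each pair to the partition chosen from the key's hash code.
    // This is the step that shuffles; the result records its partitioner.
    def keyedAndPartitioned (elements: RDD[Element], partitions: Int): RDD[(String, Element)] =
        elements.keyBy (_.id).partitionBy (new HashPartitioner (partitions))
}
```

If the partitioned variant were the one cached, later joins that use the same partitioner should be able to reuse that layout for the cached side, so whether to pay for the shuffle up front looks like a separate decision from whether to cache the pair.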

Benchmarks should be performed before and after this change to determine if there is an actual speed improvement with typical use-case scenarios.
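A rough timing helper along these lines might be enough for a first comparison. `BenchSketch` and `time` are hypothetical names, and wall-clock timing of a single run is only a crude measure.

```scala
object BenchSketch
{
    // Run a block once, print the elapsed wall-clock time, and return
    // the block's result together with the elapsed seconds.
    def time[T] (label: String) (block: => T): (T, Double) =
    {
        val start = System.nanoTime ()
        val result = block
        val seconds = (System.nanoTime () - start) / 1e9
        println (f"$label%s: $seconds%.3f s")
        (result, seconds)
    }
}

// Usage, assuming elements: RDD[Element], keyed: RDD[(String, Element)]
// and other: RDD[(String, X)] already exist:
//   BenchSketch.time ("keyBy then join") { elements.keyBy (_.id).join (other).count () }
//   BenchSketch.time ("pre-keyed join")  { keyed.join (other).count () }
```

Because Spark evaluates lazily, the timed block has to end in an action such as `count ()`, and the first run against a cached RDD also includes the cost of populating the cache, so each case should be repeated a few times.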