There is currently a long list of 'post-processing' operations (about, normalize, deduplicate, join, topology, edges) performed by the CIMReader after reading in a set of CIM files.
These are currently hard-coded as reader options, so that they can be triggered through the USING ch.ninecode.cim clause when importing CIM files via SQL in Python and R (i.e. the non-compiled API). It would be better if these operations were broken out into separate modules/packages, with a generic mechanism to chain them.
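As a sketch of the status quo, the selection logic amounts to checking a fixed list of boolean options in a hard-wired order. The option keys below are the ones this issue refers to; the dispatch code itself is illustrative, not the actual CIMRelation implementation:

```python
# Fixed, baked-in sequence of post-processing steps, keyed by reader option.
# Adding or reordering a step means changing this code, which is the problem
# this issue describes. (Illustrative only, not the real CIMRelation logic.)
STEPS = [
    ("ch.ninecode.cim.do_about", "about"),
    ("ch.ninecode.cim.do_normalize", "normalize"),
    ("ch.ninecode.cim.do_deduplication", "deduplicate"),
    ("ch.ninecode.cim.do_join", "join"),
    ("ch.ninecode.cim.do_topo", "topology"),
    ("ch.ninecode.cim.make_edges", "edges"),
]

def post_process_steps(options):
    """Return the post-processing steps enabled by the given reader options."""
    return [step for key, step in STEPS if options.get(key) == "true"]
```

Because both the set of steps and their relative order live inside the reader, a user cannot add a step or change the order without recompiling.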
This has implications such as:
how to provision these extra modules (currently the CIMReader artifact on Maven Central has all the functionality baked in) so that --jars or --packages on the spark-shell or spark-submit command line can include the necessary code
how to provide parameters to the post-processing code (e.g. ch.ninecode.cim.do_topo_islands and ch.ninecode.cim.force_retain_fuses for the CIMNetworkTopologyProcessor)
how to specify whether these modules operate on the raw Elements RDD or on the SparkSQL/named RDDs after subsetting
how to inform the CIMRelation code of the post-processing tasks that need to be performed and their ordering
how to allow for user-generated post-processors that are not part of the CIMSpark codebase
what the interface specification(s) between the CIMReader and the post-processing modules should be
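One possible shape for such an interface, purely as a hypothetical sketch (none of these names exist in CIMSpark today): each post-processor declares its scope (raw Elements RDD vs. subset RDDs) and an ordering hint, receives its namespaced parameters at construction, and the reader folds the registered processors over the data. Plain Python lists stand in for RDDs here:

```python
from abc import ABC, abstractmethod

class CIMProcessor(ABC):
    """Hypothetical plug-in interface for a post-processing step."""
    scope = "elements"   # "elements" = raw Elements RDD, "subset" = after subsetting
    order = 100          # lower values run earlier in the chain

    def __init__(self, options):
        # per-processor parameters, e.g. ch.ninecode.cim.force_retain_fuses
        self.options = options

    @abstractmethod
    def process(self, data):
        """Transform the data set (an RDD in the real reader) and return it."""

def run_chain(processors, data, scope):
    """Apply the registered processors for one scope, in declared order."""
    for p in sorted(processors, key=lambda p: p.order):
        if p.scope == scope:
            data = p.process(data)
    return data

# Toy user-supplied processor, to show that third-party code only needs to
# implement the interface and be registered, not live in the CIMSpark codebase.
class Deduplicate(CIMProcessor):
    order = 10

    def process(self, data):
        seen, out = set(), []
        for x in data:
            if x not in seen:
                seen.add(x)
                out.append(x)
        return out
```

A registry keyed by processor name (populated from --jars/--packages provisioned classes) plus option-prefix filtering would then address discovery, parameterization, and ordering in one mechanism.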