Sveino / Inst4CIM-KG

Instance of CIM Knowledge Graph
Apache License 2.0

Duplication Between Ontologies #5

Open VladimirAlexiev opened 2 months ago

VladimirAlexiev commented 2 months ago

Common terms are duplicated many times between ontologies. See the detailed analysis in rdfs-improvement/README and the files mentioned below. These are just the counts:

wc -l *.txt
   882 duplicated-definitions.txt
   875 duplicated-terms.txt
  7268 terms-uniq.txt

The problem is pervasive: 12% of terms are duplicated (875 out of 7268). The most "popular" terms are duplicated 28 times:

sort -rn duplicated-terms.txt |head -10
     28 cim:String
     28 cim:Date
     24 cim:IdentifiedObject.mRID
     24 cim:IdentifiedObject
     23 cim:Float
     22 cim:IdentifiedObject.name
     21 cim:UnitSymbol
     21 cim:UnitMultiplier
     21 cim:DateTime
     21 cim:Boolean

It's not only about primitives and other meta-terms. Electrical terms are also duplicated, e.g.:

     15 cim:ActivePower
     15 cim:ActivePower.multiplier
     15 cim:ActivePower.unit
     15 cim:ActivePower.value
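As a rough illustration of how counts like these can be derived (not necessarily how the files above were produced), here is a minimal Python sketch. The function name `duplicated_terms` and the assignment of terms to the EQ, EQBD and TP ontologies are made up for the example.

```python
from collections import Counter


def duplicated_terms(terms_by_ontology):
    """Return {term: n} for every term defined in more than one ontology."""
    counts = Counter(
        term
        for terms in terms_by_ontology.values()
        for term in set(terms)  # count each term at most once per ontology
    )
    return {term: n for term, n in counts.items() if n > 1}


# Hypothetical toy input: which ontology defines which term is an assumption.
example = {
    "EQ": ["cim:String", "cim:ActivePower", "cim:ConnectivityNode"],
    "EQBD": ["cim:String", "cim:ConnectivityNode"],
    "TP": ["cim:String", "cim:ActivePower", "cim:ACDCTerminal"],
}

dups = duplicated_terms(example)

# Print in the same "count term" layout as the listings above, highest first.
for term, n in sorted(dups.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{n:7d} {term}")

# Share of duplicated terms, using the line counts reported in this issue.
share = 875 / 7268
print(f"{share:.1%}")
```

A term that appears only once (here `cim:ACDCTerminal`) is dropped from the result, which corresponds to the difference between the full term list and the duplicated-term list.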

What's the problem:

VladimirAlexiev commented 2 months ago

Eliminating duplication would require proper modularization, i.e. the creation of more ontology files: CIM datatypes, CIM core, CGMES core, etc.

griddigit-ci commented 2 months ago

Here we need to see how the RDFS files use a subset of the datatypes vocabulary. It is currently duplicated because we lack linking mechanisms and vendors preferred things to be self-contained.

VladimirAlexiev commented 2 months ago

Linking is done by owl:imports.

But if the definitions are identical and the ontologies won't be loaded in named graphs, then the duplication does no harm.
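To make that reasoning concrete: an RDF graph is a set of triples, so identical definitions collapse when everything is loaded into one graph, while named graphs keep one copy per graph. A minimal sketch in plain Python (no RDF library; the graph names "EQ" and "TP" and the single toy triple are assumptions for illustration):

```python
RDF_TYPE = "rdf:type"
RDFS_CLASS = "rdfs:Class"

# The same definition as it might appear in two ontology files (toy data).
eq = {("cim:ActivePower", RDF_TYPE, RDFS_CLASS)}
tp = {("cim:ActivePower", RDF_TYPE, RDFS_CLASS)}

# Loaded into a single default graph: set union merges the identical triples.
merged = eq | tp

# Loaded into named graphs: each statement is kept once per graph (a quad).
named = {"EQ": eq, "TP": tp}
quads = {(graph, *triple) for graph, triples in named.items() for triple in triples}

print(len(merged), len(quads))
```

If the two definitions differed in any triple (for example a different comment), the merged graph would contain both variants, so the "no harm" argument only holds for truly identical definitions.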

VladimirAlexiev commented 1 month ago

From @Sveino's presentation DX-PROF Balance vs Unbalance.pptx: [image]

We discussed the idea that instead of 20 AP ontologies that define terms multiple times, we can have 40 ontologies that define each term once, with Core, Wires, etc. reused (imported) by EQ, EQBD, etc. This should happen in CIM18 using CimContextor (for vocabulary profiling).

VladimirAlexiev commented 1 month ago

BTW I notice a little redundancy here (e.g. from 61970-600-2_Topology-AP-Voc-RDFS2020_v3-0-0.rdf):

    <rdfs:subClassOf>
      <rdfs:Class rdf:about="http://iec.ch/TC57/CIM100#ACDCTerminal"/>
    </rdfs:subClassOf>

This not only refers to ACDCTerminal, but also specifies its RDF type (which is repeated in the full description of that class). Because every referenced class is redundantly defined in each ontology and formatted Turtle eliminates duplicate triples, this smaller redundancy is not visible in Turtle.

Other references don't have such redundancy, e.g.:

   <rdfs:range rdf:resource="http://iec.ch/TC57/CIM100#ConnectivityNode"/>
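A small sketch of how the nested form could be spotted in an RDF/XML file, using only the Python standard library. This inspects the XML syntax rather than parsing RDF, the helper name `nested_class_refs` is made up, and the enclosing `Terminal` class in the sample is an assumption added to make the snippet a complete document.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
RDF_ABOUT = f"{{{NS['rdf']}}}about"  # attribute name in ElementTree's {uri}local form

# Toy document: only the rdfs:subClassOf part is taken from the file quoted above.
SAMPLE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdfs:Class rdf:about="http://iec.ch/TC57/CIM100#Terminal">
    <rdfs:subClassOf>
      <rdfs:Class rdf:about="http://iec.ch/TC57/CIM100#ACDCTerminal"/>
    </rdfs:subClassOf>
  </rdfs:Class>
</rdf:RDF>
"""


def nested_class_refs(xml_text):
    """List IRIs referenced as a typed node nested inside rdfs:subClassOf."""
    root = ET.fromstring(xml_text)
    nested = root.findall(".//rdfs:subClassOf/rdfs:Class", NS)
    return [el.get(RDF_ABOUT) for el in nested]


refs = nested_class_refs(SAMPLE)
print(refs)
```

Each IRI reported this way carries an extra `rdf:type rdfs:Class` statement at the point of reference; the `rdf:resource` form shown above does not.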

VladimirAlexiev commented 1 week ago

We agreed 2 weeks ago that "ontology modularization" is needed:

@Sveino @griddigit-ci right?

Sveino commented 1 week ago

Yes, this is the way I would like to have it for 61970-501:ED2. But I am not sure if we need to have this done before we can finalize the work.