Add Datatypes To Instance Data

VladimirAlexiev commented 2 months ago

In CGMES instance data, all literals are strings, but they should be marked with the appropriate datatype.

E.g. cim:ACDCConverter.baseS should be marked ^^xsd:float.
Otherwise sorting won't work (values are compared as strings) and range queries will be slower.
This pertains to boolean, dateTime, float, gMonthDay and integer, as string is the default datatype.
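
To illustrate the sorting problem, here is a minimal Python sketch. It is only an analogy for what happens when numeric values are compared as plain strings; the sample values are made up and do not come from any CGMES file.

```python
# Hypothetical baseS values, as they look when every literal is a plain string.
values_as_strings = ["100.0", "25.5", "9.0", "1200.0"]

# Without a numeric datatype, ordering is lexical (character by character).
lexical_order = sorted(values_as_strings)

# With a numeric datatype, ordering follows the numeric value.
numeric_order = sorted(values_as_strings, key=float)

print(lexical_order)  # ['100.0', '1200.0', '25.5', '9.0']
print(numeric_order)  # ['9.0', '25.5', '100.0', '1200.0']
```

The lexical order puts "1200.0" before "25.5", which is the kind of result to expect from ordering untyped literals.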

This query counts props by XSD datatype:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select ?range (count(*) as ?c) {
   ?x rdfs:range ?range
    filter(strstarts(str(?range), str(xsd:)))
} group by ?range order by ?range

Here are the current results; the query should be rerun after fixes to the ontology (see column "comment"):

range	c	comment
xsd:boolean	218	Inflated because meta-data props are duplicated, and many are boolean
xsd:dateTime	5
xsd:decimal	1
xsd:float	310	Deflated because e.g. `cim:ActivePower.value` may be used by hundreds of "real" props
xsd:gMonthDay	2
xsd:integer	36
xsd:string	51

I have a tentative SPARQL Update, but need to revise it.

griddigit-ci commented 2 months ago

We need to discuss whether we have concerns related to file size. It would be very good to have explicit datatypes in the instance data, as this removes the need for post-processing and for mappings at parsing time to enable SHACL validation of datatypes. Is it common to assume xsd:string, so that we do not need to declare that one? How will JSON-LD deal with it? I think we can manage this with some sort of context so that we do not have repetitions in the serialisations.

VladimirAlexiev commented 2 months ago

@griddigit-ci @Sveino

I don't think you should worry about file size, because when files get big they are zipped, and compression will reduce such datatypes to a few bits per literal.
Curiously, integer and decimal take less space in Turtle than an equivalent string because the string needs quotes (1.23 is a decimal, whereas "1.23" is the corresponding string). But xsd:float always needs the datatype to be specified.
xsd:string is the default by spec, so "foo" and "foo"^^xsd:string mean the same. Whether a repo stores it internally without a datatype or with that default datatype doesn't matter.
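
A rough way to check the compression argument is sketched below in Python with the standard zlib module. The subject names (ex:c0, ex:c1, ...) and the random values are assumptions made up for this illustration, and zlib stands in for whatever zip tool is actually used, so the numbers are indicative only.

```python
import random
import zlib

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical data: 10,000 made-up converters with random baseS values.
values = [f"{random.uniform(0, 1000):.2f}" for _ in range(10_000)]
plain = "\n".join(
    f'ex:c{i} cim:ACDCConverter.baseS "{v}" .' for i, v in enumerate(values)
)
typed = "\n".join(
    f'ex:c{i} cim:ACDCConverter.baseS "{v}"^^xsd:float .' for i, v in enumerate(values)
)

# Extra bytes caused by the datatype suffix, before and after compression.
raw_overhead = len(typed) - len(plain)
zipped_overhead = len(zlib.compress(typed.encode())) - len(zlib.compress(plain.encode()))

print("extra bytes per literal, uncompressed:", raw_overhead / len(values))
print("extra bytes per literal, compressed:  ", zipped_overhead / len(values))
```

Uncompressed, the suffix ^^xsd:float costs 11 bytes per literal; the compressed figure depends on the data and the compressor, so it is best measured on real files.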

About JSON-LD: #55

VladimirAlexiev commented 1 month ago

Done, see https://github.com/Sveino/Inst4CIM-KG/blob/develop/rdf-improved/fix-datatypes.ru

Sveino commented 4 weeks ago

We cannot really do anything on this for CIMXML, so this must be addressed as part of JSON-LD. This was discussed in https://github.com/3lbits/CIM4NoUtility/issues/278. I agree with Vladimir's comment regarding sizing; the support for zip is an issue we need to discuss as well. The thinking for JSON-LD was that this information is derived from the profile, but it must be 1.23 and not "1.23" for a float. My understanding now is that when we import the CIM XML we will run the script above that will add the correct Datatype for the instance data?

VladimirAlexiev commented 3 weeks ago

We cannot really do anything on this for CIMXML

Why not? If we do, it'll just add attribute rdf:datatype to every value. Will that break any software? Or do you mean that you cannot demand it in the spec?
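
As a minimal sketch of what adding rdf:datatype would look like, the Python below uses only the standard library. The CIM namespace URI, the sample document and the DATATYPES map are placeholders invented for this illustration; a real map would be derived from the profile or ontology.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CIM = "http://example.org/cim#"  # placeholder, not the real CIM namespace
XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical property-to-datatype map (assumption for this sketch).
DATATYPES = {f"{{{CIM}}}ACDCConverter.baseS": XSD + "float"}

xml_in = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:cim="{CIM}">
  <cim:ACDCConverter rdf:ID="_c1">
    <cim:ACDCConverter.baseS>100.0</cim:ACDCConverter.baseS>
  </cim:ACDCConverter>
</rdf:RDF>"""

root = ET.fromstring(xml_in)
for elem in root.iter():
    datatype = DATATYPES.get(elem.tag)
    if datatype is not None:
        # Add rdf:datatype to every property element that has a known datatype.
        elem.set(f"{{{RDF}}}datatype", datatype)

# Keep the usual prefixes when serializing.
ET.register_namespace("rdf", RDF)
ET.register_namespace("cim", CIM)
xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```

The element content stays the same; only the attribute is added, so the question is whether existing parsers tolerate the extra attribute.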

when we import the CIM XML we will run the script above that will add the correct Datatype for the instance data?

It can be used in two ways:

With Jena update (in-memory SPARQL Update) to add datatypes to a file. I'll use that when producing TriG and JSON-LD.
In a semantic repo after loading CIM XML.

VladimirAlexiev commented 3 weeks ago

Reopening so you can decide whether it's unfeasible to use rdf:datatype in CIM XML.

Sveino commented 3 weeks ago

We cannot really do anything on this for CIMXML

Why not? If we do, it'll just add attribute rdf:datatype to every value. Will that break any software? Or do you mean that you cannot demand it in the spec?

We have to update the spec for CIM XML, which we are planning to do. But we do not have an approved specification for JSON-LD (however, we have indicated to the vendor how we are planning to do it).

when we import the CIM XML we will run the script above that will add the correct Datatype for the instance data?

It can be used in two ways:

With Jena update (in-memory SPARQL Update) to add datatypes to a file. I'll use that when producing TriG and JSON-LD.
In a semantic repo after loading CIM XML.

We can also run a script on the instance file to create a new instance file. At Statnett we also have code for exporting CIM XML from GraphDB.

Sveino / Inst4CIM-KG

Add Datatypes To Instance Data #49