VladimirAlexiev / soml

Semantic Objects Modeling Language

+startup: nonum

+options: ':nil *:t -:t ::t <:t H:5 \n:nil ^:{} anchor:t arch:headline author:t

+options: broken-links:nil c:nil creator:nil d:(not "LOGBOOK") date:t e:t email:nil f:t

+options: inline:t num:nil p:nil pri:nil prop:nil stat:t tags:t tasks:t tex:t

+options: timestamp:nil title:t toc:5 todo:t |:t

+title: Semantic Objects Modeling Language

+date: <2024-10-14>

+author: Vladimir Alexiev

+email: vladimir.alexiev@ontotext.com

+language: en

+select_tags: export

+exclude_tags: noexport

+creator: Emacs 26.1 (Org mode 9.2.2)

(You can read this [[https://github.com/VladimirAlexiev/soml/][rendered by github]] or [[https://rawgit2.com/VladimirAlexiev/soml/master/README.html][rendered as HTML]])


The Semantic Objects Modeling Language (SOML) is a simple YAML-based language for describing business objects (business entities, domain objects) that are handled using semantic technologies and GraphQL. SOML is the language of the Ontotext Platform.

The Ontotext Platform helps create knowledge graphs in an easier way, both using and enabling text analytics to interlink and enrich knowledge graphs, and enabling better search, exploration, classification, and recommendation across diverse information spaces.

Version 3.0 of the Platform was released at the end of 2019 and is under active development and evolution.

This blog post goes in depth on SOML: it motivates the use of SOML, describes some tricky points of SOML usage and GraphQL deficiencies, and introduces some SOML tooling for generating SOML (~tsv2soml~) and mapping RDF data (~soml-map~).

** Why SOML? :PROPERTIES: :CUSTOM_ID: why-soml :END:

We are often asked why we had to introduce yet another object modeling language. Why didn't we use existing semantic web mechanisms such as ontologies (RDFS or OWL) or shapes (SHACL or ShEx), or schema mechanisms such as GraphQL schema, JSON Schema and the like? The answer is multi-fold:

** Similar Schema Languages :PROPERTIES: :CUSTOM_ID: similar-schema-languages :END:

A number of schema languages have appeared recently that are based on YAML, express business-level object models that are somewhat independent of technological choices, and can render the models to a variety of schema technologies:

** What is SOML :PROPERTIES: :CUSTOM_ID: what-is-soml :END:

At present SOML is very simple, but will evolve to include more features. The overall structure of a SOML file (schema) is shown below.

+begin_src yaml

# comment
id:          /soml/<identifier>
label:       some name
created:     yyyy-mm-dd
updated:     yyyy-mm-dd
creator:     name and/or URL
versionInfo: version

# comment
specialPrefixes:
  base_iri:     <base>
  vocab_iri:    <vocab>
  vocab_prefix: <voc>
  ontology_iri: <ontology>
  shape_iri:    <shape>
prefixes:
  <pfx>:        <namespace>

# datatypes
types:
  <type>:       {rdf: <xsd-type>,    graphql: <GQL-type>, descr: "...", graphqlExtension: <boolean>}
  <union-type>: {union: [<type>...], graphql: <GQL-type>, descr: "..."}

# common property definitions
properties:
  <prop>:  {label: "...", descr: "...", range: <datatype|Obj>, rangeCheck: <boolean>, typeCast: <boolean>,
            kind: (object|literal|mixed), min: <default 0>, max: <default 1>,
            inverseAlias: <prop>, inverse: <prop>, rdfProp: pfx:prop, symmetric: <boolean>, regex: '<regex>', prefix: "<string>"}

# object class definitions
objects:
  <Obj>:  {label: "...", descr: "...", regex: '<regex>', prefix: "<string>",
           typeProp: <prop>, type: [<iri>...], name: <prop>, inherits: <Obj>, kind: (abstract|supertype)}
    props:
      <prop>: ...

+end_src

From this schema the Platform generates a complex GraphQL schema including a fairly complete querying language that allows you to find any kind of object, filter, order, navigate through the KG, and do pagination (limit, offset).

You can find details in the [[http://platform.ontotext.com/soml/index.html][SOML documentation]], while below we describe some tricky points of SOML usage and GraphQL deficiencies, and some tooling.

To introduce the proper context for this blog (working with complex SOML schemas), we'll describe the Ontotext Company Graph (ONTO CG) ontology and model. It's a medium-high complexity data model that reuses 14 ontologies and adds classes and props of its own. Of its 24 classes and 150 props, about half are reused and half are created especially for CG. It's a fairly typical data model for the kind of projects that Ontotext deals with.

Creating the ONTO CG knowledge graph is part of the [[https://www.ontotext.com/cima/][Intelligent Matching and Linking of Company Data (CIMA)]] research project. We are integrating data from open and a few proprietary datasets. The emphasis of the project is on financial transactions, industrial classification, company size/importance observations (e.g. annual sales, number of employees), etc.

The following table shows the count of classes and properties defined by the ONTO-CG ontology, as well as those reused from other ontologies.

+CAPTION: Ontology reuse and extension in Ontotext Company Graph.

| Prefix | Ontology                             | Classes | Props |
|--------+--------------------------------------+---------+-------|
| cg     | Ontotext Company Graph               |      12 |    70 |
| adms   | Asset Description Metadata Schema    |       1 |     1 |
| dcat   | Data Catalog Vocabulary              |         |     3 |
| dct    | Dublin Core Terms                    |         |     8 |
| ebg    | euBusinessGraph                      |       1 |    12 |
| gn     | GeoNames                             |       1 |     9 |
| locn   | W3C Location Ontology                |       1 |     8 |
| org    | W3C Core Organization Ontology       |       1 |     5 |
| qb     | W3C Cube Ontology                    |       1 |     1 |
| rov    | W3C Registered Organization          |       1 |     4 |
| schema | Schema.org                           |       3 |    12 |
| skos   | Simple Knowledge Organization System |       1 |     6 |
| time   | W3C Time Ontology                    |         |     2 |
| void   | Vocabulary of Interlinked Datasets   |       1 |     7 |
| wgs84  | World Geodetic Survey                |         |     2 |
|--------+--------------------------------------+---------+-------|
|        |                                      |      24 |   150 |

ONTO CG builds upon the results of the euBusinessGraph project. The euBusinessGraph semantic model and dataset covers the following (we have submitted a description of it to a prominent journal on semantic technologies):

ONTO-CG builds on the euBusinessGraph model and adds the following:

In addition to the above new classes, ONTO-CG adds:

The full CG schema is included: [[./schemas/CG.yaml][CG.yaml]]. Below we show a couple of typical examples.

** Example Class :PROPERTIES: :CUSTOM_ID: example-class :END:

+begin_src yaml

ExchangeListing:
  label: "Exchange Listing"
  inherits: Transaction
  type: [cg:ExchangeListing]
  descr: "Public offering (IPO, SPO etc) where the company receives money from the wide public, and as a result is listed for trading on an exchange"
  props:
    exchange:
      label: "exchange"
      range: StockExchange
      min: 1
      rdfProp: cg:exchange
      descr: "Stock exchange"
    stockSymbol:
      label: "stock symbol"
      range: string
      rdfProp: cg:stockSymbol
      descr: "Stock symbol (ticker). TODO: this should also be represented as an Identifier?"
    valuation:
      label: "valuation (MUSD)"
      range: decimal
      rdfProp: cg:valuation
      descr: "Company valuation at IPO in MUSD"
    valuationLocal:
      label: "valuation (M local currency)"
      range: decimal
      rdfProp: cg:valuationLocal
      descr: "Company valuation at IPO in millions of local currency"
    valuationCurrency:
      label: "valuation currency"
      range: string
      rdfProp: cg:valuationCurrency
      descr: "Currency code of the valuation"
    dateEnd:
      descr: "Date delisted or left this exchange"
    isCurrent:
      rdfProp: cg:isCurrent
      descr: "Whether the listing is still effective"

+end_src

If you look closely, you may wonder where the range and RDF mapping of ~dateEnd~ are defined. They are in the list of reusable properties:

+begin_src yaml

properties:
  # reused props
  dateEnd: {label: "dateCompleted", range: dateOrYearOrMonth, rdfProp: cg:dateEnd}

+end_src

A more appropriate ~descr~ is given at the object level, overriding the generic description.

** Example Inverse Alias :PROPERTIES: :CUSTOM_ID: example-inverse-alias :END:

A Position is an associative node between Person and Organization that adds more data (not shown):

+begin_src yaml

Position:
  label: "Position"
  inherits: BusinessObject
  type: [org:Membership]
  descr: "Position of a person in an organization, former or current"
  props:
    person:       {label: "person",       range: PersonCommon,       min: 1, rdfProp: org:member}
    organization: {label: "organization", range: OrganizationCommon, min: 1, rdfProp: org:organization}

+end_src

To allow navigation in any direction (not just from Position out, but also in), we add inverse aliases:

+begin_src yaml

PersonCommon:
  props:
    position: {label: "position", range: Position, inverseAlias: person}
OrganizationCommon:
  props:
    position: {label: "position", range: Position, inverseAlias: organization}

+end_src
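As a rough illustration of what an inverse alias means (not how the Platform implements it), the Python sketch below follows the ~person~ and ~organization~ links backwards over a few made-up Position records. The instance ids and the helper name ~follow_inverse_alias~ are hypothetical.

```python
# Hypothetical instance data: each Position links one person to one organization.
positions = [
    {"id": "position/1", "person": "person/alice", "organization": "org/acme"},
    {"id": "position/2", "person": "person/alice", "organization": "org/globex"},
    {"id": "position/3", "person": "person/bob", "organization": "org/acme"},
]


def follow_inverse_alias(objects, forward_prop, target_id):
    """Return the ids of objects whose forward_prop points at target_id.

    This mimics navigating PersonCommon.position (inverseAlias: person) or
    OrganizationCommon.position (inverseAlias: organization).
    """
    return [obj["id"] for obj in objects if obj.get(forward_prop) == target_id]


# PersonCommon.position: all Positions whose `person` is Alice
alice_positions = follow_inverse_alias(positions, "person", "person/alice")
# OrganizationCommon.position: all Positions whose `organization` is Acme
acme_positions = follow_inverse_alias(positions, "organization", "org/acme")
print(alice_positions, acme_positions)
```

The point of the alias is that no extra data is stored: the same forward link is simply read in the opposite direction.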

** Example Diagram: Exchange Listing SOML :PROPERTIES: :CUSTOM_ID: example-diagram-exchange-listing-soml :END:

For example, the figure below shows the stock exchange listing (IPO) of Apple on the Tokyo exchange and NASDAQ, and the listing of Nasdaq Inc (the company) on NASDAQ (the stock exchange).

[[./eg/model-exchange-listing.ttl]]

[[./eg/model-exchange-listing.png]]

** Example Diagram: Exchange Listing RDF :PROPERTIES: :CUSTOM_ID: example-diagram-exchange-listing-rdf :END:

This version of the diagram uses [[*soml-map][soml-map]] to map SOML names to RDF names in specific namespaces.

[[./eg/model-exchange-listing-mapped.ttl]]

[[./eg/model-exchange-listing-mapped.png]]

** owl2soml :PROPERTIES: :CUSTOM_ID: owl2soml :END:

This tool (written in Perl) generates SOML schemas from ontologies (that use RDFS, OWL and/or schema.org constructs). It handles numerous features and has been integrated in the Ontotext Platform (reimplemented in Java). See its own README.

** tsv2soml :PROPERTIES: :CUSTOM_ID: tsv2soml :END:

Editing large schemas is often easier to do in a table, even when the schema language is simple. (Also, this enables domain experts to participate in schema authoring, even if only editing the descriptions.)

The CG model was not written by hand; it was generated from a TSV (google sheet).

The sheet has 300 rows, and the generated SOML is 1176 lines. Here's the beginning of the sheet:

[[./eg/CG-sheet.png]]

Here is the end of the sheet, which exposes various thesauri (~ConceptSchemes~) as distinct business classes.

[[./eg/CG-sheet2.png]]

To generate a SOML schema from the google sheet [[https://docs.google.com/spreadsheets/d/1_-bn9Y-9rtysnvKiVus6BkFKXqHhiV4vCjYeiRmb6XU/edit#gid=0][CG-data-model]], call it like this:
: curl -s "https://docs.google.com/spreadsheets/d/1_-bn9Y-9rtysnvKiVus6BkFKXqHhiV4vCjYeiRmb6XU/export?format=tsv" | perl tsv2soml.pl | cat CG-preamble.yaml - > CG.yaml

Here [[./schemas/CG-preamble.yaml][CG-preamble.yaml]] is some fixed SOML metadata (a header).

Options:

Comment lines start with a hash (~#~) in the first column.

*** Reusing Property Characteristics :PROPERTIES: :CUSTOM_ID: reusing-property-characteristics :END:

Consider a SOML schema based on schema.org where we allow multiple ~sameAs~ values (e.g. the item's Wikipedia page, Wikidata entry, LinkedIn profile, YouTube profile, etc), and want the field to be mandatory for ~Organization~ but optional for ~Person~.

We write the details on the first occurrence and then just mention the prop on the second occurrence:

| Class/prop   | label   | Inherits/range | char             | descr                                                 |
|--------------+---------+----------------+------------------+-------------------------------------------------------|
| Organization |         |                |                  |                                                       |
| sameAs       | same as | iri            | min: 1, max: inf | URL that unambiguously indicates the thing's identity |
| Person       |         |                |                  |                                                       |
| sameAs       |         |                |                  |                                                       |

This results in a SOML like this:

+begin_src yaml

objects:
  Organization:
    props:
      sameAs:
        range: iri
        min: 1
        max: inf
        descr: URL that unambiguously indicates the thing's identity
  Person:
    props:
      sameAs:
properties:
  # reused props
  sameAs:
    range: iri
    min: 1
    max: inf
    descr: URL that unambiguously indicates the thing's identity

+end_src

The prop characteristics are copied from ~properties~ into ~Person.props~, which means that the default cardinality ~min: 0~ is overridden by ~min: 1~, which doesn't match the requirement "optional for ~Person~".

To fix this, you need to specify ~min: 0~ explicitly in the ~Person.sameAs~ table row.
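The layering described above can be sketched in a few lines of Python. This is only an illustration of the override order (SOML defaults, then the reusable ~properties~ entry, then the object-level ~props~ entry), not the Platform's actual code; the helper name ~resolve_prop~ is hypothetical.

```python
# SOML defaults for cardinality, as stated in the schema outline (min 0, max 1).
DEFAULTS = {"min": 0, "max": 1}


def resolve_prop(name, object_props, reusable_props):
    """Merge defaults, the reusable definition, and object-level overrides.

    Later layers win, so a bare mention of a prop inherits everything from
    the reusable definition, including its cardinality.
    """
    resolved = dict(DEFAULTS)
    resolved.update(reusable_props.get(name) or {})
    resolved.update(object_props.get(name) or {})
    return resolved


reusable = {"sameAs": {"range": "iri", "min": 1, "max": "inf"}}

person_as_generated = {"sameAs": None}   # bare mention: picks up min: 1
person_fixed = {"sameAs": {"min": 0}}    # explicit override in the Person row

generated = resolve_prop("sameAs", person_as_generated, reusable)
fixed = resolve_prop("sameAs", person_fixed, reusable)
print(generated["min"], fixed["min"])
```

With the bare mention the prop stays mandatory for ~Person~; only the explicit ~min: 0~ makes it optional while still reusing ~range~ and ~max~.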

** soml-map :PROPERTIES: :CUSTOM_ID: soml-map :END:

~tsv2soml~ writes out a file ~soml-map.tsv~ (see example [[./schemas/soml-map.tsv][soml-map.tsv]]) with columns "class, prop (optional), rdf".

It can be used to map from SOML names to RDF names (class/prop URLs) in specific namespaces. Eg compare [[Example Diagram: Exchange Listing SOML][Example Diagram: Exchange Listing SOML]] vs [[Example Diagram: Exchange Listing RDF][Example Diagram: Exchange Listing RDF]].

It can be used to map examples (models) or conversion scripts (TARQL, or SPARQL Update for OpenRefine) from a "logical" representation using uniform GraphQL names to a "physical" representation using specific RDF names.

Usage:
: perl soml-map.pl < file.(tarql|ru|ttl) > file-mapped.(tarql|ru|ttl)

** soml-simplify :PROPERTIES: :CUSTOM_ID: soml-simplify :END:

The purpose of this script is to simplify a SOML entity schema significantly, so it can be communicated more easily to an LLM for querying (NLQ to GraphQL).

Note: we need to preserve abstract classes (interfaces) because:

A draft script was made by GPT-4 following the specification below, then improved by me.

*** Example Source (SOML schema) :PROPERTIES: :CUSTOM_ID: example-source-soml-schema :END:

+begin_src yaml

objects:
  AccumulatorReset:
    descr: This command reset the counter value to zero
    inherits: ControlInterface
    label: AccumulatorReset
    props:
      accumulatorReset.AccumulatorValue: {}
    type: cim:AccumulatorReset
  ControlInterface:
    descr: Abstract superclass of Control
    inherits: IdentifiedObjectInterface
    kind: abstract
    props:
      control.PowerSystemResource: {}
properties:
  accumulatorReset.AccumulatorValue:
    descr: The accumulator value that is reset by the command
    inverseOf: accumulatorValue.AccumulatorReset
    kind: object
    label: AccumulatorValue
    max: 1
    min: 1
    range: AccumulatorValue
    rdfProp: cim:AccumulatorReset.AccumulatorValue
  control.PowerSystemResource:
    descr: 'The controller outputs used to...'
    inverseOf: powerSystemResource.Controls
    kind: object
    label: PowerSystemResource
    max: inf
    min: 0
    range: PowerSystemResourceInterface
    rdfProp: cim:Control.PowerSystemResource

+end_src

*** Example Target (Simplified) :PROPERTIES: :CUSTOM_ID: example-target-simplified :END:

+begin_src yaml

AccumulatorReset:
  ISA: ControlInterface
  accumulatorReset.AccumulatorValue: AccumulatorValue
ControlInterface:
  ISA: IdentifiedObjectInterface
  control.PowerSystemResource: [PowerSystemResourceInterface]

+end_src
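The transformation from the example source to the example target can be approximated in Python as follows. This is a minimal sketch inferred from the two examples only, not the actual ~soml-simplify~ script: it assumes that ~inherits~ becomes ~ISA~, that each prop is reduced to its ~range~, and that any prop whose ~max~ is not 1 is rendered as a list. The schema is written as a plain dict (with ~descr~ and ~label~ omitted) to avoid depending on a YAML library.

```python
# Trimmed copy of the example source as a plain dict.
schema = {
    "objects": {
        "AccumulatorReset": {
            "inherits": "ControlInterface",
            "props": {"accumulatorReset.AccumulatorValue": {}},
            "type": "cim:AccumulatorReset",
        },
        "ControlInterface": {
            "inherits": "IdentifiedObjectInterface",
            "kind": "abstract",
            "props": {"control.PowerSystemResource": {}},
        },
    },
    "properties": {
        "accumulatorReset.AccumulatorValue": {
            "range": "AccumulatorValue", "min": 1, "max": 1,
        },
        "control.PowerSystemResource": {
            "range": "PowerSystemResourceInterface", "min": 0, "max": "inf",
        },
    },
}


def simplify(schema):
    """Reduce a SOML-like dict to {Class: {ISA: parent, prop: range or [range]}}."""
    properties = schema.get("properties", {})
    result = {}
    for cls, definition in schema.get("objects", {}).items():
        entry = {}
        if "inherits" in definition:
            entry["ISA"] = definition["inherits"]
        for prop, local in (definition.get("props") or {}).items():
            # Object-level settings override the shared property definition.
            merged = {**properties.get(prop, {}), **(local or {})}
            rng = merged.get("range")
            # Assumption: any max other than 1 (e.g. inf) is shown as a list.
            entry[prop] = rng if merged.get("max", 1) == 1 else [rng]
        result[cls] = entry
    return result


simplified = simplify(schema)
print(simplified)
```

Abstract classes such as ~ControlInterface~ are kept in the output, in line with the note above about preserving interfaces.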

** tsv2owl :PROPERTIES: :CUSTOM_ID: tsv2owl :END:

This tool uses the same sheets that drive ~tsv2soml~ to generate an OWL ontology. It works like this:

*** Ontology Preamble :PROPERTIES: :CUSTOM_ID: ontology-preamble :END:

Use an ontology preamble, eg like this ~otkg-preamble.ttl~. Include all "system" prefixes that are shown after the newline (in particular, use ~s:~ instead of ~schema:~)!

+begin_src ttl

@prefix otkg: <https://kg.ontotext.com/resource/ontology/> .
@prefix s:    <http://schema.org/> .

@prefix dct:  <http://purl.org/dc/terms/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix s:    <http://schema.org/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

otkg: a owl:Ontology ;
  rdfs:label      "OTKG Ontology" ;
  dct:created     "2023-01-25"^^xsd:date ;
  dct:creator     <http://ontotext.com> ;
  owl:versionInfo "1.0" .

+end_src

*** tsv2owl Options :PROPERTIES: :CUSTOM_ID: tsv2owl-options :END:

It has these options.

*** tsv2owl Columns and Features :PROPERTIES: :CUSTOM_ID: tsv2owl-columns-and-features :END:

The tool uses the following sheet columns:

Features and limitations:

*** RDF Replacement :PROPERTIES: :CUSTOM_ID: rdf-replacement :END:

The column ~RDF~ is used when you need to specify something different from ~Class/name~. Eg if you have this in the SOML preamble:

+begin_src yaml

prefixes:
  vocab_iri:    http://schema.org/
  vocab_prefix: s

+end_src

The following tabular schema excerpt (leading dashes indicate the class hierarchy):

| Class/prop          | range/inherits | RDF                    |
|---------------------+----------------+------------------------|
| Event               |                |                        |
| -EventSeries        | Event          |                        |
| -EventParticipation | Event          | otkg:EventPartcipation |

generates ~s:Event~, ~s:EventSeries~ and ~otkg:EventPartcipation~ respectively.

Semantic Objects currently supports only abstract superclasses, so if we want to use all 3 classes with instance data, we need to add an abstract parent like this:

| Class/prop          | range/inherits | char           | RDF                    |
|---------------------+----------------+----------------+------------------------|
| EventCommon         |                | kind: abstract |                        |
| -Event              | EventCommon    |                |                        |
| -EventSeries        | EventCommon    |                |                        |
| -EventParticipation | EventCommon    |                | otkg:EventPartcipation |

It's quite common for one of the children (in this case ~Event~) not to have any props of its own, just to inherit the props of the parent.

This works fine for ~tsv2soml~, but ~tsv2owl~ would generate a parasitic (non-existent) RDF class ~s:EventCommon~. I thought of using the value ~RDF: none~ to signal that such a class should be omitted. But then I'd need to carry over the properties and parent of that parasitic class to one of its children (in this case ~Event~).

So instead, I add an extra column ~RDF replacement~ that indicates which RDF class is used instead of the parasitic class:

| Class/prop          | range/inherits | char           | RDF                    | RDF replacement |
|---------------------+----------------+----------------+------------------------+-----------------|
| EventCommon         |                | kind: abstract |                        | Event           |
| -Event              | EventCommon    |                |                        |                 |
| -EventSeries        | EventCommon    |                |                        |                 |
| -EventParticipation | EventCommon    |                | otkg:EventPartcipation |                 |

This replaces all references to ~EventCommon~ with ~Event~: domain (prop attachment), range (prop target), superclass (parent), subclasses (children).
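The effect of the replacement can be sketched in Python as a renaming step over the class hierarchy and the prop attachments. This is an illustrative approximation, not the logic of ~tsv2owl.pl~: the function name is hypothetical, the two props (~name~, ~partOf~) are made up for the example, and the special value ~none~ is not handled.

```python
def apply_replacement(classes, props, replacement):
    """Rewrite every reference to a replaced class with its replacement.

    classes: {class: parent or None}; props: {prop: {"domain", "range"}}.
    The replaced (parasitic) class disappears, and the replacing class
    takes over its parent, its props and its children.
    """
    def swap(name):
        return replacement.get(name, name)

    merged_classes = {}
    for cls, parent in classes.items():
        new_cls, new_parent = swap(cls), swap(parent)
        if new_cls == new_parent:
            continue  # the child that absorbed its abstract parent: skip self-loop
        merged_classes[new_cls] = new_parent

    merged_props = {
        prop: {"domain": swap(d["domain"]), "range": swap(d["range"])}
        for prop, d in props.items()
    }
    return merged_classes, merged_props


classes = {
    "EventCommon": None,
    "Event": "EventCommon",
    "EventSeries": "EventCommon",
    "EventParticipation": "EventCommon",
}
# Hypothetical props, only to show domain and range rewriting.
props = {
    "name": {"domain": "EventCommon", "range": "string"},
    "partOf": {"domain": "EventParticipation", "range": "EventCommon"},
}

new_classes, new_props = apply_replacement(classes, props, {"EventCommon": "Event"})
print(new_classes)
print(new_props)
```

After the rewrite ~Event~ is the root, ~EventSeries~ and ~EventParticipation~ are its subclasses, and props formerly attached to ~EventCommon~ are attached to ~Event~.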

Notes:

You can also use replacement on leaf-level classes. Consider the following example from OTKG (two leading dashes indicate the properties attached to the prev class):

| Class/prop       | range/inherits | char                            | RDF                 | RDF replacement |
|------------------+----------------+---------------------------------+---------------------+-----------------|
| Concept          |                | name: prefLabel, kind: abstract | skos:Concept        |                 |
| --prefLabel      | string         | min: 1                          | skos:prefLabel      |                 |
| --inScheme       | ConceptScheme  |                                 | skos:inScheme       |                 |
| -Audience        | Concept        | typeProp: inScheme              | OTKG:audience       | skos:Concept    |
| -ContentType     | Concept        | typeProp: inScheme              | OTKG:contentType    | skos:Concept    |
| --appliesToClass | iri            |                                 | otkg:appliesToClass |                 |
| PersonCommon     | Thing          | kind: abstract                  |                     | Person          |
| --jobTitle       | string         |                                 |                     |                 |
| --worksFor       | Organization   | min: 1                          |                     |                 |
| --sameAs         | iri            | max: inf                        |                     |                 |
| -Person          | PersonCommon   |                                 |                     |                 |
| -OntotextPerson  | PersonCommon   | typeProp: worksFor              | OTKG-agent:Ontotext | none            |
| --sameAs         | iri            | min: 1                          |                     |                 |

Several sub-classes have an additional type discriminator designated by ~typeProp~ (in addition to the standard ~rdf:type~):

*** Running tsv2owl :PROPERTIES: :CUSTOM_ID: running-tsv2owl :END:

You can run the tool with a ~Makefile~ like this (see [[./tsv2owl][tsv2owl]]), feeding from a Google Sheet:

+begin_src Makefile

ontology.ttl ::
	curl -Ls "https://docs.google.com/spreadsheets/d/.../export?format=tsv" | \
	  perl -S tsv2owl.pl -vocab s: -ontology otkg: | \
	  cat ontology-preamble.ttl - > ontology-unformatted.ttl
	riot --formatted=ttl ontology-unformatted.ttl | \
	  perl -00e '@a=<>; print sort @a' > ontology.ttl
	# rm ontology-unformatted.ttl # keep for debugging

+end_src
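The last pipeline stage, ~perl -00e '@a=<>; print sort @a'~, reads the formatted Turtle in paragraph mode and sorts the blank-line-separated blocks, presumably so that the output order is stable between runs. A rough Python equivalent is sketched below; the function name ~sort_blocks~ is hypothetical and edge cases (such as trailing whitespace inside blocks) may sort slightly differently than in Perl.

```python
import re


def sort_blocks(text):
    """Sort blank-line-separated blocks of text by plain string comparison."""
    # Perl's -00 treats runs of two or more newlines as the record separator.
    blocks = [b.strip("\n") for b in re.split(r"\n{2,}", text) if b.strip()]
    return "\n\n".join(sorted(blocks)) + "\n"


sample = (
    "s:b a owl:Class .\n"
    "\n"
    "@prefix s: <http://schema.org/> .\n"
    "\n"
    "s:a a owl:Class .\n"
)
sorted_sample = sort_blocks(sample)
print(sorted_sample, end="")
```

Because ~@~ sorts before letters, a prefix block stays at the top of the file, followed by the term definitions in alphabetical order.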

The tool attaches props to classes in two ways: ~s:domainIncludes~ and ~owl:Restriction~:

+begin_src turtle

s:legalName rdf:type owl:DatatypeProperty ;
  s:domainIncludes s:Organization ;
  s:rangeIncludes  xsd:string .

s:Organization rdf:type owl:Class , rdfs:Class ;
  rdfs:subClassOf [ rdf:type            owl:Restriction ;
                    owl:maxCardinality  1 ;
                    # owl:allValuesFrom xsd:string ;
                    owl:onProperty      s:legalName
                  ] .

+end_src

Metaphactory also needs the commented-out statement (~allValuesFrom~ pointing to the property range). To add that, use the SPARQL update ~tsv2owl-allValuesFrom.ru~. Replace ~riot~ with ~update~ (which is also part of Jena):

+begin_src Makefile

ontology.ttl ::
	curl -Ls "https://docs.google.com/spreadsheets/d/.../export?format=tsv" | \
	  perl -S tsv2owl.pl -vocab s: -ontology otkg: | \
	  cat ontology-preamble.ttl - > ontology-unformatted.ttl
	update --update=../bin/tsv2owl-allValuesFrom.ru --data=ontology-unformatted.ttl --dump | \
	  perl -00e '@a=<>; print sort @a' > ontology.ttl
	# rm ontology-unformatted.ttl # keep for debugging

+end_src

The tool emits ~s:domain/rangeIncludes~ (which are polymorphic, i.e. multi-valued). If you need to emit ~rdfs:domain/range~ (which are monomorphic, i.e. single-valued), use the SPARQL update ~tsv2owl-domain-range.ru~ that does this:

You can diff ~otkg.ttl~ with ~otkg-allValuesFrom.ttl~ and ~otkg-domain-range.ttl~ to see the difference.

** soml2puml :PROPERTIES: :CUSTOM_ID: soml2puml :END:

Generate nice PlantUML diagrams from SOML models. See its own README.

** GraphQL Type vs rdf:type :PROPERTIES: :CUSTOM_ID: graphql-type-vs-rdf-type :END:

** Single vs Multiple-Value Props :PROPERTIES: :CUSTOM_ID: single-vs-multiple-value-props :END:

** Inverse Aliases :PROPERTIES: :CUSTOM_ID: inverse-aliases :END:

** Literals :PROPERTIES: :CUSTOM_ID: literals :END:

(langString, union datatypes)

** Extended Pattern (Prefix + Regex) :PROPERTIES: :CUSTOM_ID: extended-pattern-prefix-regex :END:

** IRI Generation :PROPERTIES: :CUSTOM_ID: iri-generation :END:

** Schema Inclusion/Modularity :PROPERTIES: :CUSTOM_ID: schema-inclusion-modularity :END: