SEMICeu / style-guide

SEMIC style guide to create reusable vocabularies and application profiles
https://semiceu.github.io/style-guide/
Creative Commons Attribution 4.0 International

Conceptual model as single source of truth (chapters 2.1 and 7.1) #73

Open RiittaA opened 1 year ago

RiittaA commented 1 year ago

We think that UML is not a solid enough basis for conceptual modelling. We understand that UML has been chosen, e.g., because its graphical presentation may be easier for business users to comprehend. But this may result in problems in later phases of the process. Therefore, we suggest an approach where the concepts are expressed in OWL, and the tool visualises them automatically or semi-automatically. The visualisation can be presented in a UML-like notation. In short, the single source of truth should be a formal model from which the different representations are derived.

albertoabellagarcia commented 1 year ago

My experience is that JSON Schema covers 90% of the needs (100% for most users) with a very simple syntax, lots of software libraries that integrate it, and applications working on real scenarios for actual clients.

ioggstream commented 1 year ago

Fully agree with @RiittaA on UML.

> UML is not a solid enough basis for conceptual modelling

+1. Moreover, UML was built with a specific focus: OOP. OOP classes are not RDF classes.

> the concepts are expressed in OWL, and the tool visualises them automatically or semi-automatically.

+1. We cannot introduce OOP abstractions just to have diagrams.

Moreover, this is going to confuse people using UML for generating classes: they expect every single UML bit to be reflected in the actual running code!

ioggstream commented 1 year ago

@albertoabellagarcia

> JSON Schema covers 90% of the needs (100% for most users) with a very simple syntax, lots of software libraries that integrate it, and applications working on real scenarios for actual clients

I think JSON Schema is great for implementation, but not for conceptual models. The idea that Italy is working on is to use JSON Schema keywords to map properties to RDF subjects, using this specification: https://www.ietf.org/archive/id/draft-polli-restapi-ld-keywords-02.html

This makes it easy to adapt a conceptual model written in RDF/Turtle to real implementations based on OpenAPI and JSON Schema.
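To make the idea concrete, here is a minimal sketch of how such keywords could be used. The `x-jsonld-type` and `x-jsonld-context` keyword names come from the linked draft; the `Person` schema, its property names, the schema.org mapping and the `to_jsonld` helper are assumptions made up for illustration, and the helper is a simplification of the interpretation procedure in the draft (nested schemas are not handled).

```python
import json

# Hypothetical schema fragment. Only the two `x-jsonld-*` keyword names are
# taken from the draft; everything else is an illustrative assumption.
PERSON_SCHEMA = {
    "type": "object",
    "x-jsonld-type": "https://schema.org/Person",
    "x-jsonld-context": {
        "@vocab": "https://schema.org/",
        "given_name": "givenName",
        "family_name": "familyName",
    },
    "properties": {
        "given_name": {"type": "string"},
        "family_name": {"type": "string"},
    },
}


def to_jsonld(instance: dict, schema: dict) -> dict:
    """Attach the context and type declared in the schema to a plain JSON instance."""
    document = {}
    if "x-jsonld-context" in schema:
        document["@context"] = schema["x-jsonld-context"]
    if "x-jsonld-type" in schema:
        document["@type"] = schema["x-jsonld-type"]
    document.update(instance)  # the API payload itself stays unchanged
    return document


person = {"given_name": "Ada", "family_name": "Lovelace"}
print(json.dumps(to_jsonld(person, PERSON_SCHEMA), indent=2))
```

The API payload stays ordinary JSON; the semantics live in the schema and are only attached when a linked-data view is needed.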

bertvannuffelen commented 1 year ago

I think we should first agree on one thing in this topic, namely the separation between a technical implementation format and the representation of a semantic conceptual model. Many of the comments mix these: JSON has everything a developer needs, UML is a programming abstraction, RDF is the format of the Linked Data engineer, etc. For building systems, these technical discussions are important, but that is not the topic of the SEMIC style guide.

The goal is to create a data specification that is implementation agnostic, focuses only on the semantics and has the ability to connect with implementation choices. Unless you, as a system data engineer, are prepared to take a step back from your system context and discuss information structuring in a broad, system-agnostic way, you will end up in a discussion about mismatches or about misusing a representation.

This holds for any implementation context: e.g.

Each and every implementation context MUST define its implementation mapping. Sometimes the effort can be limited, sometimes this is extremely complex. This is a key premise when we are discussing a data specification according to the SEMIC style guide: every community should take a step back from the representations that are provided. Do not interpret them as ready-to-cook implementation languages but as a means to share a common semantic view.

Now coming back to UML. As stated in the motivation of the SEMIC style guide, the goal is to use a graphical representation language for the conceptual model, preferably one that is adopted by the business analysis community, so that it helps to bridge inter-human communication. Besides the occasional academic alternatives, I see UML being used everywhere: boxes for classes, lines for associations (object properties) and attributes (data properties). Instead of inventing a new graphical notation, the SEMIC style guide states: let's use the UML class diagram notation so that data specifications built by distinct organisations have a similar graphical representation. That will increase the common understanding. If everyone introduces their own legend for the graphical notation, then we miss that opportunity.

This has led us to provide a common guideline on how to exploit the UML class diagram notation to make a diagram that resonates with a semantic textual description. Because the latter is the final goal: to express a semantic data specification. It is not the objective to prepare an OOP system implementation.

Now you can argue about what the editorial environment should be: edit the semantic RDF-style representation and have the diagram derived from it, or start from the diagram and have the RDF derived from it.
This is an editorial choice, yet a very important one. Unfortunately, diagrams are powerful only when they are condensed and not overloaded. Since a semantic data specification is a document that expresses an agreement between humans, each part should be somehow human friendly. Diagrams with hundreds of classes and thousands of lines crossing each other are not accepted by humans. That is the motivation for choosing the diagram-to-RDF direction. In this direction automation is possible. In the other direction, the likelihood is high that one will create both representations independently, and thus synchronisation issues appear.
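As a rough illustration of why the diagram-to-RDF direction can be automated, the sketch below turns a tiny in-memory class-diagram model into Turtle declarations. It is not the SEMIC toolchain: the `UmlClass` structure, the `to_turtle` function and the `ex:` namespace are hypothetical, and how domains, ranges and cardinalities would be expressed (OWL axioms, SHACL shapes) is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UmlClass:
    """One box in the diagram (all names here are hypothetical)."""
    name: str
    attributes: List[str] = field(default_factory=list)    # become data properties
    associations: List[str] = field(default_factory=list)  # become object properties


def to_turtle(classes: List[UmlClass], prefix: str = "ex",
              namespace: str = "http://example.org/ns#") -> str:
    """Emit one class or property declaration per diagram element."""
    lines = [
        f"@prefix {prefix}: <{namespace}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "",
    ]
    seen = set()
    for cls in classes:
        lines.append(f"{prefix}:{cls.name} a owl:Class .")
        for kind, names in (("owl:DatatypeProperty", cls.attributes),
                            ("owl:ObjectProperty", cls.associations)):
            for name in names:
                if name not in seen:  # a property reused by several classes is declared once
                    seen.add(name)
                    lines.append(f"{prefix}:{name} a {kind} .")
    return "\n".join(lines)


model = [
    UmlClass("Dataset", attributes=["title"], associations=["contactPoint"]),
    UmlClass("Agent", attributes=["name"]),
]
print(to_turtle(model))
```

Because the RDF is generated, the diagram remains the only artefact that is edited by hand, which is what avoids the synchronisation issues described above.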

So the arguments pro or contra the UML class diagram notation should not be about its binding to OOP system implementations. Any "formal-ish" syntax in use will suffer from that. The argumentation pro and contra should be about which common graphical language we, as a community, would like to use to document our data specifications.

Note that this graphical notation discussion does not exclude the use of other visual representations.

ioggstream commented 1 year ago

Thanks @bertvannuffelen for your reply. I understand the practical goal you expect from using UML.

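IMHO "exploiting" a notation/specification does not scale

* It can work inside an organization.

* It may work for a closed ecosystem.

* It will fail at scale.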

For example, look at the (apparently trivial) work on interoperability between YAML, JSON, JSON Schema and JSON-LD here https://github.com/ietf-wg-httpapi/mediatypes: weeks of analysis with various implementers to avoid conflicts on the fragment identifier, the standardization of the JSON Schema media type is on hold until the YAML media type is published, the YAML-LD work was spun off to the YAML-LD... Long story short, when you "exploit" specs there's always more than meets the eye.

> The argumentation pro and contra should be about which common graphical language we, as a community, would like to use to document our data specifications.

Reading https://github.com/SEMICeu/style-guide/blame/c444c915841fff0befc8ccc335d0175aed9b1c12/docs/modules/ROOT/pages/arhitectural-clarifications.adoc#L80

> UML conceptual models can be used as the single source of truth

I understood the problem to be that UML was the language used for defining the models, not just for rendering them. I think the problem is the above sentence. Instead, it is OK to define:

> this graphical notation discussion does not exclude the use of other visual representations

I have no problem in using UML just for data visualization.

bertvannuffelen commented 1 year ago

> Thanks @bertvannuffelen for your reply. I understand the practical goal you expect from using UML.

> IMHO "exploiting" a notation/specification does not scale

> * It can work inside an organization.

> * It may work for a closed ecosystem.

> * It will fail at scale.

> For example, look at the (apparently trivial) work on interoperability between YAML, JSON, JSON Schema and JSON-LD here https://github.com/ietf-wg-httpapi/mediatypes: weeks of analysis with various implementers to avoid conflicts on the fragment identifier, the standardization of the JSON Schema media type is on hold until the YAML media type is published, the YAML-LD work was spun off to the YAML-LD... Long story short, when you "exploit" specs there's always more than meets the eye.

I am not sure what you want to argue here. But the complexity of aligning technical representations is out of scope. What is within (future) scope is that a semantic data specification should provide the anchors to make a YAML implementation connectable with a JSON implementation (i.e. the area of artefact generators).
For me the "implementation distance" from the semantic data specification to any implementation representation is roughly the same, whether it is implemented in XML, JSON, EDIFACT, RDF, Java, ... The same decisions always have to be made.

Technical formats and decisions are by definition limited to an ecosystem or organisation. The goal of this style guide is not to fix a single XSD schema that everyone has to use; it is to describe the semantics in such a way that profiling is transparently documented. So if I implement DCAT-AP in my country, I know the rules for profiling it further: e.g. I can enforce the need for a contact point with an email even if DCAT-AP does not specify that. Preferably I will do that in such a way that another country can interpret it, for instance by using vCard as the ontology. When I publish my DCAT-AP profile in my country, it should contribute to the DCAT ecosystem seamlessly, without the need to push my rules into the general DCAT. If my implementation is GeoNetwork XML based and another is CKAN JSON based, then although these formats/data structures are technically incompatible, the knowledge that both adhere to common semantics means that both implementations can produce the data in a commonly agreed technical format that expresses the same semantics. This common, system-agnostic technical format is often an RDF serialization. But that is actually a side effect of the choice to denote our terms unambiguously with dereferenceable URIs.
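A small sketch may help to show what "common semantics" buys here. Two hypothetical records, one shaped like a CKAN-style JSON document and one like an XML-derived structure, are mapped onto the same property URIs from Dublin Core Terms, DCAT and vCard. The record layouts, field names and helper functions are assumptions for illustration only and do not reproduce the actual CKAN or GeoNetwork data models; the profile check encodes only the example rule mentioned above (a contact point with an email).

```python
DCT_TITLE = "http://purl.org/dc/terms/title"
DCAT_CONTACT_POINT = "http://www.w3.org/ns/dcat#contactPoint"
VCARD_HAS_EMAIL = "http://www.w3.org/2006/vcard/ns#hasEmail"


def from_json_style(record: dict) -> dict:
    """Hypothetical CKAN-like JSON record -> URI-keyed common representation."""
    return {
        DCT_TITLE: record["title"],
        DCAT_CONTACT_POINT: {VCARD_HAS_EMAIL: record.get("maintainer_email")},
    }


def from_xml_style(record: dict) -> dict:
    """Hypothetical XML-derived record (already parsed into nested dicts)."""
    return {
        DCT_TITLE: record["citation"]["title"],
        DCAT_CONTACT_POINT: {VCARD_HAS_EMAIL: record["contact"].get("email")},
    }


def satisfies_national_profile(common: dict) -> bool:
    """Stricter national rule from the text: a contact point with an email is required."""
    contact = common.get(DCAT_CONTACT_POINT) or {}
    return bool(contact.get(VCARD_HAS_EMAIL))


json_record = {"title": "Air quality 2023", "maintainer_email": "mailto:data@example.org"}
xml_record = {
    "citation": {"title": "Air quality 2023"},
    "contact": {"email": "mailto:data@example.org"},
}

a = from_json_style(json_record)
b = from_xml_style(xml_record)
print(a == b)                         # both systems yield the same URI-keyed description
print(satisfies_national_profile(a))  # the national rule is checked on the common form
```

The mapping functions are the per-system "implementation mapping"; the profile rule is written once, against the shared URIs, and applies to both systems.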