CommonCoreOntology / CommonCoreOntologies

The Common Core Ontology Repository holds the current released version of the Common Core Ontology suite.
BSD 3-Clause "New" or "Revised" License

What serialization(s) will IEEE 3195.1 CCO be in? #202

Closed jimschoening1 closed 2 weeks ago

jimschoening1 commented 9 months ago

Brian Haugh asked: "There is also a question of whether to use Turtle, RDF/XML, or other serializations of the CCO for the reference version. I believe that the DIOWG guidance is leaning towards requiring RDF/XML for ontologies in its Foundry."

Background: When approved, IEEE 3195.1 CCO will be a printable document, but could also refer to a digital file (most likely stored as a given CCO version at https://opensource.ieee.org/cco). The paper document will be available for sale as normal, and the digital file will be posted for free download.

My Questions: What serializations will CCO be offered in? If more than one, are they all equivalent, or is one the standard and others derived? Will the printable document refer to the digital file? If so, is that acceptable for a standard? How did ISO/IEC-JTC1-21838-2 BFO handle this?

cameronmore commented 9 months ago

We discussed recently in an IEEE CCO subgroup meeting that we might want to mirror the BFO-ISO standard approach: the standard isn't the RDF file per se but a written document that outlines how portions of BFO satisfy the top-level requirements and points to GitHub as the site where it's developed and housed. We should do the same thing with CCO. The particular version of CCO we submit will be changed, updated, expanded, and modified for as long as it's used and contributed to, so we shouldn't think of there being an "IEEE Official Version". (EDIT: there can be an official IEEE version, but the one that lives in production and development will be the GitHub one, presuming we keep the development on this repo, which seems very wise.)

The standard can of course include the file, but the heart of the standard is the written explanation showing how the CCO satisfies mid-level ontology criteria. Moreover, the CCO is really a collection of modules, so the merged-all-core file is technically a convenience file more than a single grand ontology.

Theoretically, RDF, OWL, TTL should be equivalent, so it might not technically matter. However, as developers, we prefer TTL serialization far more than others because it's easier to read, easier to 'cut and paste,' easier to transfer into SPARQL and SHACL, and easier to teach people how to use/read. I personally advocate TTL all the way down, in DIOWG as well (I'm not sure how XML got suggested as the DIOWG standard).

I can gladly leave this issue open for a little while for others to read and offer thoughts. @mark-jensen @johnbeve

(of course, the above is just my opinion, but sounds like the direction the group is going in)

johnbeve commented 9 months ago

My Questions: What serializations will CCO be offered in? If more than one, are they all equivalent, or is one the standard and others derived?

RDF serializations are equivalent. Any one can be converted into any of the others using publicly available tools or library features such as rdflib (not endorsing, just indicating examples).

Reasons to prefer, say, RDF/XML over N-Triples or Turtle or JSON-LD will depend on implementation details, so neutrality is important. I'm partial to Turtle because I find it easy to read, but I've met many developers who prefer JSON-LD for the same reason.

These observations suggest though that we should provide users with or direct users to directions for re-serialization strategies.

Will the printable document refer to the digital file? If so, is that acceptable for a standard? How did ISO/IEC-JTC1-21838-2 BFO handle this?

ISO 21838 can be purchased in both printed and pdf+epub formats, but it is also downloadable for free assuming you agree to use the download to develop other ISO standards. As I understand, the content in each is the same.

jimschoening1 commented 9 months ago

I feel this issue is important enough to be fully vetted, presented to, and voted on by the Ontology Standards Working Group, so please leave this open until after a vote passes.

CCO has the attached approved Project Authorization Request (PAR): P3195.1 (1).pdf

5.3 states: "This standard must conform to IEEE P3195 Standard for Requirements for a Mid-Level Ontology." So, I agree CCO must show conformance.

5.2 states: "Scope of proposed standard: This standard defines a mid-level ontology that specifies a set of well-defined terms and relations commonly used across multiple domains..." This states we are standardizing a specific ontology. Let's say CCOv2.0 is what gets voted on and approved as a standard. v2.1 can follow and be released online, but it won't be an IEEE standard unless later approved as such.

I'm now starting to remember the unique arrangement Barry worked out with ISO, where they somehow delegated further development to his group of experts. We can explore that, but I doubt it applies to CCO and is not what our PAR authorizes. We would need to change the PAR to take that route.

@johnbeve I suggest you try drafting the paragraph that references the approved standard CCO, so we can show it to people with experience in how to do this. This might be simple, but I feel we should still get it reviewed sooner rather than later.

jimschoening1 commented 9 months ago

The emerging Verifiable Credential ecosystem is using JSON-LD. I see this ecosystem as needing P3195 ontologies and starting to use them as soon as the IEEE PURL server provides stable and trustworthy URLs. If we standardize on one serialization, it doesn't need to be JSON-LD, provided we also offer a version in JSON-LD. (So why don't we provide various versions now for CCO v1.4?) As for PURL responses, why not return multiple formats and allow the query to ask for a given version? This W3C note might help with this: 'Best Practice Recipes for Publishing RDF Vocabularies' https://www.w3.org/TR/swbp-vocab-pub/

jimschoening1 commented 9 months ago

Let's consider this solution: The 'normative' (i.e. what is a standard) wording in P3195.1 CCO (the document that can be printed) would state: "CCO ontology is standardized in the following equivalent serializations (Turtle Syntax, RDF/XML Syntax, OWL/XML Syntax, and JSON-LD) and available for download at ." We could also include 'informative' wording to explain how additional equivalent serializations can be derived.

swartik commented 9 months ago

@jimschoening1 Is OWL/XML used widely enough to justify its inclusion?

I ask this question because of the ambiguities in file name suffixes. For Turtle and JSON-LD, everyone uses .ttl and .json. For RDF/XML and OWL/XML, there doesn't seem to be much agreement. For RDF/XML, I've seen .rdf, .owl, .xml, and .rdfxml. I don't have enough experience with OWL/XML to know the most common suffixes, but I've seen .owl.

Dropping a little-used serialization would save some configuration management headaches.

johnbeve commented 9 months ago

CCO ontology is standardized in the following equivalent serializations (Turtle Syntax, RDF/XML Syntax, OWL/XML Syntax, and JSON-LD) and available for download at...

I don't think you should attempt to enumerate serialization formats, partly for the reason raised by @swartik. I suggest: "The CCO artifact is serialized in a standard RDF format at..."

We can then write brief guidance located in the README wherever "..." points, explaining that serializations are semantically equivalent and how users can re-serialize to another format if needed for their environment.

jimschoening1 commented 9 months ago

John, when drafting the PAR, we were severely criticized by an IEEE senior reviewer for not knowing the difference between an artifact and its serialization. I hear consensus that the core of this standard should be CCO.ttl. I propose we don't call it a serialization, because that begs the question, 'Serialization of what artifact, and shouldn't that artifact be the standard?' But there is no single artifact that could be the standard. So, could the standard state: "This document standardizes MergedAllCoreOntology-v1.4-2023-04-07.ttl as found at https://opensource.ieee.org/cco/CommonCoreOntologies/-/blob/master/cco-merged/MergedAllCoreOntology-v1.4-2023-04-07.ttl." (Replace with the correct version and PURL, and list all ontology modules if we decide to.)

Also, I'm working with the emerging Verifiable Credential community, which uses JSON-LD. Is there any reason why we could not offer CCO in multiple forms?

cameronmore commented 9 months ago

We could, and there are a number of tools that can convert from one to another. For example, if you have Python and the RDFLib package, it only takes a moment. This code converts it:

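from rdflib import Graph

g = Graph()

# Load the ontology from the Turtle file (format stated explicitly rather than guessed from the suffix)
g.parse("CCO.ttl", format="turtle")

# Write the same graph back out as JSON-LD (built in from rdflib 6.0; older versions need the rdflib-jsonld plugin)
g.serialize(destination="CCO.jsonld", format="json-ld")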

I attached a copy of the current merged file to this comment. Note that it's a .txt file, so to use it as JSON it needs to be renamed CCO.jsonld or CCO.json. See here: CCO.txt

jimschoening1 commented 9 months ago

But that could still be an obstacle for adoption. Some might not believe it is equivalent or part of the standard. So if there is no reason it couldn't be provided as a version of the standard, I propose it should. Also, for JSON-LD, I committed to do a demo to the Open Wallet Foundation of how a Verifiable Credential claim (for a term, e.g. height) includes a PURL in its JSON-LD code that returns a definition. As such, I'll need a version of CCO in JSON-LD, which I propose we load on our Gitlab repo. Plus our PURL Server will need to respond to queries in Turtle, JSON-LD, and any other form we include. Thoughts anyone?

mark-jensen commented 9 months ago

propose we don't call it a serialization, because that begs the question, 'Serialization of what artifact, and shouldn't that artifact be the standard?'

I agree. This concern about whether the specific files named in the standard are serialized in turtle or xml, or some other, is a false dilemma. We are standardizing the content of the 11 ontologies that comprise a particular named and dated release of CCO. It has to be verifiable OWL2 in some standard serialization. Is there a reason we need to provide more than one format for the standard? Where did that come from? In the ISO spec for BFO under section 4.3 "OWL 2 formalization of BFO-2020", it names two OWL versions, one in RDF (presumably this is XML) and the other in OWL functional syntax. I am not sure why they did this, maybe due to how BFO was built or some other requirement related to the FOL version or consistency checking. @alanruttenberg @johnbeve Do you recall why?

Is there any reason why we could not offer CCO in multiple forms?

If there is enough of a demand signal from users to provide alternate encodings, then of course we can.

Plus our PURL Server will need to respond to queries in Turtle, JSON-LD, and any other form we include.

Who decided that the PURL server has to negotiate gets for JSONLD? Or turtle for that matter? More importantly, what does that have to do with the format(s) of what gets submitted as the standard?

It is important to keep these three things separate:

  1. CCO as it exists in the repository. It will change as edits are made and new releases generated.
  2. The files that are returned via the IRIs. These are generated using those in no. 1, but nonetheless separate, and will always reflect the current release of CCO.
  3. The files that get named in P3195.1. These will (I assume) be stored somewhere else as an archive where they can be accessed independently of the repo. They will be copies of the ontology files at a certain point in the commit history of the repo, a snapshot of a version of CCO.

@jimschoening1 How does IEEE SA store and archive standards once accepted?

I'd like to note that the scope of this issue has crept and I think it would be wise to keep it limited to the specifics of what serialization(s) are named, if at all, in the standard 3195.1, as well as the language surrounding how they are described. Discussion about which formats are needed to be returned via the PURLs is a separate issue. Requests for providing CCO in alternate encodings for use-cases is another issue.

swartik commented 9 months ago

@cameronmore, are you suggesting your translation script be part of the CCO standard? I am wary of that. You would have to specify the versions of Python and rdflib. You would have to demonstrate that the script behaves equivalently across platforms and Python implementations. If I were an IEEE reviewer, I would look askance at anything less.

If you're just talking about what expert users can do, that's different. It still leaves open the question of what the standard should contain, and why.

@jimschoening1, can you educate us (me?) about serializations in standards? Is it common to include multiple serializations? What are popular serializations? What documentation describes the relationship between an artifact and its serialization? And what's the relationship between the standard and the gitlab server? Does the standard mention the gitlab server? And the PURL server, for that matter? Or the other way around? Or both? Do standards describe how to obtain serializations?

cameronmore commented 9 months ago

@swartik definitely not suggesting my script be part of the standard, just making it easier for someone to convert if they want.

johnbeve commented 9 months ago

John, When drafting the PAR, we were severely criticized by an IEEE senior reviewer for not knowing the difference between an artifact and its serialization.

Understood. I think we can display awareness of the difference without asserting a specific serialization.

I propose we don't call it a serialization, because that begs the question, 'Serialization of what artifact, and shouldn't that artifact be the standard?' But there is no single artifact that could be the standard.

Agreed. To be clear, my suggestion "The CCO artifact is serialized..." was meant to be compatible with the fact that:

  1. What is being standardized is (as @mark-jensen notes) the content of the CCO ontologies

While conveying that:

  1. The content of CCO is represented in some serialization format.
  2. The standard doesn't dictate what format.

Today the serialization might be TTL; in three years it might be RDF/XML.
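
The distinction between content and serialization can be illustrated with a toy example. This is only a sketch using the Python standard library: it treats each non-empty N-Triples line as one triple, a simplifying assumption that holds only for files without blank nodes and with identical whitespace inside each line. The sample IRIs and the `triples_from_ntriples` helper are hypothetical.

```python
def triples_from_ntriples(text: str) -> frozenset:
    """Return the set of triple lines, ignoring blank lines, comments, and order."""
    return frozenset(
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )


FILE_A = """\
<http://example.org/Height> <http://www.w3.org/2000/01/rdf-schema#label> "height"@en .
<http://example.org/Height> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .
"""

FILE_B = """\
# Same triples, different order and layout
<http://example.org/Height> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .

<http://example.org/Height> <http://www.w3.org/2000/01/rdf-schema#label> "height"@en .
"""

# The two files differ as text but carry the same set of triples.
same_bytes = FILE_A == FILE_B
same_content = triples_from_ntriples(FILE_A) == triples_from_ntriples(FILE_B)
```

Here the two files are different byte sequences yet yield the same set of triples, which is the sense in which the content, rather than any one file layout, is what gets standardized.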

I'd like to note that the scope of this issue has crept and I think it would be wise to keep it limited to the specifics of what serialization(s) are named, if at all, in the standard 3195.1, as well as the language surrounding how they are described.

Agreed.

jimschoening1 commented 9 months ago

I checked with Jonathn Goldberg, our IEEE Staff Advisor, who clarified our standards document will include a link to the ontology. He said, "Yes, the OS repository is linked in the draft pointing to the code. The code is not included in the draft." I conclude this also means we can easily point to multiple serializations and assert they are equivalent.

alanruttenberg commented 9 months ago

The normative exchange format for OWL is RDF/XML. As @johnbeve says all the formats are equivalent and are trivially generated so we should supply whatever formats anybody finds useful. We should not use content negotiation - the suffix of the file is used to distinguish versions. ".owl" is for the RDF/XML. ".ttl" for turtle, etc.

For BFO I included RDF/XML because that's the normative format and functional syntax because I thought that was a useful alternative view. Sometimes people save their edit version in functional syntax because it does better with diff. I also included a functional syntax-like version with the labels substituted for IRIs, again because I thought that might be useful. I haven't received requests for other formats but if I did I would generate them as well.

I'll note also that I initially distributed some spreadsheets in Excel format but was asked in an issue to provide the same in CSV format so that it would not require proprietary software to be read.

swartik commented 9 months ago

@alanruttenberg Do you happen to know who decided the suffix for RDF/XML should be ".owl" in BFO, and why? I've seen ".owl" used for OWL/XML, and I've seen ".rdf" used for RDF/XML. I've also seen ".xml" used for both – they are XML documents, after all. The lack of agreement causes me headaches from time to time when I open a file with the Windows default tool I've established.

alanruttenberg commented 9 months ago

If you look in the functional syntax spec you will see

This document also defines the functional-style syntax, which closely follows the structural specification and allows OWL 2 ontologies to be written in a compact form. This syntax is used in the definitions of the semantics of OWL 2 ontologies, the mappings from and into the RDF/XML exchange syntax, and the different profiles of OWL 2. Concrete syntaxes, such as the functional-style syntax, often provide features not found in the structural specification, such as a mechanism for abbreviating IRIs.

All the files mentioned in that document have the extension .owl. It also says:

It is recommended that OWL functional-style Syntax files have the extension .ofn (all lowercase) on all platforms.

The OWL/XML document says

It is recommended that OWL XML Serialization files have the extension .owx (all lowercase) on all platforms.

The Manchester syntax document says:

It is recommended that OWL Manchester Syntax files have the extension ".omn" (all lowercase) on all platforms.

If you save an ontology from Protege without giving a file extension, OWL/XML will get the extension .owx. If you save it as functional syntax, it will get the extension .ofn. If you save it in the Manchester syntax, you will get .omn.

Practically speaking, you can save all of these with ".owl" and RDF/XML with ".rdf". The OWL 1 reference says to use .owl or .rdf. However, when I see .rdf I am never sure whether it is OWL or generic RDF, so I don't use it and dislike it when others do for OWL, given that there are established conventions.

The RDF/XML document does say

It is recommended that RDF/XML files have the extension ".rdf" (all lowercase) on all platforms.

Protege will save RDF/XML with a default .rdf extension. However, this is not the convention generally used for OWL files, cf. the functional syntax document quote above.
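
The suffix recommendations quoted in this thread can be collected into a small lookup. A minimal sketch in Python follows. The function name `guess_serialization` is hypothetical, and treating `.owl` and `.xml` as ambiguous reflects the disagreement reported in this thread rather than any specification.

```python
from pathlib import PurePath
from typing import Optional

# Suffixes recommended by the documents quoted in this thread, plus the
# .ttl and .jsonld suffixes used by participants earlier in the discussion.
SUFFIX_TO_SERIALIZATION = {
    ".ofn": "OWL 2 functional-style syntax",
    ".owx": "OWL/XML",
    ".omn": "Manchester syntax",
    ".rdf": "RDF/XML",
    ".ttl": "Turtle",
    ".jsonld": "JSON-LD",
}

# Suffixes the thread reports seeing for more than one serialization.
AMBIGUOUS_SUFFIXES = {".owl", ".xml"}


def guess_serialization(filename: str) -> Optional[str]:
    """Return the serialization implied by the suffix, or None if it is unclear."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in AMBIGUOUS_SUFFIXES:
        return None
    return SUFFIX_TO_SERIALIZATION.get(suffix)
```

A `None` result is the case where a tool has to fall back on inspecting the file contents.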

The parsers I and Protege use try to figure out the file format based on the contents without really consulting the file extension.
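
Content-based detection of this kind can be approximated with a few prefix checks. The sketch below is a rough heuristic in Python and is not the algorithm used by Protege or any particular parser. The function name `sniff_serialization` and the marker strings it checks are assumptions chosen for illustration.

```python
def sniff_serialization(text: str) -> str:
    """Guess a serialization from the first non-whitespace characters of a file."""
    head = text.lstrip()[:4096]
    if head.startswith(("<?xml", "<!--", "<rdf:RDF", "<Ontology")):
        # XML family: tell RDF/XML and OWL/XML apart by their root element.
        if "<rdf:RDF" in head:
            return "RDF/XML"
        if "<Ontology" in head:
            return "OWL/XML"
        return "unknown XML"
    if head.startswith(("{", "[")):
        # Assumes any JSON document handed to this function is JSON-LD.
        return "JSON-LD"
    if head.startswith(("Prefix(", "Ontology(")):
        return "OWL 2 functional-style syntax"
    if head.startswith(("Prefix:", "Ontology:")):
        return "Manchester syntax"
    if head.lower().startswith(("@prefix", "@base", "prefix ", "base ")):
        return "Turtle"
    return "unknown"
```

A production parser would typically try candidate parsers in turn rather than rely on string prefixes; for instance, this sketch returns "unknown" for a Turtle file that opens with a comment or an IRI.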

I don't like content negotiation because it takes server configuration to work and because you can't be sure what you are going to get if you put the URL into a browser. I consider the different file formats to be distinct resources, though opinions differ.
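
The trade-off between the two publishing styles can be made concrete. The sketch below, in Python, is an illustration under stated assumptions and not the configuration of any real PURL server: the base URL `https://example.org/cco/CCO`, the suffix table, and the functions `resolve_by_suffix` and `negotiate` are all hypothetical. The media types are the registered ones for RDF/XML, Turtle, and JSON-LD.

```python
BASE_URL = "https://example.org/cco/CCO"  # hypothetical location

SUFFIX_TO_MEDIA_TYPE = {
    ".owl": "application/rdf+xml",
    ".ttl": "text/turtle",
    ".jsonld": "application/ld+json",
}


def resolve_by_suffix(suffix: str) -> str:
    """Suffix style: every format is a distinct resource with its own URL."""
    if suffix not in SUFFIX_TO_MEDIA_TYPE:
        raise ValueError(f"no file published with suffix {suffix!r}")
    return BASE_URL + suffix


def negotiate(accept_header: str, default: str = ".owl") -> str:
    """Negotiation style: one URL, format chosen from the Accept header."""
    media_to_suffix = {media: suffix for suffix, media in SUFFIX_TO_MEDIA_TYPE.items()}
    candidates = []
    for position, item in enumerate(accept_header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0].lower()
        quality = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        if media_type in media_to_suffix and quality > 0:
            # Highest quality wins; earlier position breaks ties.
            candidates.append((-quality, position, media_to_suffix[media_type]))
    return min(candidates)[2] if candidates else default
```

With suffixes, each URL always returns one format. With negotiation, a browser whose Accept header lists only HTML types and a wildcard matches nothing in the table and falls through to the server's default, which is the unpredictability described above.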

swartik commented 9 months ago

@alanruttenberg Thanks. I haven't looked at the OWL 1 specifications for a long time. Nor have I memorized all the OWL 2 documentation. It always helps when one of the authors contributes.

That said, I'm not convinced suffix usage is far enough along to qualify as having widely accepted, established conventions. I had a sponsor tell me his tool required an RDF/XML file to have a name ending with ".xml". I hope he's an outlier.

What version of Protege are you using? I'm running 5.5.0 on Windows. I just tried saving an ontology in RDF/XML, OWL/XML, and Turtle by entering a non-suffixed name into the file name box. In all three cases, Protege added ".owl" as the suffix.

alanruttenberg commented 9 months ago

5.6.1 on MacOS. Not sure which you are not convinced about. In any case, a quick check shows the OBOFoundry uses .owl and IOF uses .rdf for RDF/XML.

oliviahobai commented 2 weeks ago

As written here all serializations will contain equivalent content of the CCO.