3d Generation of NeXus ontology

sanbrock commented 2 years ago

NeXus Ontology

NeXus Base Classes and Application Definitions have been converted to an ontology, the NeXusOntology. While it preserves the relationships between the different NeXus definitions and their respective Fields, its current implementation does not represent all the rules expressed by the definitions written in NXDL. Our aim is to enhance this ontology and overcome the current limitations so that it can be used for automated data verification. Since several rules in NeXus definitions are expressed only for humans in docstrings and are not represented in NXDL in a machine-actionable way, the Ontology must also support expressing these rules.

Original Design Concepts

NeXus traditionally separates Base Classes and Application Definitions. The most important practical difference between them is the default optionality of the defined elements (“optional” for base_class definitions, but “required” inside application definitions). However, another concept, Base Class referencing (or reusability), has determined the evolution of NeXus definitions over decades. To support the easy development of ever more Application Definitions without any sophisticated ontology tool, it was important to keep the controlled vocabulary of Base Classes limited. Instead of creating subclasses, the practice was to specialise the referenced Base Classes in the Application Definitions, or to extend the original Base Classes with optional elements. Note that while this concept avoids the explicit use of class extensions in Base Classes (and so extends=NXobject is used exclusively), it results in definitions with an extremely wide meaning (or applicability) where almost everything must be optional. For example, NXtire could contain a type with an enum [summer, all-season, snow, studdable, studded] for cars, but also ISO, French and English designations for the size of bicycle tires, as well as a bool item tundra for light aircraft. Hence, such Base Classes are not suited for data verification, and so Application Definitions must reference and extend/specialise/override each data item explicitly to

set at least their cardinality (required, recommended, or optional; minOccurs= and/or maxOccurs=),
restrict/extend the expected enumeration set, or
further specialise the docstring.

Not considering cardinality, enumerations, and the specialisation of the docstrings, the current implementation of the NeXusOntology flattens the Application Definitions (by pushing any new data item definitions to the respective referenced Base Class) and only represents the Base Class references (type=) with a new ontology relationship, 'citesGroup'.

ENUMERATIONS AND DOCUMENTATION

Enumerations are heavily used in NeXus to define the set of expected values. They hold important information about the definition in place and provide restrictions for data validation and verification. As mentioned above, NeXus definitions inherit, and frequently also override, enumerations from superclass definitions, where the superclass is defined by extends= and type=. Docstrings are also frequently specialised by inheriting classes. Note that for convenience, doc strings are not overridden but extended/specialised by default, and any overriding doc string shall explicitly state if inherited doc strings shall not be considered. These overrides in NeXus rule out flattening and the simplified use of citesGroup relationships between definitions. Instead, the original class definitions must be kept (even inside Base Classes and Application Definitions where a new class is defined as a Group using the notation type=) and the IS_A relationship must be properly preserved, just as when the notation extends= is used, which explicitly declares a new subclass. A Group definition does not only result in a new subclass, but simultaneously declares a HAS_A relationship between the nesting class and the new Group definition.
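The inheritance behaviour described above (enumerations overridden by the nearest definition, doc strings extended by default) could be sketched roughly as follows. This is only an illustration under assumptions: the `Definition` class, the `replace_doc` flag and the `car_tire` subclass are hypothetical names and are not part of NXDL or the NeXusOntology.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Definition:
    """Hypothetical stand-in for a NeXus class definition in the ontology."""
    name: str
    doc: str = ""
    enumeration: Optional[List[str]] = None   # None means "inherit from the superclass"
    parent: Optional["Definition"] = None     # IS_A target (declared via extends= or type=)
    replace_doc: bool = False                 # assumed flag: doc states that inherited docs do not apply


def effective_enumeration(d: Optional[Definition]) -> Optional[List[str]]:
    """Return the nearest enumeration along the IS_A chain (override semantics)."""
    while d is not None:
        if d.enumeration is not None:
            return d.enumeration
        d = d.parent
    return None


def effective_doc(d: Optional[Definition]) -> str:
    """Concatenate doc strings from the most general to the most specific class."""
    parts = []
    while d is not None:
        if d.doc:
            parts.append(d.doc)
        if d.replace_doc:      # stop collecting inherited doc strings
            break
        d = d.parent
    return "\n".join(reversed(parts))


nxtire = Definition("NXtire", doc="A tire.",
                    enumeration=["summer", "all-season", "snow", "studdable", "studded"])
car_tire = Definition("car_tire", doc="Tire of a car.",
                      enumeration=["summer", "snow"], parent=nxtire)

print(effective_enumeration(car_tire))
print(effective_doc(car_tire))
```

Keeping each definition as its own node with a parent link is what makes this resolution possible; after flattening, the overridden enumeration of the superclass would no longer be distinguishable from that of the subclass.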

CARDINALITY

Tightly linked to a HAS_A relationship, NeXus definitions can specify cardinality by setting required=, recommended=, optional=, minOccurs= and/or maxOccurs=. Strictly speaking, setting recommended=true, optional=true, or minOccurs=0 changes a HAS_A relationship to a MAY_HAVE_A relationship. Similarly, multiplicity can be set by maxOccurs: setting it to a value greater than 1 changes a HAS_A relationship to HAS_AT_LEAST_ONE_OF and a MAY_HAVE_A relationship to MAY_HAVE_SEVERAL_OF. Setting minOccurs to a value greater than 1 changes the relationship to HAS_SEVERAL_OF. In the not so rare case in NeXus where a subclass definition does not extend or override the referenced superclass but only sets cardinality, the new subclass shall be registered in the ontology as a SYNONYM of the superclass. This solution is preferred over leaving the given subclass out of the ontology and using only its superclass instead, especially when a specific name is provided for the new subclass.
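The mapping from cardinality settings to relationship types described above can be written down as a small function. This is a minimal sketch, not an existing NeXus tool: the function name, the use of `None` for an unbounded maxOccurs, and the choice to let minOccurs > 1 take precedence over an optional/recommended flag are assumptions. The defaults correspond to the "required" default of application definitions.

```python
from typing import Optional


def relationship(min_occurs: int = 1,
                 max_occurs: Optional[int] = 1,
                 optional: bool = False,
                 recommended: bool = False) -> str:
    """Map NXDL-style cardinality settings to an ontology relationship name."""
    # recommended=true, optional=true or minOccurs=0 weaken HAS_A to MAY_HAVE_A
    may = optional or recommended or min_occurs == 0
    if min_occurs > 1:
        # assumption: an explicit minOccurs > 1 wins over optional/recommended
        return "HAS_SEVERAL_OF"
    if max_occurs is None or max_occurs > 1:   # None is used here for "unbounded"
        return "MAY_HAVE_SEVERAL_OF" if may else "HAS_AT_LEAST_ONE_OF"
    return "MAY_HAVE_A" if may else "HAS_A"


for kwargs in ({}, {"min_occurs": 0}, {"max_occurs": 3},
               {"optional": True, "max_occurs": None},
               {"min_occurs": 2, "max_occurs": 5}):
    print(kwargs, "->", relationship(**kwargs))
```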

ATTRIBUTES

Although NeXus Fields are preserved in the current implementation of the Ontology, Attributes are not captured. This is a serious weakness, because a lot of important metadata is stored in Attributes. These include standard ones, like @default (needed for the representation of the default plottable of a dataset), @depends_on and others required for NXtransformation, but also one of the most important pieces of information for any measurable, its unit, which is stored in the Attribute @units. Fields are represented as owl:ObjectProperties, which is a perfectly suitable solution for Attributes as well. As an Attribute in NeXus always belongs to either a Group or a Field, Attribute names in the Ontology can be prefixed by the name of their parent. Note that, contrary to Fields, NeXus Attributes cannot be annotated with unit categories in NXDL. Units, if needed, are stored in an additional attribute postfixed with ‘_units’, but this convention is not used to specify expected unit categories. As a result, Attribute properties cannot be automatically connected to unit categories. In some cases, like NXtransformation/AXISNAME@offset_units, the doc string provides indications of the expected unit category (like NX_LENGTH here).
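As a rough illustration of the naming idea above, an Attribute property name could be built from the name of its parent, and the unit categories that are only mentioned in doc strings could be kept in a manually curated table. The helper name and the table are hypothetical; the only entry is the example given above.

```python
from typing import Optional


def attribute_property_name(parent: str, attribute: str) -> str:
    """Prefix an Attribute name with the Group or Field it belongs to."""
    return f"{parent}@{attribute.lstrip('@')}"


# Manually harvested from doc strings, since NXDL cannot annotate
# Attributes with unit categories (hypothetical, hand-maintained table).
HARVESTED_UNIT_CATEGORIES = {
    "NXtransformation/AXISNAME@offset_units": "NX_LENGTH",
}


def unit_category_of(parent: str, attribute: str) -> Optional[str]:
    """Look up the harvested unit category of an Attribute, if any is known."""
    return HARVESTED_UNIT_CATEGORIES.get(attribute_property_name(parent, attribute))


print(attribute_property_name("NXtransformation/AXISNAME", "@depends_on"))
print(unit_category_of("NXtransformation/AXISNAME", "offset_units"))
```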

Rules stored in NeXus doc strings, like the above example for specifying the unit category of an Attribute, must be manually harvested and added to the Ontology. It is recommended to enhance NXDL to support registering such rules, so future versions of the NeXusOntology can be automatically generated from the definitions in NXDL.

sanbrock commented 2 years ago

https://protege.stanford.edu/publications/ontology_development/ontology101.pdf
https://arxiv.org/pdf/1902.08251.pdf

sanbrock commented 2 years ago

UNITS

NeXus defines several unit categories, like NX_LENGTH, NX_TIME, etc., but also complex ones, like NX_PER_LENGTH (e.g. 1/m) or NX_FLUX (e.g. 1/s/cm^2). Data instances shall come together with their metadata 'units' specifying which unit the values are provided in. Although NeXus does not provide tools for verifying whether the provided unit really belongs to the expected unit category, any proper data verification tool should include such checks. To support such checks, the NeXus definitions of the unit categories contain not only a doc string for humans, but also a child, examples, that lists one or more acceptable units and can serve as a basis for an automatic reasoner. Hence, units can be verified either by reasoners based on these examples, or by connecting NeXus unit categories to other ontologies which contain the appropriate reasoners.
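The simplest form of an examples-based check is a plain lookup, sketched below. The example lists are illustrative placeholders and are not copied from the NeXus definitions (only 1/m and 1/s/cm^2 are taken from the text above); the function name is hypothetical as well.

```python
# Unit category -> example units (placeholder values for illustration only)
UNIT_EXAMPLES = {
    "NX_LENGTH": {"m", "cm", "mm", "nm"},
    "NX_TIME": {"s", "ms"},
    "NX_PER_LENGTH": {"1/m"},
    "NX_FLUX": {"1/s/cm^2"},
}


def unit_matches_category(unit: str, category: str) -> bool:
    """Check a 'units' value against the example units of a unit category."""
    if category not in UNIT_EXAMPLES:
        raise ValueError(f"unknown unit category: {category}")
    return unit.strip() in UNIT_EXAMPLES[category]


print(unit_matches_category("nm", "NX_LENGTH"))
print(unit_matches_category("s", "NX_LENGTH"))
print(unit_matches_category("1/cm", "NX_PER_LENGTH"))
```

The last call shows the limit of this approach: 1/cm is a valid inverse length but is rejected because it is not literally listed. Accepting it needs dimensional reasoning, which is where a reasoner or a link to a dedicated unit ontology comes in.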

mkuehbach commented 2 years ago

https://dl.acm.org/doi/10.1145/3308560.3317707

With respect to the ideas and summary above (which capture well the current issues with the existing NeXus ontology and its rule set), it should be said that the motivation behind describing the status quo was to collect a set of features which can, or have to, be implemented first of all to make the NeXusOntology compliant with the rule sets that are already encoded in NXDL, implicitly through defaults and explicitly through e.g. existence statements.

Apart from this important implementation task, there are other points to consider when applying NeXus as a tool for standardizing data formats / data records within the FAIRmat project and NOMAD OASIS specifically. Here is my view on this, mainly adding to what was written above:

1.) Ontology development - Key question: What should the ontology be used for?

As a tool to verify if an instance of a data record (a file, a database entry) complies with a specific version of a standard. "It should not necessarily be the consuming scientific application only which is capable to tell users if a given NXS file matches but an independent tool, a verifier."
This verification should run automatically.

For the paper we have to be more specific with our terms; for the discussion in this issue, it suffices for now. E.g. what is a data record? "A data record is a collection of numerical data and metadata which are organized according to a schema." A NeXus file is a data record with a content that represents the instance of a schema. We have to define these terms so as not to cut the paper off from a general readership.

2.) A different use case for a (NeXus) ontology, one whose implementation feels "doable", is:

Use the ontology as a formal description to represent links across data records. This goes along the lines of Andrea Albino's talk at the MGI2022; based on feedback from area A, this is how many users would like to interact with their data.
Whatever decision is made in the FAIRmat project, the focus of the NeXus ontology development from our side should be along the lines of "how can it serve the NOMAD OASIS product/implement as a technical tool to support us in standardizing materials science experiment and simulation data records and verifying their consistency."

3.) From a graph point of view, an instance of the ontology should (optionally) have temporal pieces of information associated with each or selected vertices. Verification should also be able to include temporal inference.

To me, a fundamental question is whether the logic for such a verification is encoded in the ontology instance itself, or whether the ontology as an instance is only the graph with the elements (vertices) and the logical relations (edges). A simple example is, say, a base class that should serve as a container to store a set of instances of different geometric primitives. The base class would have several may_have relations. But what implements the rule set which defines which of these may_have relations (and how many of them) have to be present in an actual instance, if any? E.g. would a NeXusOntology need a has_a set_of_rules with has_a children such as arithmetic operations, existence, etc.?

How do we granularize the vertices of the graph? Is a vertex instance of the graph actually the class, say an NXtire, which is then a vertex that has some predefined (and mandatory) has_a relations like has_an_attribute, has_a_docstring, has_a_rule_set, has_an_existence, has_a_unit? Personally, I find this distinction between field and attribute artificial when what we consider an attribute is in fact again an instance of a class which has its own has_a, has_a_unit, has_a_docstring relations etc. What makes an NXobject a group in an application definition is the fact that it has children and a precursor; if it did not have both children and a precursor, it would be the root object of the graph.

CARDINALITY

Consider my concrete example from the TF meeting of how to handle cases of arithmetic operations on the cardinality within groups: NeXus makes it possible to demand a specific existence of individual instances of a class (optional, at least n times existent, or specifically n times required), but how can one then ensure that a specific combination of fields within a group is required, say two NXtriangles and one NXsphere?
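To make the question concrete, a combination rule of this kind could be checked outside the ontology along the following lines. This is only a sketch under assumptions: the function name is hypothetical, NXtriangle and NXsphere are just the example names used above, and the rule is read as "exactly two and exactly one".

```python
from collections import Counter
from typing import Dict, Iterable


def satisfies_combination(children: Iterable[str], required: Dict[str, int]) -> bool:
    """Check that a group contains exactly the required number of instances per class."""
    counts = Counter(children)   # missing classes count as 0
    return all(counts[cls] == n for cls, n in required.items())


rule = {"NXtriangle": 2, "NXsphere": 1}

print(satisfies_combination(["NXtriangle", "NXsphere", "NXtriangle"], rule))
print(satisfies_combination(["NXtriangle", "NXsphere"], rule))
```

Such a rule spans several children of one group at once, which per-child minOccurs/maxOccurs settings cannot express on their own when alternative combinations are allowed.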

ADDITIONAL POINTS

What I find missing right now in the ontology is a rule set for constraints on the technical implementation that is demanded when storing an instance of the ontology and data records stored using this ontology. E.g. it can be useful, if not necessary, to store the precision of values stored in vertices, e.g. min_precision, used_precision, value ranges, encoding and endianness. Otherwise the ontology would not be self-descriptive for a data record. That raises the general question: can the ontology represent an automatically verifiable logical essence of a data record which can be used as its own small information agent telling me about the entry, e.g. does an NXbeam exist, what is the cardinality of NXuser, is the value in NXwavelength > 0, what is the unit, what is the unit category of the wavelength field? These sorts of things.
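The kind of questions listed above could look as follows when asked of a data record that has been loaded into a flat list of vertices. Everything here is a hypothetical sketch: the `Node` structure, the paths, the values and the unit string are made up for illustration and do not come from a real NeXus file or from the NeXusOntology.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical vertex of a data record graph."""
    path: str
    nx_class: Optional[str] = None   # set for groups
    value: Optional[float] = None    # set for fields
    units: Optional[str] = None      # content of the @units attribute, if any


def count_class(nodes: List[Node], nx_class: str) -> int:
    """Cardinality of a class in the record."""
    return sum(1 for n in nodes if n.nx_class == nx_class)


def find(nodes: List[Node], path: str) -> Optional[Node]:
    """Return the vertex stored under a path, or None."""
    return next((n for n in nodes if n.path == path), None)


record = [
    Node("/entry/user_1", nx_class="NXuser"),
    Node("/entry/user_2", nx_class="NXuser"),
    Node("/entry/instrument/beam", nx_class="NXbeam"),
    Node("/entry/instrument/beam/wavelength", value=1.54, units="angstrom"),
]

wavelength = find(record, "/entry/instrument/beam/wavelength")
print("NXbeam exists:", count_class(record, "NXbeam") > 0)
print("cardinality of NXuser:", count_class(record, "NXuser"))
print("wavelength > 0:", wavelength is not None and wavelength.value > 0)
print("unit:", wavelength.units if wavelength else None)
```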

FAIRmat-NFDI / data-modeling

3d Generation of NeXus ontology #43